Monitoring is one of those fields where the temptation for tech ejac is pretty strong. You can collect data from everywhere, using not only pretty standard scripts that query the Linux kernel, but also home-brewed hardware that uses the likes of a Raspberry Pi to measure the water temperature in your aquarium. You can draw graphs, and even build your own parser for a minilanguage that lets you graph wr_sectors[sct/s] * 512[B/sct] / wr_ios[IO/s] and have the language infer that this calculation yields bytes per IO operation. You can do loads of fancy BI stuff using CrystalReports and friends to generate reports. Oh, do make sure you can handle a flaky connection without losing measurements. And while you're at it, cluster the whole thing, and be sure to have a system underneath it that can easily take the load.
All of this is actually pretty fun. That's why it is so tempting: We're nerds, so doing this stuff is just plain awesome. But once you get to the point where you've figured out how to measure things, there's a much tougher question waiting: What do you measure, and what for?
When you try to answer that question, it quickly becomes clear that, frankly, you don't have a clue, even with things that seem obvious. For example, just about every monitoring system is able to monitor the amount of RAM that's in use on a system, but what do you actually gain from that?
Wikipedia defines a system monitor as an entity that stores state information. That part holds true for what most, if not all, monitoring solutions do: They can tell you the state a system was in at a given time, which will be useful when debugging a system's configuration or looking for the cause of a failure. But is that all there is to it, or is there a way a monitor can be of use while nothing is wrong?
Wait a second: How do we even know whether or not something is wrong?
Of course, there are situations in which this is obvious. If the water in the aforementioned aquarium is boiling, your fish are gonna have a bad time. I don't know about you, but I would consider that "something is wrong". However, how do you know they're not having a bad time right now?
Well, last time the water was 23°C, they were doing fine, and the guy at the shop said that's the temperature it should be. They might be able to cope with one or two degrees above or below that, but that's about all the risk I'm going to take. So while it may be acceptable for the temperature to vary half a degree depending on the room temperature and time of year, I'm going to make sure it does not vary more than that. So if it does, I need to know, so I can do something about it.
All in all, this is a pretty well-defined scenario in which a monitoring system is easy to set up in a useful way. This is because there's a well-known metric and people are experienced when it comes to defining what's OK and what's not. While you may not have that experience yourself, it is readily available on the net and in stores. Translate that experience into numbers, and you can trust that your fish won't be cooked or frozen to death unnoticed.
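Translating that experience into numbers could look roughly like the following. This is a minimal sketch: the 23°C target and the half-degree drift come from the story above, while treating "one or two degrees" as a 2°C critical limit, and the names `TemperatureLimits` and `classify`, are my assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemperatureLimits:
    """Aquarium limits; the field names and the critical limit are assumptions."""
    target_c: float = 23.0      # what the guy at the shop recommended
    warn_delta_c: float = 0.5   # normal drift with room temperature and season
    crit_delta_c: float = 2.0   # assumed upper bound of what the fish can cope with


def classify(temp_c: float, limits: TemperatureLimits = TemperatureLimits()) -> str:
    """Map a single temperature reading to 'ok', 'warning' or 'critical'."""
    deviation = abs(temp_c - limits.target_c)
    if deviation > limits.crit_delta_c:
        return "critical"
    if deviation > limits.warn_delta_c:
        return "warning"
    return "ok"
```

A reading of 24°C would come back as a warning here, which is the "I need to know, so I can do something about it" case.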
Unfortunately, most of the time, it's not that simple — either because the available experience is severely limited, or the system is way too complex.
I'm building storage and virtualization clusters for a living. Those systems aren't exactly complex; the basic design hasn't changed all that much in recent years. What makes this hard is the fact that in order to run a multitude of virtual machines and guest operating systems with performance that deserves the name, you must not fuck anything up, because every single component is able to flush your system's performance down the drain.
That being said, monitoring such a system should be obvious. We know a set of fuckup criteria for each component, so let's check them, right?
Well, no. Every single one of those criteria is a configuration issue. Configure the system correctly, and it'll work; don't, and it won't. Nothing to monitor there. So then, what do you monitor? Right now, we're back where we started.
In a situation like this, you fall back to the defaults. Monitoring CPU and RAM usage as well as disk and network throughput is never a bad idea, so let's start there. But in order to provide a real benefit, you need to do something that's actually relevant.
Once again, there's an obvious answer: "I need to know when my system is down or else I'll lose money". Thing is, when a system that critical has an outage, then believe me, you will know. There will be emails sent and phone calls made. Your monitoring system's alerts are only adding to the noise in such a case. So what you're really looking for is a system that can reliably predict an outage before it happens, hopefully giving you enough time to prevent it or, at least, get on a plane to some far-away island in the Caribbean.
Basically, that's what warning thresholds are for. Those need to be set to a value low enough that the system can still handle the load without upsetting the users (and thereby making your phone ring), while being generous enough not to be triggered needlessly ten times a day. Systems that generate lots of false alarms will end up being ignored; those that don't alert when they should end up being utterly useless. And then there are times when the thresholds will be deliberately exceeded, e.g. while a backup process is running, where the normal thresholds don't apply anymore. (In fact, during such times, the threshold not being exceeded tends to indicate that the backup process is not running.) So unfortunately, just defining a simple threshold value ain't gonna cut it. But how do you define a threshold that works then?
The aquarium example shows that the number one thing you need is experience, which is precisely what you don't have when you're setting up monitoring of a freshly configured system. This means that you will need to continually validate and update those values as time passes. Not only is that time-consuming; a system that only lets you define a single threshold might not even allow you to configure it in a useful way, because there's no way to account for special circumstances like backup processes using up all the bandwidth.
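To make the backup case concrete, here is a minimal sketch of a threshold that knows about a scheduled window. It is not how any particular monitoring system does it; the `Window` class, the `check` function and the message strings are hypothetical names for illustration. During a window, the value is expected to be high, so a low value is what raises the warning.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Window:
    """A daily time window with its own expected range (hypothetical helper)."""
    start: time
    end: time
    min_value: float  # below this, the scheduled job is probably not running
    max_value: float  # above this, even the scheduled job is too much

    def contains(self, t: time) -> bool:
        if self.start <= self.end:
            return self.start <= t < self.end
        return t >= self.start or t < self.end  # window wraps past midnight


def check(value: float, when: datetime, default_max: float, windows: list) -> str:
    """Apply the first matching window's range, else the plain default threshold."""
    for window in windows:
        if window.contains(when.time()):
            if value < window.min_value:
                return "warning: below expected load"
            if value > window.max_value:
                return "warning: above limit"
            return "ok"
    return "warning: above limit" if value > default_max else "ok"
```

With a backup window from 01:00 to 03:00, a throughput of 100 would pass at 02:00 but warn at 14:00, and a throughput of 5 at 02:00 would warn because the backup apparently isn't running. The catch remains that every one of those numbers still has to come from experience.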
All of this is what motivated me to start working on Fluxmon. For starters, I got a math book about statistics, read it, understood some of it, and got to the point where I was quite confident that I had an idea of where to start. Then I discovered that RRDtool already features mechanisms to detect strange behaviour in close to real time.
This feature is actually pretty cool. RRDtool keeps track of the history of a measured value, and calculates its average and standard deviation. It then uses those values to predict a confidence interval in which the next value is expected to fall, and flags values that fall outside it. You can then use this data to raise alerts. The most awesome part is that the Holt-Winters algorithm used by RRDtool is able to take a seasonal coefficient into account, which is also recorded automatically. This way, you can even account for nightly backups, and you can leave the busywork of figuring out what's a failure to the system.
But even with a system like this, there is something left to be desired.
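The general idea can be sketched in a few lines. To be clear, this is not RRDtool's implementation and not full Holt-Winters (there is no trend term); it is a simplified stand-in that keeps an exponentially smoothed mean and deviation per seasonal slot. The class name `SeasonalBand` and all parameter values are assumptions for illustration.

```python
class SeasonalBand:
    """Per-slot smoothed mean and deviation with a simple expected band."""

    def __init__(self, period, alpha=0.3, width=2.0, warmup=3, min_dev=1.0):
        self.period = period    # number of slots per season, e.g. 24 hourly slots
        self.alpha = alpha      # smoothing factor: higher means faster learning
        self.width = width      # band half-width in multiples of the deviation
        self.warmup = warmup    # observations per slot before flagging anything
        self.min_dev = min_dev  # floor so a perfectly flat history still has a band
        self.mean = [0.0] * period
        self.dev = [0.0] * period
        self.seen = [0] * period

    def update(self, step, value):
        """Return True if value is outside the expected band, then learn from it."""
        slot = step % self.period
        if self.seen[slot] == 0:
            # First observation for this slot: nothing to compare against yet.
            self.mean[slot] = value
            self.seen[slot] = 1
            return False
        error = abs(value - self.mean[slot])
        band = self.width * max(self.dev[slot], self.min_dev)
        aberrant = self.seen[slot] >= self.warmup and error > band
        # Learn from the value regardless of whether it was flagged.
        self.dev[slot] += self.alpha * (error - self.dev[slot])
        self.mean[slot] += self.alpha * (value - self.mean[slot])
        self.seen[slot] += 1
        return aberrant
```

Fed with a week of hourly values that are high only during the 02:00 backup, this sketch accepts the nightly peak as normal, flags the same load at 14:00, and also flags a quiet 02:00 slot where the backup should have been running.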
First of all, such a system doesn't know what's important and what can be ignored. Everything is of the same importance.
Second, it doesn't alert "something is wrong", but instead "something is not the way it used to be". That means if you have something like a next-to-never-used backup storage volume lying around somewhere, the system will soon assume that the amount of expected activity on that volume is zero. Now someone comes along and accesses it. The system will see something that's not zero and send you an alert.
The same thing holds true for error conditions. If you ignore an error long enough instead of fixing it, eventually, Fluxmon will adjust itself to the error condition and treat it as the normal case. (So when you do fix the error, you will get alerts again.)
Third, Fluxmon's sensors are designed to monitor everything they can. This means that when you do have an outage, quite a lot of things will be different from usual, hence you get a multitude of alerts from one single physical event. Fluxmon treats them all as equally important, so figuring out what's actually wrong is left to the user.
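The second point is easy to reproduce with a toy model. A minimal sketch, assuming a plain exponentially weighted baseline rather than Fluxmon's or RRDtool's actual math; the function name `alerts_for` and the values of `alpha` and `tolerance` are made up for illustration.

```python
def alerts_for(values, alpha=0.3, tolerance=5.0):
    """Flag each value that deviates from a self-adjusting baseline.

    The baseline keeps learning from every value, flagged or not, which is
    exactly what makes it get used to a persistent error.
    """
    baseline = values[0]
    flags = []
    for value in values:
        flags.append(abs(value - baseline) > tolerance)
        baseline += alpha * (value - baseline)
    return flags


# Five healthy samples, twenty samples of an unfixed error, then the fix.
history = [0.0] * 5 + [50.0] * 20 + [0.0] * 10
flags = alerts_for(history)
```

In `flags`, the alerts start when the error appears, die down after a handful of samples even though nothing was fixed, and come back the moment the error is actually repaired.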
Addressing these issues is anything but easy.
Ad 1: Importance is a matter of context. Different people care about different things, and while I will want to know pretty much everything about my home boxes, I'm just the storage guy at work.
Ad 2: I'm not sure this can be fixed at all. If you want the system to be self-learning, you can't blame it for learning. One could add an alert acknowledgment system that causes the system to only learn when it's not currently alerting or something, but if you want to prevent it from learning, you'll have to prevent it from processing incoming values altogether, which means you get a hole in your graphs (and your math, actually). That's not exactly a better solution.
Ad 3: This is the classic problem of recognizing the wave while being the water. When everything around you starts moving, how do you know what's the cause and what's a symptom? Combined with the butterfly effect, the cause can be microscopic, so I'm not sure there even is a way to do that. At the very least, you'd have to know about a list of possible causes, check them, and see if you can find anything that fits. This list will never be complete, so there's no guarantee that this will actually work; but I guess this is one of those problems of which you'll only want to solve 80%.
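One way to picture the "list of possible causes" idea from Ad 3 is to describe each known cause by the symptoms it usually produces and rank causes by how well they fit the current alerts. This is only a sketch of one possible approach, not something Fluxmon does; `rank_causes` and every cause and sensor name in the example are hypothetical.

```python
from __future__ import annotations


def rank_causes(active_alerts: set[str], known_causes: dict[str, set[str]]) -> list[tuple[str, float]]:
    """Rank known causes by the fraction of their expected symptoms that are alerting."""
    ranked = []
    for cause, symptoms in known_causes.items():
        if not symptoms:
            continue  # a cause without known symptoms cannot be matched
        matched = len(symptoms & active_alerts)
        if matched:
            ranked.append((cause, matched / len(symptoms)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical catalogue of causes and the sensors they tend to upset.
known = {
    "switch port down": {"net_throughput", "iscsi_latency", "vm_io_wait"},
    "disk failing": {"disk_latency", "vm_io_wait"},
}
active = {"net_throughput", "iscsi_latency", "vm_io_wait", "cpu_idle"}
ranking = rank_causes(active, known)
```

Anything that isn't in the catalogue simply won't show up in the ranking, which is the "never complete" part of the problem.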
I'm going to keep working on Fluxmon and see where the journey takes me. I do have a couple ideas in mind that I want to try, so there's still some progress to be made. Honestly, I have no idea what to expect — some of the problems I discussed here are of a pretty fundamental nature, and I'm not sure they can be solved by just technical measures (if at all). Still, I might end up building a tool that makes life easier, which would be a pretty cool thing to do.