Fun with ntiles

In the last couple of posts, I extracted quite a load of data from Fluxmon's PostgreSQL database. At the end of the day, I'd like to see whether or not the thing we're measuring looks broken. How can we get to that point?

To be frank, I only have a very vague idea of how to do that myself. It involves binomial and normal distributions, confidence intervals, standard deviations and stuff like that, none of which I have fully understood yet.

http://i0.kym-cdn.com/photos/images/original/000/234/739/fa5.jpg

That being said, I thought I'd first check out what this guy said: that one can actually estimate a confidence interval simply by calculating \(\mu \pm \sqrt{n}\). With \(\mu\) being the average of our recorded values and \(n\) being the theoretical maximum I calculated a few posts ago, this sounds almost too easy. For a nicer square root, let's just assume \(n = 1600\), because then \(\sqrt{n} = 40\). Postgres said the average is 302, so the 95% confidence interval would range from 262 to 342.

Similarly, we can calculate one for 68% using \(\mu \pm \frac{1}{2}\sqrt{n}\), or \(302 \pm 20\), which is 282 to 322.
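Both rules of thumb are simple enough to evaluate right inside Postgres, next to the data. Here's a minimal sketch of what that could look like; note that measurements is a hypothetical stand-in for Fluxmon's actual table name, and check_id = 287 just picks the check we're looking at here:

SELECT avg(value) - sqrt(1600)     AS lo_95,  -- µ -  √n
       avg(value) + sqrt(1600)     AS hi_95,  -- µ +  √n
       avg(value) - sqrt(1600) / 2 AS lo_68,  -- µ - ½√n
       avg(value) + sqrt(1600) / 2 AS hi_68   -- µ + ½√n
  FROM measurements                           -- hypothetical table name
 WHERE check_id = 287;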

Almost too easy. I just have no idea if it means anything.

Luckily, the guy also said something about the standard deviation, namely that if you know it, you can also use it to deduce a confidence interval. So let's do that and compare results, shall we?

>>> mu    = 302.109410805295
>>> sigma =  83.1838532450487
>>> # z-scores k for the 80%, 90%, 95%, 99% and 99.7% confidence levels
>>> conflevels = (1.28, 1.64, 1.96, 2.58, 3)
>>>
>>> for k in conflevels:
...     print(k, mu - k * sigma, mu + k * sigma)
...
1.28    195.634078652     408.584742959
1.64    165.687891483     438.530930127
1.96    139.069058445     465.149763166
2.58     87.4950694331    516.723752178
3        52.5578510701    551.66097054

The mu and sigma values stem from PostgreSQL; the confidence levels are blindly copy-pasted from here. I've taken the liberty of adding a few spaces to the output that Python had not put there, to make it a bit more readable.
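Getting those two numbers out of Postgres is a one-liner, by the way. A sketch, again with the hypothetical measurements table standing in for the real schema:

SELECT avg(value)    AS mu,
       stddev(value) AS sigma   -- sample standard deviation
  FROM measurements             -- hypothetical table name
 WHERE check_id = 287;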

Looking at these numbers, the first thing I noticed was that those intervals are huge. Then I wondered whether any of the values we've actually measured has ever fallen outside these limits, so hey postgres, gimme the min and max please, willya?
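Spelled out as SQL, that request could look roughly like this (once more with the hypothetical table name):

SELECT check_id, min(value), max(value)
  FROM measurements   -- hypothetical table name
 WHERE check_id = 287
 GROUP BY check_id;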

 check_id |       min        |       max
----------+------------------+------------------
      287 | 229.136365378731 | 1281.18419789548
(1 row)

The minimum value is well inside all of those ranges, but the max sure is not. So, putting it all into a nice little table:

 int(value)  |  Percentile  |  Comment
-------------+--------------+----------------------------
        52   |              |  k = 3    (99.7%)
        87   |              |  k = 2.58 (99.0%)
       139   |              |  k = 1.96 (95%)
       165   |              |  k = 1.64 (90%)
       195   |              |  k = 1.28 (80%)
       229   |     1%l      |  Measured Minimum
       262   |    15%l      |  µ -  √n  (95%)
       282   |    44%l      |  µ - ½√n  (68%)
       286   |    50%l      |  Median
       302   |    72%l      |  Average (µ, Arithm. Mean)
       322   |    87%l      |  µ + ½√n  (68%)
       342   |    92%l      |  µ +  √n  (95%)
       408   |    96%l      |  k = 1.28 (80%)
       438   |    97%l      |  k = 1.64 (90%)
       465   |    97%l      |  k = 1.96 (95%)
       516   |    98%l      |  k = 2.58 (99.0%)
       551   |    98%l      |  k = 3    (99.7%)
      1281   |   100%l      |  Measured Maximum

(I just invented the %l symbol. % seemed wrong, but I wanted to have some symbol, so there.)
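The percentile column is where the ntiles from the title come in: Postgres's ntile() window function splits the ordered values into a given number of equally sized buckets, so ntile(100) effectively assigns each measurement its percentile. A sketch of how looking up a single value could work, once more against the hypothetical table (the real query may well have looked different):

SELECT max(pctl) AS percentile
  FROM (SELECT value,
               ntile(100) OVER (ORDER BY value) AS pctl   -- 100 buckets = percentiles
          FROM measurements                               -- hypothetical table name
         WHERE check_id = 287) AS ranked
 WHERE value <= 302;   -- e.g. the average from above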

Now I'm wondering what this tells me.

First of all, none of the values we measured in this data set belongs to an actual failure, but well, the math can't know that. So whatever alert we raise based on this information is a false positive, at least on the plus side. For instance, using the \(\mu + \sqrt{n}\) boundary, we'd raise an alert in the top eight percentiles, which suggests we'd have an eight percent rate of false positives using this boundary.
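That suspicion is easy to double-check against the data itself; a sketch, reusing the hypothetical table and the 342 boundary from above:

SELECT 100.0 * count(CASE WHEN value > 342 THEN 1 END) / count(*)
           AS pct_above_boundary   -- share of samples beyond µ + √n
  FROM measurements                -- hypothetical table name
 WHERE check_id = 287;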

On the other hand, values below 229 have never ever occurred (no downtime has happened yet). So if the value now drops to zero, that's definitely something I'd like to know about, and it is well outside any of the confidence intervals, so they all would raise a valid alert in this case.

But this only seems to work up to the point where the system has been broken long enough to affect the boundaries themselves, so it might be a good idea to exclude such values from the boundary calculations. But wouldn't that make the system blind to changes, unable to adapt? And can we even use the \(\mu \pm \sqrt{n}\) technique, which seems to produce the smaller boundaries? Are there cases where that solution is invalid? Are there cases where its results are worse than those of the \(\mu \pm k \sigma\) method? Is there a certain minimum number of records we need for \(\mu\) and \(\sigma\) to even be usable? RRDtool seems to gather a minimum of data before producing any results in its HWPREDICT algorithm. Why is that so? How big is that minimum? How much less reliable does all this become if we don't have that much data?