Nifty queries over time-series data using PostgreSQL
I'm currently looking into loads of maths stuff for FluxMon, and I've eventually reached the point where I'd like to try a few things. By now I've collected quite a nice amount of data, so I have something to work with. But how to store and analyze it?
At the Chemnitz Linux Days, I attended a talk by Hans-Jürgen Schönig named PostgreSQL: Daten superschnell analysieren ("analyzing data at super speed"), so I wanted to get the data into PostgreSQL and try out some of the fancy stuff he showed. Googling for postgres time series, I stumbled upon this blog post showing a couple of nice things too, and since the schema used there is radically simple, I just went ahead.
First of all, I created a table that links into FluxMon's check and variable tables and allows me to store results:
CREATE TABLE "monitoring_checkmeasurement" ( "id" serial NOT NULL PRIMARY KEY, "check_id" integer NOT NULL REFERENCES "monitoring_check" ("id") DEFERRABLE INITIALLY DEFERRED, "variable_id" integer NOT NULL REFERENCES "monitoring_sensorvariable" ("id") DEFERRABLE INITIALLY DEFERRED, "measured_at" timestamp with time zone NOT NULL, "value" double precision NOT NULL );
Absolutely no magic there. Then I imported values from my RRDs using a Python script to parse the output of rrdtool fetch, extended FluxMon's result processing to record results in that table from now on, and by the time I'm writing this, the table has 187373 entries, so I can try out some stuff.
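The import script itself isn't shown here; conceptually, each parsed data point ends up as one row, so the statements it issues look roughly like this (the variable_id of 12 is a made-up example for illustration):

-- one row per parsed data point; variable_id 12 is hypothetical and
-- would in reality reference the monitoring_sensorvariable row for
-- the metric in question (e.g. wr_ios)
INSERT INTO monitoring_checkmeasurement (check_id, variable_id, measured_at, value)
VALUES (287, 12, '2015-08-16 14:25:03+02', 268.662597618636);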
First of all, I'm actually astonished that even on this small VM, the performance footprint is hardly measurable (FluxMon reports the average IOPS on the server have increased by 2 — two — which means the box is now only slightly less idle), and all the queries I'm running return without me noticing any processing time whatsoever. I've never regretted moving this system to a netcup vserver because the performance is just plain awesome.
But the real fun starts once we begin actually working with the data.
First of all, this is what some of our data looks like (I joined the name column in from the monitoring_sensorvariable table and filtered for a single check):
 check_id |      measured_at       |    name     |       value
----------+------------------------+-------------+--------------------
      287 | 2015-08-16 14:25:03+02 | wr_ticks    |  0.171345577611925
      287 | 2015-08-16 14:25:03+02 | rd_merges   |                  0
      287 | 2015-08-16 14:25:03+02 | wr_merges   |                  0
      287 | 2015-08-16 14:25:03+02 | rd_ticks    | 0.0232418721651974
      287 | 2015-08-16 14:25:03+02 | rq_ticks    |  0.194627384265379
      287 | 2015-08-16 14:25:03+02 | wr_sectors  |   29445.2423249549
      287 | 2015-08-16 14:25:03+02 | ios_in_prog |                  0
      287 | 2015-08-16 14:25:03+02 | wr_ios      |   268.662597618636
      287 | 2015-08-16 14:25:03+02 | rd_sectors  |   145.015438354903
      287 | 2015-08-16 14:25:03+02 | rd_ios      |   2.39606929538118
      287 | 2015-08-16 14:25:03+02 | tot_ticks   |  0.113347389167615
      287 | 2015-08-16 14:30:02+02 | wr_ticks    |  0.173318147797764
      287 | 2015-08-16 14:30:02+02 | rd_merges   |                  0
      287 | 2015-08-16 14:30:02+02 | wr_merges   |                  0
      287 | 2015-08-16 14:30:02+02 | rd_ticks    | 0.0142595933940384
      287 | 2015-08-16 14:30:02+02 | rq_ticks    |  0.187737961342297
      287 | 2015-08-16 14:30:02+02 | wr_sectors  |   27165.3231784757
      287 | 2015-08-16 14:30:02+02 | ios_in_prog |                  2
      287 | 2015-08-16 14:30:02+02 | wr_ios      |   249.696428706562
      287 | 2015-08-16 14:30:02+02 | rd_sectors  |   88.2278962058103
      287 | 2015-08-16 14:30:02+02 | rd_ios      |   1.22167864752295
      287 | 2015-08-16 14:30:02+02 | tot_ticks   |  0.137322020653263
      287 | 2015-08-16 14:35:03+02 | wr_ticks    |  0.146852936089821
      287 | 2015-08-16 14:35:03+02 | rd_merges   |                  0
      287 | 2015-08-16 14:35:03+02 | wr_merges   |                  0
      287 | 2015-08-16 14:35:03+02 | rd_ticks    | 0.0101310021180631
      287 | 2015-08-16 14:35:03+02 | rq_ticks    |  0.157063814702902
      287 | 2015-08-16 14:35:03+02 | wr_sectors  |   29537.1896174472
      287 | 2015-08-16 14:35:03+02 | ios_in_prog |                  0
      287 | 2015-08-16 14:35:03+02 | wr_ios      |   273.367319635791
      287 | 2015-08-16 14:35:03+02 | rd_sectors  |   72.9272399510508
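For reference, the listing above comes from a query along these lines:

-- sample data: join in the metric name, filter for a single check
select cm.check_id, cm.measured_at, sv.name, cm.value
  from monitoring_checkmeasurement cm
 inner join monitoring_sensorvariable sv on cm.variable_id = sv.id
 where cm.check_id = 287
 order by cm.measured_at;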
For starters, I had recently wondered how many IOPS this system would be capable of at 100% load. (Of course this is only theoretical, because it will never actually reach 100%, but just for the fun of it.) This should be easy to calculate by dividing wr_ios by wr_ticks: the latter is the fraction of each second the device spent busy writing, so the quotient extrapolates to the write IOPS that would be needed to yank wr_ticks up to 1.0. Taking the first sample above, 268.66 write IOs at a busy fraction of 0.1713 works out to roughly 1568 IOPS at full load. But in order to do that calculation, we first need to somehow extract wr_ios and wr_ticks, and preferably the two values that share the same measured_at timestamp.
Well, Postgres can do that pretty easily using the filter syntax for aggregate functions:
select cm.check_id,
       cm.measured_at,
       avg(cm.value) filter (where sv.name = 'wr_ios')   as wr_ios,
       avg(cm.value) filter (where sv.name = 'wr_ticks') as wr_ticks
  from monitoring_checkmeasurement cm
 inner join monitoring_sensorvariable sv on variable_id = sv.id
 where cm.check_id = 287
 group by cm.check_id, cm.measured_at
 order by cm.measured_at;
This statement works by first grouping all the values by the check they belong to and the time at which they were measured. Then I use the avg aggregate function to calculate the average measured value, not because I need the average (there's only a single value per metric in each grouped time period anyway), but because it allows me to use the filter keyword to tell Postgres which value instances I'm interested in: namely, the one in the row that belongs to the wr_ios metric. Then I do the same thing with wr_ticks.
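By the way, filter is a fairly recent addition (it arrived in PostgreSQL 9.4). On older versions, the same pivoting can be done with the classic case trick, since aggregate functions simply ignore NULLs; a sketch of the equivalent query:

-- equivalent to the filter variant above, for PostgreSQL < 9.4:
-- rows not matching the condition yield NULL, which avg() skips
select cm.check_id,
       cm.measured_at,
       avg(case when sv.name = 'wr_ios'   then cm.value end) as wr_ios,
       avg(case when sv.name = 'wr_ticks' then cm.value end) as wr_ticks
  from monitoring_checkmeasurement cm
 inner join monitoring_sensorvariable sv on variable_id = sv.id
 where cm.check_id = 287
 group by cm.check_id, cm.measured_at
 order by cm.measured_at;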
This yields the following result:
 check_id |      measured_at       |      wr_ios      |     wr_ticks
----------+------------------------+------------------+-------------------
      287 | 2015-08-16 14:25:03+02 | 268.662597618636 | 0.171345577611925
      287 | 2015-08-16 14:30:02+02 | 249.696428706562 | 0.173318147797764
      287 | 2015-08-16 14:35:03+02 | 273.367319635791 | 0.146852936089821
      287 | 2015-08-16 14:40:02+02 | 253.674634889183 | 0.203743783943429
      287 | 2015-08-16 14:45:03+02 | 270.205780314176 | 0.189478882778515
So now that I have the values I need, I can of course do calculations on them, like, for instance, divide one by the other:
select cm.check_id,
       cm.measured_at,
       avg(cm.value) filter (where sv.name = 'wr_ios')   as wr_ios,
       avg(cm.value) filter (where sv.name = 'wr_ticks') as wr_ticks,
       avg(cm.value) filter (where sv.name = 'wr_ios')
       / avg(cm.value) filter (where sv.name = 'wr_ticks') as max_ios
  from monitoring_checkmeasurement cm
 inner join monitoring_sensorvariable sv on variable_id = sv.id
 where cm.check_id = 287
 group by cm.check_id, cm.measured_at
 order by cm.measured_at;
Which yields:
 check_id |      measured_at       |      wr_ios      |     wr_ticks      |     max_ios
----------+------------------------+------------------+-------------------+------------------
      287 | 2015-08-16 14:25:03+02 | 268.662597618636 | 0.171345577611925 | 1567.95758234929
      287 | 2015-08-16 14:30:02+02 | 249.696428706562 | 0.173318147797764 | 1440.68253601417
      287 | 2015-08-16 14:35:03+02 | 273.367319635791 | 0.146852936089821 | 1861.50394343215
      287 | 2015-08-16 14:40:02+02 | 253.674634889183 | 0.203743783943429 | 1245.06686770683
      287 | 2015-08-16 14:45:03+02 | 270.205780314176 | 0.189478882778515 | 1426.04693648117
      287 | 2015-08-16 14:50:02+02 | 246.400630315261 | 0.129145522335015 | 1907.93010752688
      287 | 2015-08-16 14:55:03+02 | 264.348570000643 | 0.192318868983909 | 1374.53267792855
So the data seems to indicate that, theoretically, this box should max out somewhere between 1200 and 1900 write IOPS.
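Instead of eyeballing the rows, the whole thing can also be wrapped in a subquery to boil that down to a single range; a sketch (the nullif guard is my own addition, protecting against samples where wr_ticks happens to be zero):

select min(max_ios) as min_ios,
       avg(max_ios) as avg_ios,
       max(max_ios) as max_ios
  from (
    select avg(cm.value) filter (where sv.name = 'wr_ios')
           -- nullif turns a zero divisor into NULL instead of an error,
           -- so completely idle samples simply drop out of the stats
           / nullif(avg(cm.value) filter (where sv.name = 'wr_ticks'), 0) as max_ios
      from monitoring_checkmeasurement cm
     inner join monitoring_sensorvariable sv on variable_id = sv.id
     where cm.check_id = 287
     group by cm.check_id, cm.measured_at
  ) sub;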
But there's much more to be seen.