why = WHY_NOT; (Posts about storage)

Ceph Bluestore/Filestore latency

Svedrin — Mon, 01 Feb 2021 10:08:06 GMT

We’ve been looking deeply into Ceph Storage latency, comparing BlueStore and FileStore, and looking at ways to get below the magic 2ms write latency mark in our Proxmox clusters. Here’s what we found.

The endeavour was sparked by our desire to run ZooKeeper on our Proxmox Clusters. ZooKeeper is highly sensitive to IO latency: If writes are too slow, it will log messages like this one:

fsync-ing the write ahead log in SyncThread:1 took 1376ms which will adversely effect operation latency.File size is 67108880 bytes. See the ZooKeeper troubleshooting guide

Subsequently, ZooKeeper nodes will consider themselves broken and restart. If the thing that’s slow is your Ceph cluster, this means that all three VMs will be affected at the same time, and you’ll end up losing your ZooKeeper cluster altogether.

We mitigated this by moving ZooKeeper to local disks, and getting rid of the Ceph layer in between. But that is obviously not a satisfactory solution, so we’ve spent some time looking into Ceph latency.

Unfortunately, there’s not a lot of advice to be found other than “buy faster disks”. This didn’t seem to cut it for us: Our hosts were reporting 0.1ms of disk latency, while the VMs measured 2ms of latency. If our hosts had weighed in at 1.8ms, I’d be willing to believe that we have a disk latency issue - but not with the discrepancy that we were seeing. So let’s dive in and see if we can find other issues.
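One way to see the latency a guest actually experiences is to time small synchronous writes, which is roughly the pattern behind ZooKeeper's write-ahead log warning. The following is a minimal sketch and not the measurement method used in this post: the function name `measure_fsync_latency`, the 4 KiB block size and the sample count are assumptions chosen for illustration.

```python
import os
import tempfile
import time


def measure_fsync_latency(directory, samples=100, block_size=4096):
    """Time `samples` write+fsync cycles in `directory`; return latencies in ms."""
    payload = b"\0" * block_size
    latencies_ms = []
    # Create the scratch file on the filesystem under test.
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to flush the data to stable storage
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    finally:
        # Always clean up the scratch file, even if a write fails.
        os.close(fd)
        os.unlink(path)
    return latencies_ms
```

Calling it once against a directory on the host's local disk and once inside a VM backed by Ceph gives a rough like-for-like comparison. Looking at the median and the maximum of the returned list is usually more telling than the mean.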

Speeding up Ceph recovery

Svedrin — Mon, 07 Jan 2019 11:44:04 GMT

Note to self: Here’s the command to speed up Ceph recovery by backfilling more than one PG at a time:

ceph tell osd.* injectargs '--osd_max_backfills 16'

Filesystem Tuning

Svedrin — Tue, 03 Jul 2018 20:14:53 GMT

A while back, I promised I’d write about file system tuning someday. Since it hasn’t really happened yet, I thought I’d do it now.

Setting up Ceph FS on a Proxmox cluster

Svedrin — Mon, 02 Jul 2018 12:12:38 GMT

Proxmox apparently does not yet support running CephFS, but it can be done using a bunch of manual steps. Here’s how.

Manually creating a Ceph OSD

Svedrin — Wed, 24 Jan 2018 15:44:15 GMT

When setting up the Ceph Server scenario for Proxmox, the PVE guide suggests using the pveceph createosd command for creating OSDs. Unfortunately, this command assumes that you want to dedicate a complete hard drive to your OSD and format it using ZFS. I tend to disagree: Not only do I prefer RAIDs because their caches eliminate latency, I also always have LVM in between so that I'm flexible with the disk space allocation. And I'm not really a huge fan of ZFS ever since it bit me, although they have fixed that issue by now. Still, I'm staying with my trusty XFS.

That of course means that I'll have to create my OSDs differently because pveceph createosd isn't going to work. Here's how I do it.

Ceph CRUSH map with multiple storage tiers

Svedrin — Wed, 24 Jan 2018 15:40:01 GMT

At work, we're running a virtualization server that has two kinds of storage built-in: An array of fast SAS disks, and another one of slow-but-huge SATA disks. We're running OSDs on both of them, and I wanted to distinguish between them when creating RBD images, so that I could choose the performance characteristics of the pool. I'm not sure whether this post is outdated by now (Jan 2018): there's a "class" thing in the CRUSH map all of a sudden. However, here's what we're currently running.

Locating dying disks in LSI RAID using StorCLI

Svedrin — Wed, 12 Apr 2017 11:14:31 GMT

I often find myself in need of locating disks in an LSI RAID that are not quite dead yet, but in the process of dying. Google knows how to do that using MegaCli, but I totally hate that tool and want to do the same thing using storcli instead, which is a bit less insane. Here's how.

Storage fun with Steam

Svedrin — Tue, 22 Nov 2016 18:29:31 GMT

Yesterday evening, I enjoyed a nice game of Dishonored 2. After dealing with the Crown Killer in a non-lethal and somewhat stealthy way, I shut off my PC, went to sleep, and set out to continue my endeavour tonight. When I started my PC and fired up Dishonored again, my PC completely froze. I hit the reset button, started Task Manager before starting Dishonored, and I discovered that Steam chose to completely smash my hard drive to pieces. Here's what I saw.

Ceph CRUSH map editing script

Svedrin — Wed, 13 Jul 2016 14:43:50 GMT

If you're working with Ceph, you'll find yourself updating the CRUSH map sooner or later. For that, you regularly need to get the current map, decompile it, edit it, compile it and upload it again. Here's a little script that makes this easier.
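The get, decompile, edit, compile, upload cycle described above can be sketched as follows. This is an illustration, not the script from the post: the function name `edit_crush_map`, the temporary file names and the injectable `run` parameter are assumptions. The `ceph osd getcrushmap`, `crushtool -d`, `crushtool -c` and `ceph osd setcrushmap` commands are the standard ones for this round trip.

```python
import subprocess
import tempfile
from pathlib import Path


def edit_crush_map(edit, run=subprocess.run):
    """Fetch, decompile, edit, recompile and upload the CRUSH map.

    `edit` receives the path of the decompiled text map and changes it in place.
    `run` is injectable so the command sequence can be checked without a cluster.
    """
    with tempfile.TemporaryDirectory() as tmp:
        binary = Path(tmp) / "crushmap.bin"
        text = Path(tmp) / "crushmap.txt"
        compiled = Path(tmp) / "crushmap.new"

        # 1. Get the current (binary) map from the cluster.
        run(["ceph", "osd", "getcrushmap", "-o", str(binary)], check=True)
        # 2. Decompile it into editable text.
        run(["crushtool", "-d", str(binary), "-o", str(text)], check=True)
        # 3. Let the caller edit the text map.
        edit(text)
        # 4. Compile the edited text back into a binary map.
        run(["crushtool", "-c", str(text), "-o", str(compiled)], check=True)
        # 5. Upload the new map to the cluster.
        run(["ceph", "osd", "setcrushmap", "-i", str(compiled)], check=True)
```

For interactive use, `edit` could open an editor, for example `edit_crush_map(lambda p: subprocess.run(["vi", str(p)], check=True))`. Because every step uses `check=True`, a failed compile raises an exception before anything is uploaded.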

Disk alignment and caching

Svedrin — Sun, 29 Nov 2015 19:29:41 GMT

So now that we've conducted measurements and run benchmarks, what do we do with the results? How does the system need to be built to deliver good performance? What options do we have?