We’ve been taking a deep look at Ceph storage latency, comparing
BlueStore and FileStore, and exploring ways to get below the magic
2ms write latency mark in our Proxmox clusters. Here’s what we
found.
The endeavour was sparked by our desire to run ZooKeeper on our
Proxmox clusters. ZooKeeper is highly sensitive to I/O latency: if
writes are too slow, it will log messages like this one:
    fsync-ing the write ahead log in SyncThread:1 took 1376ms which will adversely effect operation latency. File size is 67108880 bytes. See the ZooKeeper troubleshooting guide
Subsequently, the ZooKeeper nodes will consider themselves broken and
restart. If the slow thing is your Ceph cluster, all three VMs will
be affected at the same time, and you’ll end up losing your
ZooKeeper cluster altogether.
We mitigated this by moving ZooKeeper to local disks, and getting rid
of the Ceph layer in between. But that is obviously not a satisfactory
solution, so we’ve spent some time looking into Ceph latency.
Unfortunately, there’s not a lot of advice to be found beyond
“buy faster disks”. That didn’t add up for us: our hosts were
reporting 0.1ms of disk latency, while the VMs measured 2ms.
If our hosts had weighed in at 1.8ms, I’d be willing to believe
we had a disk latency issue, but not with the discrepancy we were
seeing. So let’s dive in and see what other issues we can find.
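One way to take this kind of measurement yourself, from inside a VM or on the host, is to time write-plus-fsync pairs, which is essentially what ZooKeeper does when appending to its write-ahead log. A minimal sketch (the 4 KiB block size and iteration count are arbitrary choices for illustration, not the parameters we used):

```python
import os
import tempfile
import time


def fsync_latency(path, iterations=100, block=b"\0" * 4096):
    """Time write+fsync pairs, mimicking a write-ahead-log append."""
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(iterations):
            start = time.perf_counter()
            os.write(fd, block)
            os.fsync(fd)  # force the write down to stable storage
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    latencies.sort()
    return {
        "p50_ms": latencies[len(latencies) // 2] * 1000,
        "p99_ms": latencies[int(len(latencies) * 0.99) - 1] * 1000,
    }


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        print(fsync_latency(f.name))
```

Run this on the host’s local disk and inside a Ceph-backed VM and compare the medians; a host-versus-VM gap like the 0.1ms-versus-2ms one above points at the storage stack between them rather than at the disks.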
Read more…