Ceph CRUSH map with multiple storage tiers
At work, we're running a virtualization server that has two kinds of storage built in: an array of fast SAS disks, and another one of slow-but-huge SATA disks. We're running OSDs on both of them, and I wanted to distinguish between them when creating RBD images, so that I can choose the performance characteristics of each pool. I'm not sure whether this post is already outdated by now (Jan 2018), since there's suddenly a "class" attribute in the CRUSH map. However, here's what we're currently running.
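(For the curious: on Luminous or newer, the device-class feature should get you a similar split without a hand-built hierarchy. We haven't tried it, so treat this as an untested sketch; the class and rule names are made up, and it assumes all OSDs live under the single default root.)

# Tag SAS and SATA OSDs with custom device classes
# (existing classes have to be removed before a new one can be set)
ceph osd crush rm-device-class osd.3 osd.5 osd.4 osd.6
ceph osd crush set-device-class sas osd.3 osd.5
ceph osd crush set-device-class sata osd.4 osd.6

# One replicated rule per class, host as the failure domain
ceph osd crush rule create-replicated rmvh-sas-class default host sas
ceph osd crush rule create-replicated rmvh-sata-class default host sata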
We currently have two pools:
pool 2 'rbd-rmvh-sas' replicated size 2 min_size 1 crush_rule 1 [...]
pool 4 'rbd-rmvh-sata' replicated size 2 min_size 1 crush_rule 2 [...]
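(That's trimmed output; on your own cluster, something like this will show the full listing:)

# List all pools with their replication and CRUSH rule settings
ceph osd dump | grep '^pool'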
rbd-rmvh-sas is the one on the SAS disks; rbd-rmvh-sata resides on the SATA disks. This is achieved by carefully placing the nodes in the CRUSH map, combined with two specifically crafted CRUSH rules that place replicas accordingly.
Our CRUSH tree (as printed by ceph osd tree) currently looks like this:
ID  CLASS WEIGHT   TYPE NAME              STATUS REWEIGHT PRI-AFF
-10       10.00000 root sata
-11       10.00000     rack rmvh-sata
 -8        5.00000         host rmvh002-sata
  4   hdd  5.00000             osd.4          up  1.00000 1.00000
-12        5.00000         host rmvh003-sata
  6   hdd  5.00000             osd.6          up  1.00000 1.00000
 -1        6.00000 root default
 -7        6.00000     rack rmvh-sas
 -5        3.00000         host rmvh002
  3   hdd  3.00000             osd.3          up  1.00000 1.00000
 -9        3.00000         host rmvh003
  5   hdd  3.00000             osd.5          up  1.00000 1.00000
Note that each OSD resides on a RAID array, not just a single disk. RAID controllers have caches. Caches eliminate latency. We hate latency, so we love caches, hence we use RAID. This means we only have two OSDs per node.
The trick is having a second hierarchy of buckets in the CRUSH map, one for each kind of storage. If we were to add SSDs into the picture, we'd build another hierarchy with an -ssd suffix. Unfortunately, if you try this with the default settings in ceph.conf, you'll find that on startup every OSD moves itself into the host=$HOSTNAME bucket of the node it's running on. That's usually a good thing, but not in this scenario: we want the OSDs under the -sata hosts to stay put instead of hopping into the SAS hierarchy on every restart, which would most likely make them discard their data and start replicating other data that was never meant for them.
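For reference, a second hierarchy like the sata one above can be built at runtime with a handful of commands. This is just a sketch using our bucket names, not a copy-paste recipe:

# Create the extra root, rack and host buckets for the SATA tier
ceph osd crush add-bucket sata root
ceph osd crush add-bucket rmvh-sata rack
ceph osd crush add-bucket rmvh002-sata host
ceph osd crush add-bucket rmvh003-sata host

# Wire them together: rack under the new root, hosts under the rack
ceph osd crush move rmvh-sata root=sata
ceph osd crush move rmvh002-sata root=sata rack=rmvh-sata
ceph osd crush move rmvh003-sata root=sata rack=rmvh-sata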
So, for this kind of setup, you'll want to have the following option in ceph.conf:
[global]
    osd crush update on start = false
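The flip side is that, with automatic placement disabled, you have to put each OSD into the right spot yourself. For one of our SATA OSDs that would look roughly like this (weight and location taken from the tree above):

# Pin osd.4 (weight 5.0) into the SATA hierarchy by hand
ceph osd crush set osd.4 5.0 root=sata rack=rmvh-sata host=rmvh002-sata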
Now, all that's left to do is create a ruleset that only chooses OSDs from the SAS hierarchy:
rule rmvh-sas-ruleset {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take rmvh-sas
    step chooseleaf firstn 0 type host
    step emit
}
And another one for the SATA hierarchy:
rule rmvh-sata-ruleset {
    id 2
    type replicated
    min_size 1
    max_size 10
    step take rmvh-sata
    step chooseleaf firstn 0 type host
    step emit
}
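One way to get rules like these into the cluster is the usual decompile/edit/recompile dance with crushtool:

# Pull the current CRUSH map and decompile it to text...
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# ...add the rules above to crushmap.txt, then recompile and inject it
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin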
And create pools that use those rulesets. Done!
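For completeness, creating the pools against those rules and carving out a first image looks roughly like this (the PG counts and the image name are made up for the example):

# One pool per tier, each pinned to its CRUSH rule
ceph osd pool create rbd-rmvh-sas 128 128 replicated rmvh-sas-ruleset
ceph osd pool create rbd-rmvh-sata 128 128 replicated rmvh-sata-ruleset
ceph osd pool set rbd-rmvh-sas size 2
ceph osd pool set rbd-rmvh-sata size 2

# An RBD image that is guaranteed to live on the fast SAS tier (size in MB)
rbd create rbd-rmvh-sas/testimage --size 20480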