Ceph CRUSH map with multiple storage tiers
At work, we're running a virtualization server that has two kinds of storage built in: an array of fast SAS disks, and another one of slow-but-huge SATA disks. We're running OSDs on both of them, and I wanted to distinguish between them when creating RBD images, so that I can choose the performance characteristics of each pool. I'm not sure whether this post is already outdated by now (Jan 2018), since there's suddenly a "class" attribute in the CRUSH map. However, here's what we're currently running.
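(For the curious: on Luminous or newer, the device-class feature should get you a similar split without a hand-built hierarchy. We haven't tried it, so treat this as an untested sketch; the class and rule names are made up, and it assumes all OSDs live under the single default root.)

# Tag SAS and SATA OSDs with custom device classes
# (existing classes have to be removed before a new one can be set)
ceph osd crush rm-device-class osd.3 osd.5 osd.4 osd.6
ceph osd crush set-device-class sas osd.3 osd.5
ceph osd crush set-device-class sata osd.4 osd.6

# One replicated rule per class, host as the failure domain
ceph osd crush rule create-replicated rmvh-sas-class default host sas
ceph osd crush rule create-replicated rmvh-sata-class default host sata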
We currently have two pools:
pool 2 'rbd-rmvh-sas' replicated size 2 min_size 1 crush_rule 1 [...]
pool 4 'rbd-rmvh-sata' replicated size 2 min_size 1 crush_rule 2 [...]
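(That's trimmed output; on your own cluster, something like this will show the full listing:)

# List all pools with their replication and CRUSH rule settings
ceph osd dump | grep '^pool'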
rbd-rmvh-sas is the one on the SAS disks; rbd-rmvh-sata resides on the SATA disks. This is achieved by carefully placing the nodes in the CRUSH map, combined with two specifically crafted CRUSH rules that place replicas accordingly.
Our CRUSH tree (as printed by ceph osd tree) currently looks like this:
ID  CLASS WEIGHT   TYPE NAME              STATUS REWEIGHT PRI-AFF
-10       10.00000 root sata
-11       10.00000     rack rmvh-sata
 -8        5.00000         host rmvh002-sata
  4   hdd  5.00000             osd.4          up  1.00000 1.00000
-12        5.00000         host rmvh003-sata
  6   hdd  5.00000             osd.6          up  1.00000 1.00000
 -1        6.00000 root default
 -7        6.00000     rack rmvh-sas
 -5        3.00000         host rmvh002
  3   hdd  3.00000             osd.3          up  1.00000 1.00000
 -9        3.00000         host rmvh003
  5   hdd  3.00000             osd.5          up  1.00000 1.00000
Note that each OSD resides on a RAID array, not just a single disk. RAID controllers have caches. Caches eliminate latency. We hate latency, so we love caches, hence we use RAID. This means we only have two OSDs per node.
The trick is having a second hierarchy of buckets in the CRUSH map, one for each kind of storage. If we were to add SSDs into the picture, we'd build another hierarchy with an -ssd suffix. Unfortunately, if you try this with the default settings in ceph.conf, you'll find that on startup every OSD moves itself into the host=$HOSTNAME bucket of the node it's running on. That's usually a good thing, but not in this scenario: we want the OSDs under the -sata hosts to stay put instead of hopping into the SAS hierarchy on every restart, which would most likely make them discard their data and start replicating other data that was never meant for them.
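For reference, a second hierarchy like the sata one above can be built at runtime with a handful of commands. This is just a sketch using our bucket names, not a copy-paste recipe:

# Create the extra root, rack and host buckets for the SATA tier
ceph osd crush add-bucket sata root
ceph osd crush add-bucket rmvh-sata rack
ceph osd crush add-bucket rmvh002-sata host
ceph osd crush add-bucket rmvh003-sata host

# Wire them together: rack under the new root, hosts under the rack
ceph osd crush move rmvh-sata root=sata
ceph osd crush move rmvh002-sata root=sata rack=rmvh-sata
ceph osd crush move rmvh003-sata root=sata rack=rmvh-sata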
So, for this kind of setup, you'll want to have the following option in ceph.conf:
[global]
    osd crush update on start = false
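The flip side is that, with automatic placement disabled, you have to put each OSD into the right spot yourself. For one of our SATA OSDs that would look roughly like this (weight and location taken from the tree above):

# Pin osd.4 (weight 5.0) into the SATA hierarchy by hand
ceph osd crush set osd.4 5.0 root=sata rack=rmvh-sata host=rmvh002-sata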
Now, all that's left to do is create a ruleset that only chooses OSDs from the SAS hierarchy:
rule rmvh-sas-ruleset {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take rmvh-sas
    step chooseleaf firstn 0 type host
    step emit
}
And another one for the SATA hierarchy:
rule rmvh-sata-ruleset {
    id 2
    type replicated
    min_size 1
    max_size 10
    step take rmvh-sata
    step chooseleaf firstn 0 type host
    step emit
}
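One way to get rules like these into the cluster is the usual decompile/edit/recompile dance with crushtool:

# Pull the current CRUSH map and decompile it to text...
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# ...add the rules above to crushmap.txt, then recompile and inject it
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin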
And create pools that use those rulesets. Done!
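For completeness, creating the pools against those rules and carving out a first image looks roughly like this (the PG counts and the image name are made up for the example):

# One pool per tier, each pinned to its CRUSH rule
ceph osd pool create rbd-rmvh-sas 128 128 replicated rmvh-sas-ruleset
ceph osd pool create rbd-rmvh-sata 128 128 replicated rmvh-sata-ruleset
ceph osd pool set rbd-rmvh-sas size 2
ceph osd pool set rbd-rmvh-sata size 2

# An RBD image that is guaranteed to live on the fast SAS tier (size in MB)
rbd create rbd-rmvh-sas/testimage --size 20480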