Building a storage abstraction layer

Svedrin

2014-03-02 22:29

While developing the next version of openATTIC, we ran into a problem: our previous design, which had been pretty straightforward, was being challenged by the fact that there are lots of ways to architect a storage system, and putting filesystems on top of LVM logical volumes and sharing them via CIFS, NFS or FTP is just one of them.

Up until the 1.0.7 release, openATTIC operated on LVM volume groups and pretty much didn't care about what happened below those volume groups. On one hand, this was great, because it provided what we needed and otherwise got out of the way, allowing us to architect storage systems that delivered the space and performance our customers needed. So for a while, this worked well.

The downside to this approach is that it's not all that flexible. For instance, automatically deploying volumes mirrored by DRBD has been a challenge, mainly because it didn't fit well into the design. Only by letting the LogicalVolume model know that it may or may not be mirrored did we succeed in making this work, but that approach clearly violates good software design. There had to be a better way.
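To make the smell concrete, here is a minimal sketch of what "letting the LogicalVolume model know about mirroring" can look like. This is not openATTIC's actual code: the constructor arguments, the drbd_minor attribute and the device_path method are hypothetical names chosen for illustration.

```python
class LogicalVolume:
    """Hypothetical LVM-only model that also has to know about DRBD."""

    def __init__(self, vg, name, drbd_minor=None):
        self.vg = vg
        self.name = name
        self.drbd_minor = drbd_minor  # None means "not mirrored"

    def device_path(self):
        # The LV model branches on a concern that is not its own.
        # Every method that touches the device would need the same check.
        if self.drbd_minor is not None:
            return f"/dev/drbd{self.drbd_minor}"
        return f"/dev/{self.vg}/{self.name}"


plain = LogicalVolume("vgdata", "share01")
mirrored = LogicalVolume("vgdata", "share02", drbd_minor=0)
```

It works, but each new technology stacked on top of an LV would add another special case to the same class.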

Then along came ZFS, which completely shredded our design by replacing the very key component it relied upon: With ZFS deployed directly onto your disks, there's simply no such thing as a volume group or logical volumes. So clearly, we had hit a hard limit of what our design was capable of. Something had to be done.

Finding abstractions

When it comes to designing a system, the hard part is not building something that handles what you know needs to be handled. It's building a system that is flexible enough to be extended in ways that, today, you don't even know exist. In programming, we've been doing that with object-oriented programming for decades now by thinking about what's going on in the real world in terms of abstractions: Look at what you've got, try to find the concepts behind the things you find in the real world, then design your software based upon those concepts. This way, exchanging the technology isn't going to cause a problem as long as the concept still applies.

Sadly, that's easier said than done, especially if you have a multitude of objects. We have LVM volume groups and logical volumes; ZFS pools, subvolumes and snapshots; same goes for Btrfs; volumes mirrored using DRBD; Hardware RAID; Software RAID; SSD caches; single standard disks; practically everything in all kinds of combinations. Then there are file systems that have more features than others: Some take care of deduplication, redundancy and CIFS/NFS sharing themselves; others are better suited for running VMs. By stacking components on top of one another you get different sets of features, depending on what those components are and the order in which you stack them. And then there's still stuff being developed right now that is yet to be released to the public, and we'd like our new design to be able to handle whatever crazy stuff people choose to do in the future.

So what are the basic concepts behind all those things here?

Disks, RAIDs, LVs and DRBD mirrors all provide a block device. These may or may not show up in the system as /dev/something.
Block devices can be grouped into pools and divided into partitions. Those partitions may or may not be block devices or file systems.
ZFS, XFS, Btrfs, ext4 and other file systems provide something to be mounted and files put therein. They may or may not reside on a block device. (Ultimately, they all do, but they don't have to be on a disk directly. They may well be in a zpool residing on 10 disks.) They are addressed using their mount point, e.g. /media/something.

Having found those concepts, we labeled them BlockVolume, VolumePool and FileSystemVolume, respectively, and started implementing them. Everything went well at first: The existing code could be migrated nicely, and we found that building a user interface around this model was easier. But most importantly, it allowed for everything to be in one place, regardless of what the object actually was — whereas before, we had to have one GUI page for LVM, one for ZFS, one for DRBD and so forth, which is confusing and not at all fun to use.
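As a rough sketch of what these three concepts look like in code, assuming plain Python abstract classes rather than openATTIC's real models: only the names BlockVolume, VolumePool and FileSystemVolume come from our design. The method names, the DrbdMirror class and the example volume names are made up for illustration.

```python
from abc import ABC, abstractmethod
from typing import Optional


class BlockVolume(ABC):
    """Concept: something that provides a block device."""

    @abstractmethod
    def device_path(self) -> Optional[str]:
        """Path such as /dev/something, or None if there is no device node."""


class VolumePool(ABC):
    """Concept: something that groups storage and can be divided up."""

    @abstractmethod
    def free_megs(self) -> int:
        """Space still available for new volumes, in MiB."""


class FileSystemVolume(ABC):
    """Concept: something that can be mounted and hold files."""

    @abstractmethod
    def mount_point(self) -> str:
        """Path such as /media/something."""


class LogicalVolume(BlockVolume):
    def __init__(self, vg: str, name: str):
        self.vg, self.name = vg, name

    def device_path(self) -> Optional[str]:
        return f"/dev/{self.vg}/{self.name}"


class DrbdMirror(BlockVolume):
    def __init__(self, minor: int):
        self.minor = minor

    def device_path(self) -> Optional[str]:
        return f"/dev/drbd{self.minor}"


def list_devices(volumes):
    """Works for any BlockVolume, whatever technology backs it."""
    return [v.device_path() for v in volumes]


devices = list_devices([LogicalVolume("vgdata", "share01"), DrbdMirror(0)])
```

The point is list_devices: it neither knows nor cares whether it is looking at LVM or DRBD, which is exactly what the LVM-only design could not offer.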

But once again, ZFS challenged the design when we took an LVM logical volume and formatted it as a zpool using ZFS. Now we had a BlockVolume (the LV) that had been turned into a VolumePool (the zpool), which was also a FileSystemVolume because it could be mounted. So we had three objects, while in reality we were talking about one single entity which just happened to implement all three of our concepts. The design was lacking a way to express that, because it didn't distinguish the information that "there is something" from "it's a VolumePool". So finally, we added the StorageObject model, which captures that last piece of information.

In the new, abstraction-based model, the information is split up into the following parts:

There's an object called zmirror_apt.
It is block-based, because it is an LV named /dev/vgfaithdata/zmirror_apt.
It is also a volume pool, namely a zpool named zmirror_apt.
It is also a file system volume, namely a ZFS volume mounted at /media/zmirror_apt.
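The zmirror_apt example can be sketched as follows. Again, this is an illustration under assumptions and not the shipped implementation: the concepts are reduced to plain records here, and every field name as well as the roles() helper is hypothetical. Only the class names and the zmirror_apt values are taken from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlockVolume:
    device_path: str


@dataclass
class VolumePool:
    pool_name: str


@dataclass
class FileSystemVolume:
    fs_type: str
    mount_point: str


@dataclass
class StorageObject:
    """States only that 'there is something'; roles are attached separately."""

    name: str
    block_volume: Optional[BlockVolume] = None
    volume_pool: Optional[VolumePool] = None
    filesystem_volume: Optional[FileSystemVolume] = None

    def roles(self) -> List[str]:
        """Names of the concepts this one entity implements."""
        found = []
        if self.block_volume is not None:
            found.append("BlockVolume")
        if self.volume_pool is not None:
            found.append("VolumePool")
        if self.filesystem_volume is not None:
            found.append("FileSystemVolume")
        return found


zmirror_apt = StorageObject(
    name="zmirror_apt",
    block_volume=BlockVolume("/dev/vgfaithdata/zmirror_apt"),
    volume_pool=VolumePool("zmirror_apt"),
    filesystem_volume=FileSystemVolume("zfs", "/media/zmirror_apt"),
)
print(zmirror_apt.roles())
```

One entity, three roles. A plain ext4 file system on a single disk would simply leave volume_pool empty.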

This is the design that openATTIC 1.1.0 is going to ship with.

The whole development process is a perfect example of how your code speaks to you. While this may sound crazy, think about it this way: When we first had to break the rules in order to get the DRBD mirror running, we knew something was wrong with the LVM-only approach. By the time ZFS hit, we had already prepared for having to rethink the whole thing. With the new design, the code eventually ended up screaming for the StorageObject class to be added. So we did, and everything worked out. I'm pretty confident that this design will serve us well for quite some time, simply because the code tells me it will.

The battle™

When building software, there's always an ongoing battle between flexibility and standardization. I think I can identify a couple of stages during a development process:

Initial design

Let's face it, your initial design is going to suck at some point. You're somewhat new to the field, and if not, you still don't quite know how to express the concepts of whatever you're building a tool for in code, and which abstractions you're going to need. But because you're gonna have to start somewhere, you're compromising on something that's good enough to last a couple years, but that's not going to last forever.

Actually, what you're doing at this stage is standardization: You're identifying a certain, narrow set of features that you're going to support and that's it, allowing for an implementation to be done in a limited amount of time. Thereby, you're committing to building live systems in a way that fits the model (anything else would be a stupid thing to do).

Shadow of doubt

So, you've made it past the first release and everything's running smoothly, until someone tries to set things up a bit differently than what you usually do, and they have a good reason for that too. You can probably map what they're intending onto the existing structure, but it's always going to be a workaround and the next guy is already waiting with another slightly-but-crucially-different setup.

Problem is: In the first stage, you standardized too much, and while it did work out for a while, it doesn't anymore. But still: Don't go about fixing everything too quickly either. You have to be able to tell the full extent of the suckage in your initial design, or else your fix will just be a hack that accumulates even more problems instead of fixing them.

Redesign

Someday, you'll be at the point where you can't take it anymore and start redesigning stuff. You feel perfectly confident that you know what needs to be done, and you build a shiny new thing that gets rid of all the limitations the old design had. It's thought through, it's extensible and everyone is happy.

Basically what you're doing now is finding abstractions that, when done right, fit everything you have to do as long as the underlying concepts stay the same. This way, you'll be flexible enough to do whatever the user wants you to do.

Bad awakening

But this causes a whole different set of issues. Now that the software can do whatever you want it to do, you have to know what it is you want it to do, and this can prove to be trickier than you'd think.

First of all, the real world is a pretty complex thing. There's all kinds of stuff you can do that has all kinds of implications. Do you want to group your hosts by the task they're doing (webservers, database servers, file servers) or by the people that operate them? Maybe both? Maybe something completely different? And who is to be notified about an outage? Does that depend solely on the kind of host that went down, or on the specific service? Do you want file-based, block-based or object-based storage? Which protocol do you want to use to access it? And what the hell did you do to make the damn thing so slow?

Up until stage 2, these questions used to be answered for you by what the software used to support. If the software didn't allow things to be done, you were limited in the set of choices you had, so you had some kind of guideline on what to do. But we threw out all these limitations in the redesign phase, because they always sucked for some people.

So what do we do now?

Back to the defaults

Well, we do what we've always done: We standardize, but this time we do it in the software's configuration instead of in its code. Stuff that used to be a bad idea still is a bad idea, but we added support for it anyway because in this one certain situation, it was the right thing to do. In other situations it is still the wrong thing to do.

Oddly enough, at this stage, the whole process starts over. The configuration standards ("best practices" or whatever you want to call them) that you define now are results of phase one, and they will undergo the same process — just that this time, changes won't require the source code to be changed.
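One way to picture "standardizing in configuration" is a policy that lives in data and only warns, while the code itself stays permissive. The sketch below is hypothetical throughout: the BEST_PRACTICES keys, their values and the review_setup function are invented for illustration and are not openATTIC settings.

```python
# Hypothetical site policy: plain data that can change without a code change.
BEST_PRACTICES = {
    "allowed_filesystems": ["ext4", "xfs", "zfs"],
    "min_pool_members": 2,
}


def review_setup(setup, policy=BEST_PRACTICES):
    """Return warnings for deviations from the policy; never refuse the setup."""
    warnings = []
    if setup["filesystem"] not in policy["allowed_filesystems"]:
        warnings.append(
            f"filesystem {setup['filesystem']!r} is outside the site defaults"
        )
    if setup["pool_members"] < policy["min_pool_members"]:
        warnings.append("pool has fewer members than the site defaults recommend")
    return warnings


usual = review_setup({"filesystem": "xfs", "pool_members": 4})
unusual = review_setup({"filesystem": "btrfs", "pool_members": 1})
```

The unusual setup still goes through, it just comes with a list of reasons to think twice. When the best practices change, only the policy data has to change.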

At some point, you have to take a decision on what you want the software to be. For instance, comparing openATTIC and OpenStack: OpenStack is radically focused on standardization and on hiding implementation details from the user, because it is targeted at users who do not want to have to think about the other options that they could have had. So by only offering a very limited set of features, OpenStack gets the job done. For openATTIC, this wouldn't be the right approach: We have to support more than just OpenStack, and there's even more than one way to set up OpenStack too.

However, having completed the first iteration is what enables you to think freely, without being limited by what the software used to provide. Most importantly, it leads to a stable API that does not require any major changes anymore. This means you can use that API for all kinds of setups and depend on it when building the system you'd like to have, while resting assured that you'll still be able to stay compatible with products developed in the future.