XFS modifications for SMR drives

Hi Dave,

I finally got around to reading your paper, and here are some
suggestions/fixes:

> This assumes a userspace ZBC implementation such as libzbc will do
> all the heavy lifting work of laying out the structure of the
> filesystem, and that it will perform things like zone write pointer
> checking/resetting before the filesystem is mounted.

The prototype implementation I did mapped the 'RESET WRITE POINTER'
command to the 'discard' functionality, so if mkfs issues a 'discard' on
the disk we'll be fine.
The representation of the zone tree is still being discussed, but the
block layer will have knowledge of the zone layout, and this will be
exported, too. Presumably via sysfs.
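
As an illustration of what that mapping gives us in userspace, a minimal
sketch (the helper name and the zone start/length values are made up,
this is not actual mkfs code):

/*
 * Sketch only: with discard mapped to RESET WRITE POINTER as described
 * above, mkfs can reset a zone by discarding its LBA range through the
 * standard BLKDISCARD ioctl. Zone start/length are caller-supplied
 * placeholders here, not values queried from the drive.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* BLKDISCARD */

static int reset_zone(int fd, uint64_t zone_start, uint64_t zone_len)
{
	uint64_t range[2] = { zone_start, zone_len };	/* in bytes */

	return ioctl(fd, BLKDISCARD, &range);
}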

> Recent research has shown that 6TB seagate drives have a 20-25GB
> CMR zone, which is more than enough for our purposes. Information
> from other vendors indicate that some drives will have much more
> CMR, hence if we design for the known sizes in the Seagate drives
> we will be fine for other drives just coming onto the market
> right now.

Please, cut out this paragraph. _NONE_ of the disks I've been working
with had such small zones, and even the Seagate one had identical zone
sizes. While it might be true, the information above is restricted to a
single drive type from a single manufacturer, and is in no way relevant
to any other SMR drive.

The implementations I've seen all have an identical zone size, with a CMR
zone at the beginning and the end of the disk (primarily to support GPT
partition tables). There are provisions in the spec to have the last
zone of a different size (to accommodate various disk sizes), but I've
been advocating hard to have all zones of identical sizes.
Let's see...

> For host managed/aware drives, we are going to assume that we can
> use this area directly for filesystem metadata - for our own
> mapping tables and things like the journal, inodes, directories
> and free space tracking. We are also going to assume that we can
> find these regions easily in the ZBC information, and that they
> are going to be contiguous rather than spread all over the drive.

Yes, this will be true. Either the device is host-aware (in which
case it doesn't matter, as the firmware will cover up any botched
alignment from our side), or the device is host-managed, which out of
necessity will have at least one CMR zone at the start, just to not
annoy customers :-)

> The log doesn't actually need to track the zone write pointer,
> though log recovery will need to limit the recovery head to the
> current write pointer of the lead zone.  Modifications here are
> limited to the function that finds the head of the log, and can
> actually be used to speed up the search algorithm.

Hmm. Can't we always align the log to start at the _start_ of the zone?
I.e. restrict ourselves to the simple case of having two (or more) log
zones, one active and one inactive, and always have the head of the log
at the start of the zone?
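
To make that concrete, a rough sketch of what picking the active log
zone could look like under such a scheme (struct and function names are
made up, this is not actual XFS log code):

#include <stdint.h>

/*
 * Sketch, not XFS code: with two dedicated log zones and the log head
 * pinned to the zone start, the active zone is simply the one whose
 * write pointer has advanced, and recovery never reads past that
 * write pointer.
 */
struct log_zone {
	uint64_t	start;	/* first block of the zone */
	uint64_t	wp;	/* current write pointer */
};

/* Return the log zone that has been written to, i.e. the active one. */
static struct log_zone *active_log_zone(struct log_zone *a,
					struct log_zone *b)
{
	if (a->wp > a->start)
		return a;
	if (b->wp > b->start)
		return b;
	return a;	/* freshly made filesystem: both zones are empty */
}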

> What we need is a mechanism for tracking the location of zones
> (i.e. start LBA), free space/write pointers within each zone,
> and some way of keeping track of that information across mounts.
> If we assign a real time bitmap/summary inode pair to each zone,
> we have a method of tracking free space in the zone. We can
> use the existing bitmap allocator with a small tweak (sequentially
> ascending, packed extent allocation only) to ensure that newly
> written blocks are allocated in a sane manner.

That mechanism is already implemented in my prototype; the request queue
contains an rbtree storing the zone layout and the write pointer.
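
For reference, a minimal sketch of that kind of zone tree (hypothetical
struct and function names, not the actual prototype code):

#include <linux/rbtree.h>
#include <linux/types.h>

/*
 * Sketch of a per-device zone cache hung off the request queue.
 * Field and function names are assumptions, not the prototype's.
 */
struct blk_zone {
	struct rb_node	node;
	sector_t	start;	/* first sector of the zone */
	sector_t	len;	/* zone length in sectors */
	sector_t	wp;	/* cached write pointer */
	unsigned int	cond;	/* zone condition */
};

/* Find the cached zone containing @sector, or NULL if none is cached. */
static struct blk_zone *blk_lookup_zone(struct rb_root *root, sector_t sector)
{
	struct rb_node *n = root->rb_node;

	while (n) {
		struct blk_zone *zone = rb_entry(n, struct blk_zone, node);

		if (sector < zone->start)
			n = n->rb_left;
		else if (sector >= zone->start + zone->len)
			n = n->rb_right;
		else
			return zone;
	}
	return NULL;
}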

> If we arrange zones into zone groups, we also have a method for
> keeping new allocations out of regions we are re-organising. That
> is, we need to be able to mark zone groups as "read only" so the
> kernel will not attempt to allocate from them while the cleaner
> is running and re-organising the data within the zones in a zone
> group. This ZG also allows the cleaner to maintain some level of
> locality to the data that it is re-arranging.

The current ZBC spec already has provisions for a 'read-only' zone,
so we could set the zone state to 'read-only' in the in-kernel zone
representation for these kinds of operations. Or even add an internal
zone state here.
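
Roughly along these lines; the values follow the ZBC zone conditions as
I recall them, and the internal state at the end is purely an assumption
of mine:

/*
 * Sketch of an in-kernel zone condition enum: the ZBC-defined
 * conditions plus one internal state that never leaves the kernel.
 * Names and the internal value are assumptions.
 */
enum zone_cond {
	ZONE_COND_NOT_WP	= 0x0,	/* conventional (CMR) zone */
	ZONE_COND_EMPTY		= 0x1,
	ZONE_COND_IMP_OPEN	= 0x2,
	ZONE_COND_EXP_OPEN	= 0x3,
	ZONE_COND_CLOSED	= 0x4,
	ZONE_COND_READONLY	= 0xd,
	ZONE_COND_FULL		= 0xe,
	ZONE_COND_OFFLINE	= 0xf,
	/* internal only: cleaner owns this zone, allocator must skip it */
	ZONE_COND_INTERNAL_CLEANING = 0x10,
};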

> Mkfs is going to have to integrate with the userspace zbc libraries
> to query the layout of zones from the underlying disk and then do
> some magic to lay out all the necessary metadata correctly. I don't
> see there being any significant challenge to doing this, but we
> will need a stable libzbc API to work with and it will need to be
> packaged by distros.

I'd rather define a kernel API here, as the zone information will
need to be present in the kernel, too. At least for host-managed devices;
for host-aware we might get away with not having it in-kernel,
but since we'll have to have an in-kernel implementation anyway, we
might as well use it for both types.
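
Purely as a strawman, the kind of interface I have in mind would be
something like this; every name below is a placeholder, none of it
exists today:

#include <linux/blkdev.h>

/*
 * Strawman kernel API for zone information. All names are placeholders.
 */

/* Look up the cached zone descriptor containing @sector. */
struct blk_zone *blkdev_lookup_zone(struct block_device *bdev,
				    sector_t sector);

/* Reset the write pointer of the zone starting at @zone_start. */
int blkdev_reset_zone(struct block_device *bdev, sector_t zone_start);

/* Flip the in-kernel zone state to/from read-only for the cleaner. */
int blkdev_set_zone_readonly(struct block_device *bdev, sector_t zone_start,
			     bool ro);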

> == Quantification of Random Write Zone Capacity

That will pose a problem. The drives I've seen have a single CMR zone in
front and another one at the end. So asking for 2G CMR is a bit much
here. Things are not set in stone, but I doubt we'll be getting a
significant increase here.

Nevertheless, I'll put my feelers out. Having a 2G CMR zone would indeed
help us for btrfs, too ...

> Ideally, we won't need a zbc interface in the kernel, except to
> erase zones. I'd like to see an interface that doesn't even require
> that. For example, we issue a discard (TRIM) on an entire  zone and
> that erases it and resets the write pointer. This way we need no new
> infrastructure at the filesystem layer to implement SMR awareness.
> In effect, the kernel isn't even aware that it's an SMR drive
> underneath it.

While this is certainly appealing, I doubt we can get away with it.
To ensure strict sequential ordering we would need to keep track of the
write pointer, which in turn requires us to have a zone tree, too.
But I might be persuaded otherwise here.
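
The check itself would be cheap once that tree exists; roughly, reusing
the blk_zone sketch from above (hypothetical names, not prototype code):

#include <linux/types.h>

/*
 * Sketch: with a cached write pointer per zone, enforcing sequential
 * ordering is a comparison before the write is issued and an update
 * once it completes.
 */
static bool zone_write_allowed(const struct blk_zone *zone, sector_t sector)
{
	/* sequential zones must be written exactly at the write pointer */
	return sector == zone->wp;
}

static void zone_write_done(struct blk_zone *zone, sector_t nr_sectors)
{
	zone->wp += nr_sectors;
}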

Other than that, really nifty.

I even managed to wrangle it into my presentation here at CLT; it went
over without much discussion, at least on the xfs side. Plenty of
discussion points with btrfs and ext4, but on the xfs front there seemed
to be only one knowledgeable person in the room.
And Jörg Schilling (sigh), so we got to discuss plenty of zfs stuff :-(
But I got him to admit that even he didn't know zfs _that_ well.
Small victory on my side :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@xxxxxxx			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)