On Sun, Mar 22, 2015 at 11:53:11AM +0100, Hannes Reinecke wrote:
> Hi Dave,
>
> I finally got around to reading your paper, and here are some
> suggestions/fixes:
>
> > This assumes a userspace ZBC implementation such as libzbc will do
> > all the heavy lifting work of laying out the structure of the
> > filesystem, and that it will perform things like zone write pointer
> > checking/resetting before the filesystem is mounted.
>
> The prototype implementation I did mapped the 'RESET WRITE POINTER'
> command to the 'discard' functionality, so if mkfs issues a 'discard' on
> the disk we'll be fine.
> The representation of the zone tree is still being discussed, but the
> block layer will have knowledge of the zone layout, and this will be
> exported, too. Presumably via sysfs.

No way. We're already talking about tens of thousands of zones per
disk, and in future hundreds of thousands of zones. sysfs is an
*awful* interface for extracting that amount of information from the
kernel. This needs to be a structured binary interface, not a sysfs
interface.

> > Recent research has shown that 6TB Seagate drives have a 20-25GB
> > CMR zone, which is more than enough for our purposes. Information
> > from other vendors indicates that some drives will have much more
> > CMR, hence if we design for the known sizes in the Seagate drives
> > we will be fine for other drives just coming onto the market
> > right now.
>
> Please, cut out this paragraph. _NONE_ of the disks I've been working
> with had such small zones, and even the Seagate one had identical zone
> sizes. While it might be true, the information above is restricted to a
> single drive type from a single manufacturer, and is in no way relevant
> to any other SMR drive.

It's indicative of the problem space, as documented in public
research. I know that newer drives are slightly different, but the
basic layout is the same, just with larger capacity regions.
I can delete it, but then there is nothing to explain *why* we make
the assumption that the CMR region is contiguous and located at the
outer edge of the drive and is large enough for our purposes....

> The implementations I've seen all have an identical zone size, with a
> CMR zone at the beginning and the end of the disk (primarily to support
> GPT partition tables). There are provisions in the spec to have the last
> zone of a different size (to accommodate various disk sizes), but I've
> been advocating hard to have all zones of identical sizes.
> Let's see...

Don't really care from the XFS perspective - if we have different
zone sizes, we can handle it easily at mkfs time.

> > The log doesn't actually need to track the zone write pointer,
> > though log recovery will need to limit the recovery head to the
> > current write pointer of the lead zone. Modifications here are
> > limited to the function that finds the head of the log, and can
> > actually be used to speed up the search algorithm.
>
> Hmm. Can't we always align the log to start at the _start_ of the zone?
> IE restrict ourselves to the simple case of having two (or more) log
> zones, one active and one inactive, and always have the head of the log
> at the start of the zone?

I suspect you misunderstood what the "head" of the log is. It's not
the first block of the log, it's the active write pointer, i.e. where
the next transaction will be written to disk. The start of the log
would be aligned to a zone, but it's the zone awareness for making
space available through tail pushing and needing to "discard" parts
of the log so overwrite can occur that adds lots of nasty complexity
to the write path.

Recovery, however, is the hard part. Especially the part where we
zero the parts of the log that we've recovered so that we don't
recover them a second time if we crash before any other changes are
made. That can't be done if the log is in an SMR zone....

i.e.
putting the log in SMR zones is possible, but requires significant
rework of both the write and recovery algorithms. Far simpler just
to locate it in the CMR region and ignore the whole problem.....

> > What we need is a mechanism for tracking the location of zones
> > (i.e. start LBA), free space/write pointers within each zone,
> > and some way of keeping track of that information across mounts.
> > If we assign a real time bitmap/summary inode pair to each zone,
> > we have a method of tracking free space in the zone. We can
> > use the existing bitmap allocator with a small tweak (sequentially
> > ascending, packed extent allocation only) to ensure that newly
> > written blocks are allocated in a sane manner.
>
> That mechanism is already implemented in my prototype; the request queue
> contains an rbtree storing the zone layout and the write pointer.

Allocated/free space tracking needs to be done in a persistent
manner in the filesystem - we cannot rely on the dynamic information
pulled from the drive matching the internal state of the filesystem
after an unclean shutdown (e.g. crash, power fail). i.e. the rbtree
is not transactionally synchronised to the filesystem's persistent
domain.....

> > If we arrange zones into zone groups, we also have a method for
> > keeping new allocations out of regions we are re-organising. That
> > is, we need to be able to mark zone groups as "read only" so the
> > kernel will not attempt to allocate from them while the cleaner
> > is running and re-organising the data within the zones in a zone
> > group. This ZG also allows the cleaner to maintain some level of
> > locality to the data that it is re-arranging.
>
> The current ZBC spec already has provisions for the 'read-only' zone,
> so we could set the zone state to 'read-only' in the in-kernel zone
> representation for these kinds of operations. Or even add an internal
> zone state here.
My comments about zone group access controls have nothing to do with
the disk state - it is filesystem state necessary to prevent the
filesystem from allowing new allocations (and therefore writes) to
zones that we are actively working to defragment/rewrite. We still
need write access to zones in that group to perform these
operations...

> > Mkfs is going to have to integrate with the userspace zbc libraries
> > to query the layout of zones from the underlying disk and then do
> > some magic to lay out all the necessary metadata correctly. I don't
> > see there being any significant challenge to doing this, but we
> > will need a stable libzbc API to work with and it will need to be
> > packaged by distros.
>
> I'd rather define a kernel API here, as the zone information will
> need to be present in the kernel, too. At least for host-managed
> devices; for host-aware we might get away with not having it in-kernel,
> but since we'll have to have an in-kernel implementation anyway, we
> might as well use it for both types.

We don't need kernel ZBC access in XFS if we do the layout and
constrain the allocation algorithms correctly.

> > == Quantification of Random Write Zone Capacity
>
> That will pose a problem. The drives I've seen have a single CMR zone in
> front and another one at the end. So asking for 2G CMR is a bit much
> here. Things are not set in stone, but I doubt we'll be getting a
> significant increase here.
>
> Nevertheless, I'll put my feelers out. Having a 2G CMR zone would indeed
> help us for btrfs, too ...

2GB? That's way less than what we know is in these drives, and the
numbers I keep hearing for the upcoming generation of host managed
drives are around 10GB of CMR per 1TB of SMR capacity....

> > Ideally, we won't need a zbc interface in the kernel, except to
> > erase zones. I'd like to see an interface that doesn't even require
> > that. For example, we issue a discard (TRIM) on an entire zone and
> > that erases it and resets the write pointer.
> > This way we need no new
> > infrastructure at the filesystem layer to implement SMR awareness.
> > In effect, the kernel isn't even aware that it's an SMR drive
> > underneath it.
>
> While this is certainly appealing, I doubt we can get away with it.
> To ensure strict sequential ordering we would need to keep track of the
> write pointer, which in turn requires us to have a zone tree, too.
> But I might be persuaded otherwise here.

Keep in mind I'm talking about the XFS implementation, not what the
block layer might require... ;)

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
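
For illustration only, the filesystem-side state argued for in this
thread (a per-zone allocation pointer owned by the filesystem, packed
and sequentially ascending allocation, and zone groups that can be
closed to new allocations while a cleaner rewrites them) can be
sketched roughly as below. This is a minimal sketch, not XFS code:
the names Zone, ZoneGroup, alloc_blocks and for_cleaner are all
hypothetical, persistence and journalling are left out entirely, and
reset() merely stands in for the whole-zone discard discussed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Zone:
    start_lba: int       # first block of the zone
    nblocks: int         # zone size in blocks
    next_free: int = 0   # filesystem's own allocation pointer (offset in zone)

    def alloc(self, count: int) -> Optional[int]:
        """Packed, sequentially ascending allocation; returns the start LBA."""
        if count <= 0 or count > self.nblocks - self.next_free:
            return None
        lba = self.start_lba + self.next_free
        self.next_free += count
        return lba

    def reset(self) -> None:
        """Stand-in for a whole-zone discard that resets the write pointer."""
        self.next_free = 0


@dataclass
class ZoneGroup:
    zones: List[Zone] = field(default_factory=list)
    read_only: bool = False   # filesystem state, not drive zone state

    def alloc(self, count: int) -> Optional[int]:
        # Try zones in order; the first zone with enough room wins.
        for zone in self.zones:
            lba = zone.alloc(count)
            if lba is not None:
                return lba
        return None


def alloc_blocks(groups: List[ZoneGroup], count: int,
                 for_cleaner: Optional[int] = None) -> Optional[Tuple[int, int]]:
    """Return (group index, start LBA), or None if nothing fits.

    Normal allocations skip groups marked read-only. The cleaner passes
    the index of the group it is rewriting and may still allocate there.
    """
    if for_cleaner is not None:
        lba = groups[for_cleaner].alloc(count)
        return (for_cleaner, lba) if lba is not None else None
    for index, group in enumerate(groups):
        if group.read_only:
            continue
        lba = group.alloc(count)
        if lba is not None:
            return (index, lba)
    return None
```

With two groups, marking the first one read_only sends ordinary
allocations to the second, while a call with for_cleaner=0 can still
place blocks in the first. That is the distinction drawn above: the
flag is filesystem allocation policy, and the zones themselves remain
writable.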