On Tue, Mar 17, 2015 at 09:25:15AM -0400, Brian Foster wrote:
> On Mon, Mar 16, 2015 at 05:00:20PM +1100, Dave Chinner wrote:
> > Hi Folks,
> >
> > As I told many people at Vault last week, I wrote a document
> > outlining how we should modify the on-disk structures of XFS to
> > support host aware SMR drives on the (long) plane flights to Boston.
> >
> > TL;DR: not a lot of change to the XFS kernel code is required, no
> > specific SMR awareness is needed by the kernel code. Only
> > relatively minor tweaks to the on-disk format will be needed and
> > most of the userspace changes are relatively straight forward, too.
> >
> > The source for that document can be found in this git tree here:
> >
> > git://git.kernel.org/pub/scm/fs/xfs/xfs-documentation
> >
> > in the file design/xfs-smr-structure.asciidoc. Alternatively,
> > pull it straight from cgit:
> >
> > https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc
> >
> > Or there is a pdf version built from the current TOT on the xfs.org
> > wiki here:
> >
> > http://xfs.org/index.php/Host_Aware_SMR_architecture
> >
> > Happy reading!
> >
>
> Hi Dave,
>
> Thanks for sharing this. Here are some thoughts/notes/questions/etc.
> from a first pass. This is mostly XFS oriented and I'll try to break it
> down by section.
>
> I've also attached a diff to the original doc with some typo fixes and
> whatnot. Feel free to just fold it into the original doc if you like.
>
> == Concepts
>
> - With regard to the assumption that the CMR region is not spread around
> the drive, I saw at least one presentation at Vault that suggested
> otherwise (the skylight one iirc). That said, it was theoretical and
> based on a drive-managed drive. It is in no way clear to me whether that
> is something to expect for host-managed drives.

AFAIK, the CMR region is contiguous. The skylight paper spells it
out pretty clearly that it is a contiguous 20-25GB region on the
outer edge of the seagate drives.
Other vendors I've spoken to indicate that the region in host
managed drives is also contiguous and at the outer edge, and some
vendors have indicated they have much more of it than the seagate
drives analysed in the skylight paper.

If it is not contiguous, then we can use DM to make that problem go
away. i.e. use DM to stitch the CMR zones back together into a
contiguous LBA region. Then we can size AGs in the data device to
map to the size of the individual disjoint CMR regions, and we
have a neat, well aligned, isolated solution to the problem without
having to modify the XFS code at all.

> - It isn't clear to me here and in other places whether you propose to
> use the CMR regions as a "metadata device" or require some other
> randomly writeable storage to serve that purpose.

CMR as the "metadata device" if there is nothing else we can use.
I'd really like to see hybrid drives with the "CMR" zone being the
flash region in the drive....

> == Journal modifications
>
> - The tail->head log zeroing behavior on mount comes to mind here. Maybe
> the writes are still sequential and it's not a problem, but we should
> consider that with the proposition. It's probably not critical as we do
> have the out of using the cmr region here (as noted). I assume we can
> also cleanly relocate the log without breaking anything else (e.g., the
> current location is performance oriented rather than architectural,
> yes?).

We place the log anywhere in the data device LBA space. You might
want to go look up what L_AGNUM does in mkfs. :)

And if we can use the CMR region for the log, then that's what we'll
do - "no modifications required" is always the best solution.

> == Data zones
>
> - Will this actually support data overwrite or will that return error?

We'll support data overwrite. xfs_get_blocks() will need to detect
overwrite....

> - TBH, I've never looked at realtime functionality so I don't grok the
> high level approach yet. I'm wondering...
> have you considered a design
> based on reflink and copy-on-write?

Yes, I have. Complex, invasive and we don't even have basic reflink
infrastructure yet. Such a solution pushes us a couple of years
out, as opposed to having something before the end of the year...

> I know the current plan is to
> disentangle the reflink tree from the rmap tree, but my understanding is
> the reflink tree is still in the pipeline. Assuming we have that
> functionality, it seems like there's potential to use it to overcome
> some of the overwrite complexity.

There isn't much overwrite complexity - it's simply clearing bits
in a zone bitmap to indicate free space, allocating new blocks and
then rewriting bmbt extent records. It's fairly simple, really ;)

> Just as a handwaving example, use the
> per-zone inode to hold an additional reference to each allocated extent
> in the zone, thus all writes are handled as if the file had a clone. If
> the only reference drops to the zoneino, the extent is freed and thus
> stale wrt to the zone cleaner logic.
>
> I suspect we would still need an allocation strategy, but I expect we're
> going to have zone metadata regardless that will help deal with that.
> Note that the current sparse inode proposal includes an allocation range
> limit mechanism (for the inode record overlaps an ag boundary case),
> which could potentially be used/extended to build something on top of
> the existing allocator for zone allocation (e.g., if we had some kind of
> zone record with the write pointer that indicated where it's safe to
> allocate from). Again, just thinking out loud here.

Yup, but the bitmap allocator doesn't have support for many of the
btree allocator controls. It's a simple, fast, deterministic
allocator, and we only need it to track freed space in the zones
as all allocation from the zones is going to be sequential...

> == Zone cleaner
>
> - Paragraph 3 - "fixpel?" I would have just fixed this, but I can't
> figure out what it's supposed to say.
> ;)
>
> - The idea sounds sane, but the dependency on userspace for a critical
> fs mechanism sounds a bit scary to be honest. Is in kernel allocation
> going to throttle/depend on background work in the userspace cleaner in
> the event of low writeable free space?

Of course. ENOSPC always throttles ;)

I expect the cleaner will work a zone group at a time, locking new,
non-cleaner based allocations out of the zone group while it cleans
zones. This means the cleaner should always be able to make progress
w.r.t. ENOSPC - it gets triggered on a zone group before it runs out
of clean zones for freespace defrag purposes....

I also expect that the cleaner won't be used in many bulk storage
applications as data is never deleted. I also expect that XFS-SMR
won't be used for general purpose storage applications - that's what
solid state storage will be used for - and so the cleaner is not
something we need to focus a lot of time and effort on.

And the thing that distributed storage guys should love: if we put
the cleaner in userspace, then they can *write their own cleaners*
that are customised to their own storage algorithms.

> What if that userspace thing
> dies, etc.? I suppose an implementation with as much mechanism in libxfs
> as possible allows us greatest flexibility to go in either direction
> here.

If the cleaner dies or can't make progress, we ENOSPC. Whether the
cleaner is in kernel or userspace is irrelevant to how we handle
such cases.

> - I'm also wondering how much real overlap there is in xfs_fsr (another
> thing I haven't really looked at :) beyond that it calls swapext.
> E.g., cleaning a zone sounds like it must map back to N files that could
> have allocated extents in the zone vs. considering individual files for
> defragmentation, fragmentation of the parent file may not be as much of
> a consideration as resetting zones, etc. It sounds like a separate tool
> might be warranted, even if there is code to steal from fsr.
> :)

As I implied above, zone cleaning is addressing exactly the same
problem as we are currently working on in xfs_fsr: defragmenting
free space.

> == Reverse mapping btrees
>
> - This is something I still need to grok, perhaps just because the rmap
> code isn't available yet. But I'll note that this does seem like
> another bit that could be unnecessary if we could get away with using
> the traditional allocator.
>
> == Mkfs
>
> - We have references to the "metadata device" as well as random write
> regions. Similar to my question above, is there an expectation of a
> separate physical metadata device or is that terminology for the random
> write regions?

"metadata device" == "data device" == "CMR" == "random write region"

> Finally, some general/summary notes:
>
> - Some kind of data structure outline would eventually make a nice
> addition to this document. I understand it's probably too early yet,
> but we are talking about new per-zone inodes, new and interesting
> relationships between AGs and zones (?), etc. Fine grained detail is not
> required, but an outline or visual that describes the high-level
> mappings goes a long way to facilitate reasoning about the design.

Sure, a plane flight is not long enough to do this. Future
revisions, as the structure is clarified.

> - A big question I had (and something that is touched on down thread wrt
> to embedded flash) is whether the random write zones are runtime
> configurable. If so, couldn't this facilitate use of existing AG
> metadata (now that I think of it, it's not clear to me whether the
> realtime mechanism excludes or coexists with AGs)?

The "realtime device" contains only user data. It contains no
filesystem metadata at all. That separation of user data and
filesystem metadata is what makes it so appealing for supporting SMR
devices....

> IOW, we obviously
> need this kind of space for inodes, dirs, xattrs, btrees, etc.
> regardless.
> It would be interesting if we had the added flexibility to
> align it with AGs.

I'm trying to keep the solution as simple as possible. No alignment,
single whole disk only, metadata in the "data device" on CMR and
user data in "real time" zones on SMR.

> diff --git a/design/xfs-smr-structure.asciidoc b/design/xfs-smr-structure.asciidoc
> index dd959ab..2fea88f 100644

Oh, there's a patch. Thanks! ;)

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs