On Mon 29-02-16 02:02:16, Damien Le Moal wrote:
>
> >On Wed 24-02-16 01:53:24, Damien Le Moal wrote:
> >>
> >> >On Tue 23-02-16 05:31:13, Damien Le Moal wrote:
> >> >>
> >> >> >On 02/22/16 18:56, Damien Le Moal wrote:
> >> >> >> 2) Write back of dirty pages to SMR block devices:
> >> >> >>
> >> >> >> Dirty pages of a block device inode are currently processed using the
> >> >> >> generic_writepages function, which can be executed simultaneously
> >> >> >> by multiple contexts (e.g. sync, fsync, msync, sync_file_range, etc).
> >> >> >> Mutual exclusion of the dirty page processing being achieved only at
> >> >> >> the page level (page lock & page writeback flag), multiple processes
> >> >> >> executing a "sync" of overlapping block ranges over the same zone of
> >> >> >> an SMR disk can cause an out-of-LBA-order sequence of write requests
> >> >> >> being sent to the underlying device. On a host managed SMR disk, where
> >> >> >> sequential write to disk zones is mandatory, this results in errors and
> >> >> >> the impossibility for an application using raw sequential disk write
> >> >> >> accesses to be guaranteed successful completion of its write or fsync
> >> >> >> requests.
> >> >> >>
> >> >> >> Using the zone information attached to the SMR block device queue
> >> >> >> (introduced by Hannes), calls to the generic_writepages function can
> >> >> >> be made mutually exclusive on a per zone basis by locking the zones.
> >> >> >> This guarantees sequential request generation for each zone and avoids
> >> >> >> write errors without any modification to the generic code implementing
> >> >> >> generic_writepages.
> >> >> >>
> >> >> >> This is but one possible solution for supporting SMR host-managed
> >> >> >> devices without any major rewrite of page cache management and
> >> >> >> write-back processing. The opinion of the audience regarding this
> >> >> >> solution and discussing other potential solutions would be greatly
> >> >> >> appreciated.
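The per-zone locking idea in the proposal above can be illustrated with a
small userspace model. This is only a sketch under stated assumptions, not
the patch discussed in this thread: it uses no kernel API, and zone_model,
zone_trylock, device_write, writeback_dirty, the 4-zone device and the
8-sector zone size are all hypothetical names and values chosen for
illustration. A zone that is already locked is skipped and its dirty
sectors are left for a later pass, much like a locked page is skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model parameters; real SMR zones are far larger. */
#define NR_ZONES     4
#define ZONE_SECTORS 8ULL

struct zone_model {
    bool     locked;  /* a write-back context currently owns the zone */
    uint64_t wp;      /* write pointer, as an offset inside the zone */
};

static struct zone_model zones[NR_ZONES];

static size_t zone_of(uint64_t sector)
{
    return (size_t)(sector / ZONE_SECTORS);
}

/* Non-blocking lock: returns false if another context holds the zone. */
static bool zone_trylock(size_t z)
{
    assert(z < NR_ZONES);
    if (zones[z].locked)
        return false;
    zones[z].locked = true;
    return true;
}

static void zone_unlock(size_t z)
{
    zones[z].locked = false;
}

/* Device model: a write succeeds only at the zone's write pointer. */
static bool device_write(uint64_t sector)
{
    size_t z = zone_of(sector);

    if (sector % ZONE_SECTORS != zones[z].wp)
        return false;  /* models an "unaligned write" error */
    zones[z].wp++;
    return true;
}

/*
 * Write back dirty sectors (sorted ascending), one zone at a time.
 * Returns the number of sectors written, or -1 on a device error.
 */
static int writeback_dirty(const uint64_t *dirty, size_t n)
{
    int written = 0;
    size_t i = 0;

    while (i < n) {
        size_t z = zone_of(dirty[i]);
        bool owned = zone_trylock(z);

        /* Walk every dirty sector of zone z before moving on. */
        for (; i < n && zone_of(dirty[i]) == z; i++) {
            if (!owned)
                continue;  /* zone busy: leave its sectors dirty */
            if (!device_write(dirty[i])) {
                zone_unlock(z);
                return -1;
            }
            written++;
        }
        if (owned)
            zone_unlock(z);
    }
    return written;
}
```

In this model a caller that finds a zone locked simply writes nothing for
it, so two contexts can never interleave requests within one zone. Whether
the same holds in the kernel depends on every path that can lock or write
back a page, which is what the rest of the thread is about.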
> >> >> >
> >> >> >Hello Damien,
> >> >> >
> >> >> >Is it sufficient to support filesystems like BTRFS on top of SMR drives
> >> >> >or would you also like to see that filesystems like ext4 can use SMR
> >> >> >drives? In the latter case: the behavior of SMR drives differs so
> >> >> >significantly from that of other block devices that I'm not sure that we
> >> >> >should try to support these directly from infrastructure like the page
> >> >> >cache. If we look e.g. at NAND SSDs then we see that the characteristics
> >> >> >of NAND do not match what filesystems expect (e.g. large erase blocks).
> >> >> >That is why every SSD vendor provides an FTL (Flash Translation Layer),
> >> >> >either inside the SSD or as a separate software driver. An FTL
> >> >> >implements a so-called LFS (log-structured filesystem). With what I know
> >> >> >about SMR this technology looks also suitable for implementation of a
> >> >> >LFS. Has it already been considered to implement an LFS driver for SMR
> >> >> >drives? That would make it possible for any filesystem to access an SMR
> >> >> >drive as any other block device. I'm not sure of this but maybe it will
> >> >> >be possible to share some infrastructure with the LightNVM driver
> >> >> >(directory drivers/lightnvm in the Linux kernel tree). This driver
> >> >> >namely implements an FTL.
> >> >>
> >> >> I totally agree with you that trying to support SMR disks by only modifying
> >> >> the page cache so that unmodified standard file systems like BTRFS or ext4
> >> >> remain operational is not realistic at best, and more likely simply impossible.
> >> >> For this kind of use case, as you said, an FTL or a device mapper driver are
> >> >> much more suitable.
> >> >>
> >> >> The case I am considering for this discussion is for raw block device accesses
> >> >> by an application (writes from user space to /dev/sdxx).
> >> >> This is a very likely use case scenario for high capacity SMR disks with
> >> >> applications like distributed object stores / key value stores.
> >> >>
> >> >> In this case, write-back of dirty pages in the block device file inode mapping
> >> >> is handled in fs/block_dev.c using the generic helper function generic_writepages.
> >> >> This does not guarantee the generation of the required sequential write pattern
> >> >> per zone necessary for host-managed disks. As I explained, aligning calls of this
> >> >> function to zone boundaries while locking the zones under write-back simply
> >> >> solves the problem (implemented and tested). This is of course only one possible
> >> >> solution. Pushing modifications deeper in the code or providing a
> >> >> "generic_sequential_writepages" helper function are other potential solutions
> >> >> that in my opinion are worth discussing as other types of devices may benefit also
> >> >> in terms of performance (e.g. regular disk drives prefer sequential writes, and
> >> >> SSDs as well) and/or lighten the overhead on an underlying FTL or device mapper
> >> >> driver.
> >> >>
> >> >> For a file system, an SMR compliant implementation of a file inode mapping
> >> >> writepages method should be provided by the file system itself as the sequentiality
> >> >> of the write pattern depends further on the block allocation mechanism of the file
> >> >> system.
> >> >>
> >> >> Note that the goal here is not to hide the sequential write constraint of
> >> >> SMR disks from applications. The page cache itself (the mapping of the block
> >> >> device inode) remains unchanged. But the modification proposed guarantees that
> >> >> a well behaved application writing sequentially to zones through the page cache
> >> >> will see successful sync operations.
> >> >
> >> >So the easiest solution for the OS, when the application is already aware
> >> >of the storage constraints, would be for an application to use direct IO.
> >> >Because when using page-cache and writeback there are all sorts of
> >> >unexpected things that can happen (e.g. writeback decides to skip a page
> >> >because someone else locked it temporarily). So it will work in 99.9% of
> >> >cases but sometimes things will be out of order for hard-to-track down
> >> >reasons. And for ordinary drives this is not an issue because we just slow
> >> >down writeback a bit but the rareness of this makes it a non-issue. But for
> >> >host managed SMR the IO fails and that is something the application does not
> >> >expect.
> >> >
> >> >So I would really say just avoid using page-cache when you are using SMR
> >> >drives directly without a translation layer. For writes your throughput
> >> >won't suffer anyway since you have to do big sequential writes. Using
> >> >page-cache for reads may still be beneficial and if you are careful enough
> >> >not to do direct IO writes to the same range as you do buffered reads, this
> >> >will work fine.
> >> >
> >> >Thinking some more - if you want to make it foolproof, you could implement
> >> >something like read-only page cache for block devices. Any write will be in
> >> >fact direct IO write, writeable mmaps will be disallowed, reads will honor
> >> >O_DIRECT flag.
> >>
> >> Hi Jan,
> >>
> >> Indeed, using O_DIRECT for raw block device write is an obvious solution to
> >> guarantee the application successful sequential writes within a zone. However,
> >> host-managed SMR disks (and to a lesser extent host-aware drives too) already
> >> put on applications the constraint of ensuring sequential writes. Adding to this
> >> a further mandatory rewrite to support direct I/Os is in my opinion asking a lot,
> >> if not too much.
> >
> >So I don't think adding O_DIRECT to open flags is such a burden -
> >sequential writes are IMO much harder to do :).
> >And furthermore this could happen magically inside the kernel in which case
> >the app needn't be aware of this at all (similarly to how we handle writes
> >to persistent memory).
> >
> >> The example you mention above of writeback skipping a locked page and resulting
> >> in I/O errors is precisely what the proposed patch avoids by first locking the
> >> zone the page belongs to. In the same spirit as the writeback page locking, if
> >> the zone is already locked, it is skipped. That is, zones are treated in a sense
> >> as gigantic pages, ensuring that the actual dirty pages within each one are
> >> processed in one go, sequentially.
> >
> >But you cannot rule out mm subsystem locking a page to do something (e.g.
> >migrate the page to help with compaction of large order pages). These other
> >places accessing and locking pages are what I'm worried about. Furthermore
> >kswapd can decide to write back a particular page under memory pressure and
> >that will just make SMR disk freak out.
> >
> >> This allows preserving all possible application level accesses (buffered,
> >> direct or mmapped). The only constraint is the one the disk imposes:
> >> writes must be sequential.
> >>
> >> Granted, this view may be too simplistic and may be overlooking some hard
> >> to track page locking paths which will compete with this. But I think
> >> that this can be easily solved by forcing the zone-aligned
> >> generic_writepages calls to not skip any page (a flag in struct
> >> writeback_control would do the trick). And no modification is necessary
> >> on the read side (i.e. page locking only is enough) since reading an SMR
> >> disk's blocks after a zone write-pointer position does not make sense (in
> >> Hannes' code, this is possible, but the request does not go to the disk
> >> and returns garbage data).
> >>
> >> Bottom line: no fundamental change to the page caching mechanism, only
> >> how it is being used/controlled for writeback makes this work.
> >> Considering the benefits on the application side, it is in my opinion a
> >> valid modification to have.
> >
> >See above, there are quite a few places which will break your assumptions.
> >And I don't think changing them all to handle SMR is worth it. IMO caching
> >sequential writes to SMR disks has low effect (if any) anyway so I would
> >just avoid that. We can talk about how to make this as seamless to
> >applications as possible. The only thing which I don't think is reasonably
> >doable without dirtying pagecache are writeable mmaps of an SMR device so
> >applications would have to avoid that.
>
> Jan,
>
> Thank you for your insight.
> These "few places" breaking sequential write sequences are indeed
> problematic for SMR drives. At the same time, I wonder how these paths
> would react to an I/O error generated by the check "write at write
> pointer" in the request submission path at the SCSI level. Could these be
> ignored in the case of an "unaligned write error"? That is, the page is
> left dirty and hopefully the regular writeback path catches them later in
> the proper sequence.

You'd hope ;) But in fact what happens is that the page ends up being clean,
marked as having an error, and buffers will not be uptodate => you have just
lost one page's worth of data. See what happens in end_buffer_async_write().
Our behavior in the presence of IO errors has needed improvement for a long
time, so you are certainly welcome to improve on this, but what I described
is what happens now.

								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
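
The outcome described in the reply above (page clean, error flagged,
buffers not uptodate) can be pictured with a toy state model. This is a
sketch of the behavior as described in this message, not kernel code: the
struct toy_page and the toy_* functions are hypothetical, and each flag is
only a stand-in for the corresponding page, buffer or mapping state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for page, buffer and mapping state bits. */
struct toy_page {
    bool dirty;            /* page still queued for write-back */
    bool writeback;        /* write-back IO in flight */
    bool error;            /* page marked as having an error */
    bool buffer_uptodate;  /* buffer contents considered valid */
    bool mapping_eio;      /* error recorded on the mapping */
};

/* Starting write-back clears the dirty state before the IO is issued. */
static void toy_start_writeback(struct toy_page *p)
{
    assert(p->dirty);
    p->dirty = false;
    p->writeback = true;
}

/*
 * Completion handler model: on failure the page is NOT redirtied.
 * It stays clean, gets an error mark, and its buffer is no longer
 * uptodate, which is the situation described in the message above.
 */
static void toy_end_async_write(struct toy_page *p, bool io_ok)
{
    if (io_ok) {
        p->buffer_uptodate = true;
    } else {
        p->mapping_eio = true;
        p->buffer_uptodate = false;
        p->error = true;
    }
    p->writeback = false;
}

/* Nothing will rewrite the page and its cached contents are invalid. */
static bool toy_data_lost(const struct toy_page *p)
{
    return !p->dirty && !p->writeback && !p->buffer_uptodate;
}
```

In this model a failed "unaligned write" leaves nothing for a later
write-back pass to pick up, since the page is no longer dirty. Leaving the
page dirty on that specific error, as suggested in the quoted question,
would mean changing the failure branch, which the real completion path does
not do according to the description above.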