[Cc linux-ext4 mailing list because we are planning to add extent-level
locking in ext4]

On Thu, Jan 24, 2013 at 12:26:07PM +0100, Jan Kara wrote:
> Hello,
>
> I'd like to discuss the idea of using range locking to serialize IO to a
> range of pages. I have POC patches implementing the locking and converting
> ext3 to use it. They pass xfstests and I plan to post them once I gather
> some basic performance data (how much does the range locking cost us).

I'd be interested in discussing this topic. I am currently working on the
extent status tree for the ext4 filesystem. The final goal of the extent
status tree is to implement extent-level locking (a form of range locking),
which would let us do parallel writes when different extents are
manipulated.

Regards,
                                                - Zheng

>
> Now to the details of the idea. There are several different motivations for
> implementing mapping range locking:
> a) Punch hole is currently racy wrt mmap (a page can be faulted in in the
>    punched range after page cache has been invalidated) leading to nasty
>    results such as fs corruption (we can end up writing to an already freed
>    block), user exposure of uninitialized data, etc. To fix this we need
>    some new mechanism of serializing hole punching and page faults.
> b) There is an uncomfortable number of mechanisms serializing various paths
>    manipulating pagecache and data underlying it. We have i_mutex, page lock,
>    checks for page beyond EOF in pagefault code, i_dio_count for direct IO.
>    Different pairs of operations are serialized by different mechanisms and
>    not all the cases are covered. Case (a) above is likely the worst but DIO
>    vs buffered IO isn't ideal either (we provide only limited consistency).
>    The range locking should somewhat simplify serialization of pagecache
>    operations. So i_dio_count can be removed completely, i_mutex to a certain
>    extent (we still need something for things like timestamp updates,
>    possibly for i_size changes although those can be dealt with I think).
> c) i_mutex doesn't allow any parallelism of operations using it and some
>    filesystems work around this for specific cases (e.g. DIO reads). Using
>    range locking allows for concurrent operations (e.g. writes, DIO) on
>    different parts of the file. Of course, range locking itself isn't
>    enough to make the parallelism possible. Filesystems still have to
>    somehow deal with the concurrency when manipulating inode allocation
>    data. But the range locking at least provides a common VFS mechanism for
>    the serialization VFS itself needs and it's up to each filesystem to
>    serialize more if it needs to.
>
> How it works:
>
> General idea is that range lock for range x-y prevents creation of pages in
> that range.
>
> In practice this means:
> All read paths adding page to page cache and grab_cache_page_write_begin()
> first take range lock for the index, then insert locked page, and finally
> unlock the range. See below on why buffered IO uses range locks on per-page
> basis.
>
> DIO gets range lock at the moment it submits bio for the range covering
> pages in the bio. Then pagecache is truncated and bio submitted. Range lock
> is unlocked once bio is completed.
>
> Punch hole for range x-y takes range lock for the range before truncating
> page cache and the lock is released after filesystem blocks for the range
> are freed.
>
> Truncate to size x is equivalent to punch hole for the range x - ~0UL.
>
> The reason why we take the range lock for buffered IO on per-page basis and
> for DIO for each bio separately is lock ordering with mmap_sem. Page faults
> need to instantiate page under mmap_sem. That establishes the ordering
> mmap_sem > range lock. Buffered IO takes mmap_sem when prefaulting pages so
> we cannot hold range lock at that moment. Similarly get_user_pages() in DIO
> code takes mmap_sem so we have to be sure not to hold range lock when
> calling that.
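
To make the semantics quoted above concrete, here is a minimal user-space
sketch in C. It is only a model under stated assumptions: every name in it
(range_lock_set, range_trylock, lock_punch_hole and so on) is hypothetical
and not taken from the POC patches, it uses a small fixed array rather than
whatever structure the patches use, and it fails on conflict instead of
sleeping the way a real range lock would. It shows only the conflict rule:
overlapping ranges exclude each other, disjoint ranges do not.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; none of these names come from the actual patches. */
#define MAX_HELD 16

struct range {
    unsigned long start, end;   /* inclusive page indices */
    bool used;
};

struct range_lock_set {
    struct range held[MAX_HELD];
};

static bool ranges_overlap(unsigned long s1, unsigned long e1,
                           unsigned long s2, unsigned long e2)
{
    return s1 <= e2 && s2 <= e1;
}

/*
 * Try to lock [start, end]. Returns a slot number >= 0 on success, or -1
 * if the range overlaps a held range or no slot is free. A real
 * implementation would block on conflict instead of failing.
 */
static int range_trylock(struct range_lock_set *set,
                         unsigned long start, unsigned long end)
{
    int free_slot = -1;

    for (int i = 0; i < MAX_HELD; i++) {
        if (!set->held[i].used) {
            if (free_slot < 0)
                free_slot = i;
            continue;
        }
        if (ranges_overlap(start, end,
                           set->held[i].start, set->held[i].end))
            return -1;
    }
    if (free_slot < 0)
        return -1;
    set->held[free_slot] = (struct range){ start, end, true };
    return free_slot;
}

static void range_unlock(struct range_lock_set *set, int slot)
{
    set->held[slot].used = false;
}

/* Buffered IO and page fault paths lock a single page index. */
static int lock_page_index(struct range_lock_set *set, unsigned long index)
{
    return range_trylock(set, index, index);
}

/* Punch hole for range x-y locks the whole range. */
static int lock_punch_hole(struct range_lock_set *set,
                           unsigned long x, unsigned long y)
{
    return range_trylock(set, x, y);
}

/* Truncate to size x is punch hole for the range x - ~0UL. */
static int lock_truncate(struct range_lock_set *set, unsigned long x)
{
    return range_trylock(set, x, ~0UL);
}
```

In this model, while lock_punch_hole(&set, 10, 20) is held, a call to
lock_page_index(&set, 15) fails, which corresponds to the page fault being
kept out of the punched range, whereas lock_page_index(&set, 5) succeeds
because the ranges are disjoint.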
>
>                                                               Honza
>
> --
> Jan Kara <jack@xxxxxxx>
> SUSE Labs, CR