On 08/11/2015 12:37 PM, Dave Chinner wrote: > On Tue, Aug 11, 2015 at 10:19:09AM +0200, Jan Kara wrote: >> On Mon 10-08-15 18:14:24, Kirill A. Shutemov wrote: >>> As we don't have struct pages for DAX memory, Matthew had to find an >>> replacement for lock_page() to avoid fault vs. truncate races. >>> i_mmap_lock was used for that. >>> >>> Recently, Matthew had to convert locking to exclusive to address fault >>> vs. fault races. And this kills scalability completely. > > I'm assuming this locking change is in a for-next git tree branch > somewhere as there isn't anything that I can see in a 4.2-rc6 > tree. Can you point me to the git tree that has these changes in it? > >>> The patch below tries to recover some scalability for DAX by introducing >>> per-mapping range lock. >> >> So this grows noticeably (3 longs if I'm right) struct address_space and >> thus struct inode just for DAX. That looks like a waste but I don't see an >> easy solution. >> >> OTOH filesystems in normal mode might want to use the range lock as well to >> provide truncate / punch hole vs page fault exclusion (XFS already has a >> private rwsem for this and ext4 needs something as well) and at that point >> growing generic struct inode would be acceptable for me. > > It sounds to me like the way DAX has tried to solve this race is the > wrong direction. We really need to drive the truncate/page fault > serialisation higher up the stack towards the filesystem, not deeper > into the mm subsystem where locking is greatly limited. > > As Jan mentions, we already have this serialisation in XFS, and I > think it would be better first step to replicate that locking model > in each filesystem that is supports DAX. I think this is a better > direction because it moves towards solving a whole class of problems > fileystem face with page fault serialisation, not just for DAX. > >> My grand plan was to use the range lock to also simplify locking >> rules for read, write and esp. direct IO but that has issues with >> mmap_sem ordering because filesystems get called under mmap_sem in >> page fault path. So probably just fixing the worst issue with >> punch hole vs page fault would be good for now. > > Right, I think adding a rwsem to the ext4 inode to handle the > fault/truncate serialisation similar to XFS would be sufficient to > allow DAX to remove the i_mmap_lock serialisation... > >> Also for a patch set like this, it would be good to show some numbers - how >> big hit do you take in the single-threaded case (the lock is more >> expensive) and how much scalability do you get in the multithreaded case? > > And also, if you remove the serialisation and run the test on XFS, > what do you get in terms of performance and correctness? > > Cheers, > Cheers indeed It is very easy to serialise at the FS level which has control over the truncate path as well. Is much harder and uglier on the mm side. Currently there is DAX for xfs, and ext2, ext4. With xfs's implementation I think removing the lock completely will just work. If I read the code correctly. With ext2/4 you can at worse add the struct range_lock_tree to its private inode structure and again only if dax is configured. And use that at the fault wrappers. But I suspect there should be an easier way once at the ext2/4 code level, and no need for generalization. I think the xfs case, of "no locks needed at all", calls for removing the locks at the mm/dax level and moving them up to the FS. > Dave. > Thanks Boaz -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html