On Fri 08-11-19 14:12:38, Jan Kara wrote: > On Mon 21-10-19 15:49:31, Ira Weiny wrote: > > On Mon, Oct 21, 2019 at 11:45:36AM +1100, Dave Chinner wrote: > > > On Sun, Oct 20, 2019 at 08:59:35AM -0700, ira.weiny@xxxxxxxxx wrote: > > > That, fundamentally, is the issue here - it's not setting/clearing > > > the DAX flag that is the issue, it's doing a swap of the > > > mapping->a_ops while there may be other code using that ops > > > structure. > > > > > > IOWs, if there is any code anywhere in the kernel that > > > calls an address space op without holding one of the three locks we > > > hold here (i_rwsem, MMAPLOCK, ILOCK) then it can race with the swap > > > of the address space operations. > > > > > > By limiting the address space swap to file sizes of zero, we rule > > > out the page fault path (mmap of a zero length file segv's with an > > > access beyond EOF on the first read/write page fault, right?). > > > > Yes I checked that and thought we were safe here... > > > > > However, other aops callers that might run unlocked and do the wrong > > > thing if the aops pointer is swapped between check of the aop method > > > existing and actually calling it even if the file size is zero? > > > > > > A quick look shows that FIBMAP (ioctl_fibmap())) looks susceptible > > > to such a race condition with the current definitions of the XFS DAX > > > aops. I'm guessing there will be others, but I haven't looked > > > further than this... > > > > I'll check for others and think on what to do about this. ext4 will have the > > same problem I think. :-( > > Just as a datapoint, ext4 is bold and sets inode->i_mapping->a_ops on > existing inodes when switching journal data flag and so far it has not > blown up. What we did to deal with issues Dave describes is that we > introduced percpu rw-semaphore guarding switching of aops and then inside > problematic functions redirect callbacks in the right direction under this > semaphore. Somewhat ugly but it seems to work. Thinking about this some more, perhaps this scheme could be actually transformed in something workable. We could have a global (or maybe per-sb but I'm not sure it's worth it) percpu rwsem and we could transform aops calls into: percpu_down_read(aops_rwsem); do_call(); percpu_up_read(aops_rwsem); With some macro magic it needn't be even that ugly. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR