On Tue, Dec 03, 2013 at 05:14:15PM -0600, Ben Myers wrote: > On Wed, Dec 04, 2013 at 09:55:20AM +1100, Dave Chinner wrote: > > On Tue, Dec 03, 2013 at 03:29:51PM -0600, Ben Myers wrote: > > > Although it was removed in commit 051e7cd44ab8, ilock needs to be taken in > > > xfs_readdir because we might have to read the extent list in from disk. This > > > keeps other threads from reading from or writing to the extent list while it is > > > being read in and is still in a transitional state. > > > > > > This has been associated with "Access to block zero" messages on directories > > > with large numbers of extents resulting from excessive filesytem fragmentation, > > > as well as extent list corruption. Unfortunately no test case at this point. > > > > That commit was from 2007, so this is not a result of a recent > > change. As it is, access to directories is completely > > serialised by the VFS i_mutex. We don't need the iolock to read in > > the extent list for a directory, because by the time we get to > > readdir it is already guaranteed to be read in by this code: > > Gah. Typo in my subject line. That's the ilock, not the iolock. Whenever we > touch the extent list we want to have the ilock at some level. I just copied what you said. ;) > > STATIC int > > xfs_dir_open( > > struct inode *inode, > > struct file *file) > > { > > struct xfs_inode *ip = XFS_I(inode); > > int mode; > > int error; > > > > error = xfs_file_open(inode, file); > > if (error) > > return error; > > > > /* > > * If there are any blocks, read-ahead block 0 as we're almost > > * certain to have the next operation be a read there. > > */ > > mode = xfs_ilock_map_shared(ip); > > if (ip->i_d.di_nextents > 0) > > xfs_dir3_data_readahead(NULL, ip, 0, -1); > > xfs_iunlock(ip, mode); > > return 0; > > } > > > > Which means that for a directory in block/leaf/node form the open of > > it prior to the first readdir() call will read the extent tree into > > memory and it will be pinned in memory for at least the life of the > > file descriptor that is open on the directory. > > That readahead _may_ read some of the blocks from disk but it doesn't copy them > into the incore extent list. Readahead reads data blocks based on their file offset. How do we find the data block to read from the file offset? It looks it up in the extent list via xfs_bmapi_read(), which reads the entire extent map from disk and populates the incore extent tree in it's entirity if it is not present. i.e. xfs_dir3_data_readahead xfs_da_reada_buf xfs_dabuf_map if (mappedbno == -1) xfs_bmapi_read if (!XFS_IFEXTENTS) xfs_iread_extents set XFS_IFEXTENTS xfs_bmap_read_extents <reads extent tree from disk> <populate incore extent list> <map offset 0 to filesystem block> mappedbno = filesystem block from xfs_bmapi_read xfs_buf_reada(mappedbno) (reads directory data from disk) Hence xfs_dir_open() has the correct locking and ensures that we don't need to use the extent list lock for any read operations through the VFS because all read operations are serialised against modifications to the directory the i_mutex at the VFS. > That's what we need to protect from other readers > who don't take the i_mutex. Will see if I can find an example. The only way I can see there being a problem is if the directory extent tree is being modified without the directory i_mutex being held. AFAICT, that can't happen from the VFS, so maybe there's something that is coming from the ioctl interfaces - the only thing I can see there is xfs_attrmulti_attr_set() causing a directory inode data fork to move from extent to btree format as an attribute fork is added, but that wouldn't cause problems with existing directories with large extent counts. Also, nothing in xfstests/xfsprogs uses XFS_IOC_ATTRMULTI_BY_HANDLE or the libhandle functions that call this ioctl, so it would seem to me that the only thing that might use this function to set attributes on files is DMF..... Hmmm, that makes me wonder about whether this is a CXFS problem, because it has hooks deep in the XFS code below the VFS. Last time I looked, the MDS completely bypassed the VFS for directory modifications driven by clients. That would explain why nobody has seen this problem on mainline kernels for the best part of 6 years, and also explain why the ilock is necessary across the readdir operation to prevent the extent list problems from being seen.... Ben - if this is a DMF or CXFS problem, just say so. Some of us know an awful lot about how those products work, so there's no point wasting my time trying to dance around why a change needs to be made without mentioning those products. The simple fact is that if we ever want to do concurrent directory read operations, we have to take the ilock in readdir() to ensure we can serialise correctly against modifications because the i_mutex can't be used to do that. So, really, I'm not against the fix you proposed - I'm just trying to understand why it is necessary right now.... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs