On Mon, Dec 19, 2011 at 12:43:43PM +1100, Dave Chinner wrote: > > We have a shitload of deadlocks on very common paths with that patch. What > > of the paths that do lead to file_remove_suid() without i_mutex? > > * xfs_file_aio_write_checks(): we drop i_mutex (via xfs_rw_iunlock()) > > just before calling file_remove_suid(). Racy, the fix is obvious - move > > file_remove_suid() call before unlocking. > > Not exactly. xfs_rw_iunlock() is not doing what you think it's doing > there..... Huh? It is called as > > - xfs_rw_iunlock(ip, XFS_ILOCK_EXCL); and thus in static inline void xfs_rw_iunlock( struct xfs_inode *ip, int type) { xfs_iunlock(ip, type); if (type & XFS_IOLOCK_EXCL) mutex_unlock(&VFS_I(ip)->i_mutex); } we are guaranteed to hit i_mutex. > Wrong lock. That's dropping the internal XFS inode metadata lock, > but the VFS i_mutex is associated with the internal XFS inode IO > lock, which is accessed via XFS_IOLOCK_*. Only if we take the iolock > via XFS_IOLOCK_EXCL do we actually take the i_mutex. > Now it gets complex. For buffered IO, we are guaranteed to already > be holding the i_mutex because we do: > > *iolock = XFS_IOLOCK_EXCL; > xfs_rw_ilock(ip, *iolock); > > ret = xfs_file_aio_write_checks(file, &pos, &count, new_size, iolock); > > So that is safe and non-racy right now. No, it is not - we *drop* it before calling file_remove_suid(). Explicitly. Again, look at that xfs_rw_iunlock() call there - it does drop i_mutex (which is to say, you'd better have taken it prior to that, or you have far worse problems). > For direct IO, however, we don't always take the IOLOCK exclusively. > Indeed, we try really, really hard not to do this so we can do > concurrent reads and writes to the inode, and that results > in a bunch of lock juggling when we actually need the IOLOCK > exclusive (like in xfs_file_aio_write_checks()). It sounds like we > need to know if we are going to have to remove the SUID bit ahead of > time so that we can take the correct lock up front. I haven't > looked at what is needed to do that yet. OK, I'm definitely missing something. The very first thing xfs_file_aio_write_checks() does is xfs_rw_ilock(ip, XFS_ILOCK_EXCL); which really makes me wonder how the hell does that manage to avoid an instant deadlock in case of call via xfs_file_buffered_aio_write() where we have: struct address_space *mapping = file->f_mapping; struct inode *inode = mapping->host; struct xfs_inode *ip = XFS_I(inode); *iolock = XFS_IOLOCK_EXCL; xfs_rw_ilock(ip, *iolock); ret = xfs_file_aio_write_checks(file, &pos, &count, new_size, iolock); which leads to struct inode *inode = file->f_mapping->host; struct xfs_inode *ip = XFS_I(inode); (IOW, inode and ip are the same as in the caller) followed by xfs_rw_ilock(ip, XFS_ILOCK_EXCL); and with both xfs_rw_ilock() calls turning into mutex_lock(&VFS_I(ip)->i_mutex); xfs_ilock(ip, XFS_ILOCK_EXCL); we ought to deadlock on that i_mutex. What am I missing and how do we manage to survive that? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>