On 21/06/02 09:15AM, Dave Chinner wrote: > On Mon, May 31, 2021 at 11:28:25PM +0530, riteshh wrote: > > On 21/05/19 11:19AM, Dave Chinner wrote: > > > --- a/fs/xfs/xfs_file.c > > > +++ b/fs/xfs/xfs_file.c > > > @@ -384,21 +384,30 @@ xfs_file_write_checks( > > > } > > > goto restart; > > > } > > > + > > > /* > > > * If the offset is beyond the size of the file, we need to zero any > > > * blocks that fall between the existing EOF and the start of this > > > - * write. If zeroing is needed and we are currently holding the > > > - * iolock shared, we need to update it to exclusive which implies > > > - * having to redo all checks before. > > > + * write. If zeroing is needed and we are currently holding the iolock > > > + * shared, we need to update it to exclusive which implies having to > > > + * redo all checks before. > > > + * > > > + * We need to serialise against EOF updates that occur in IO completions > > > + * here. We want to make sure that nobody is changing the size while we > > > + * do this check until we have placed an IO barrier (i.e. hold the > > > + * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched. The > > > + * spinlock effectively forms a memory barrier once we have the > > > + * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and > > > + * hence be able to correctly determine if we need to run zeroing. > > > * > > > - * We need to serialise against EOF updates that occur in IO > > > - * completions here. We want to make sure that nobody is changing the > > > - * size while we do this check until we have placed an IO barrier (i.e. > > > - * hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched. > > > - * The spinlock effectively forms a memory barrier once we have the > > > - * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value > > > - * and hence be able to correctly determine if we need to run zeroing. > > > + * We can do an unlocked check here safely as IO completion can only > > > + * extend EOF. Truncate is locked out at this point, so the EOF can > > > + * not move backwards, only forwards. Hence we only need to take the > > > + * slow path and spin locks when we are at or beyond the current EOF. > > > */ > > > + if (iocb->ki_pos <= i_size_read(inode)) > > > + goto out; > > > + > > > spin_lock(&ip->i_flags_lock); > > > isize = i_size_read(inode); > > > if (iocb->ki_pos > isize) { > > > > Hello Dave/Jan, > > > > Sorry about some silly queries here. But locking sometimes can get confusing and > > needs a background context/history. > > > > So, > > I was going through the XFS DIO path and I couldn't completely get this below > > difference between xfs_file_dio_write_unaligned() v/s > > xfs_file_dio_write_aligned() checks for taking xfs iolock (inode rwsem) > > with different exclusivity(exclusive v/s shared). > > > > I in xfs_**_unaligned() function, we also check if (ki_pos + count >= isize()). > > If yes, then we go for an exclusive iolock. > > While in xfs_**_aligned() function, we always take shared iolock. > > > > Can you please help me understand why is that? In case of an extending aligned > > write, won't we need an exclusive iolock for XFS? > > No. Extending the file is a slowpath operation which requires > exclusive locking. We always take the shared lock first if we can > because that's the normal fast path operation and so we optimise for > that case. > > In the aligned DIO case, we check for sub-block EOF zeroing in > xfs_file_write_checks(). If needed, we upgrade the lock to exclusive > while the EOF zeroing is done. Once we return back to the aligned IO > code, we'll demote that exclusive lock back to shared for the block > aligned IO that we are issuing. > > > Or IIUC, this exclusive lock is mostly needed to prevent two sub-bock zeroing > > from running in parallel (which if this happens could cause corruption) > > and this can only happen with unaligned writes. > > The exclusive lock is needed for serialising zeroing operations, > whether it be zeroing for EOF extension or sub-block zeroing for > unaligned writes. > > The reason for the EOF checks in the unaligned case is right there > in the comment above the EOF checks: > > /* > * Extending writes need exclusivity because of the sub-block zeroing > * that the DIO code always does for partial tail blocks beyond EOF, so > * don't even bother trying the fast path in this case. > */ > > IOWs, there is no possible "fast path" shared locking case for > unaligned extending DIOs, so we just take the exlusive lock right > from the start. > > > Whereas, I guess ext4, still does exclusive lock even with aligned extending > > writes, possibly because of updation of inode->i_size and orphan inode > > handling requires it to take exclusive inode rwsem. > > > > While for XFS inode->i_size updation happens with a different spinlock which is > > ip->i_flags_lock. > > XFS is serialising DIO completion against DIO submission here rather > than anything else to do with inode size updates which are, > generally speaking, serialised at a higher level by various > combinations of i_rwsem, MMAPLOCK, ILOCK and inode_dio_wait(). Thanks a lot Dave for detailed explaination about this. This makes things quite clear now from XFS side. -ritesh > > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx