On Fri, Jan 17, 2025 at 11:19 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Fri, Jan 17, 2025 at 02:27:46PM +0100, Amir Goldstein wrote:
> > On Wed, Jan 15, 2025 at 10:41 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > >
> > > On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote:
> > > > On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote:
> > > > > Sorry if this is out of left field as I haven't followed the discussion
> > > > > closely, but I presumed one of the reasons Darrick and Christoph raised
> > > > > the idea of using the folio batch thing I'm playing around with on zero
> > > > > range for buffered writes would be to acquire and lock all targeted
> > > > > folios up front. If so, would that help with what you're trying to
> > > > > achieve here? (If not, nothing to see here, move along.. ;).
> > > >
> > > > I mostly thought about acquiring, as locking doesn't really have much
> > > > batching effect. That being said, now that you got the idea in my mind
> > > > here's my early morning brainfart on it:
> > > >
> > > > Let's ignore DIRECT I/O for the first step. In that case lookup /
> > > > allocation and locking all folios for write before copying data will
> > > > remove the need for i_rwsem in the read and write path. In a way that
> > > > sounds perfect, and given that btrfs already does that (although in a
> > > > very convoluted way) we know it's possible.
> > >
> > > Yes, this seems like a sane, general approach to allowing concurrent
> > > buffered writes (and reads).
> > >
> > > > But direct I/O throws a big monkey wrench here as already mentioned by
> > > > others. Now one interesting thing some file systems have done is
> > > > to serialize buffered against direct I/O, either by waiting for one
> > > > to finish, or by simply forcing buffered I/O when direct I/O would
> > > > conflict.
> > >
> > > Right. We really don't want to downgrade to buffered IO if we can
> > > help it, though.
> > >
> > > > It's easy to detect outstanding direct I/O using i_dio_count
> > > > so buffered I/O could wait for that, and downgrading to buffered I/O
> > > > (potentially using the new uncached mode from Jens) if there are any
> > > > pages on the mapping after the invalidation also sounds pretty doable.
> > >
> > > It's much harder to sanely serialise DIO against buffered writes
> > > this way, because i_dio_count only forms a submission barrier in
> > > conjunction with the i_rwsem being held exclusively. e.g. ongoing
> > > DIO would result in the buffered write being indefinitely delayed.
> >
> > Isn't this already the case today with EX vs. SH iolock?
>
> No. We do not hold the i_rwsem across async DIO read or write, so
> we can have DIO in flight whilst a buffered write grabs and holds
> the i_rwsem exclusive. This is the problem that i_dio_count solves
> for truncate, etc. i.e. the only way to actually ensure there is no
> DIO in flight is to grab the i_rwsem exclusive and then call
> inode_dio_wait() to ensure all async DIO in flight completes before
> continuing.
>
> > I guess the answer depends on whether or not i_rwsem
> > starves existing writers in the face of ongoing new readers.
>
> It does not. If the lock is held for read and there is a pending
> write, new readers are queued behind the pending write waiters
> (see rwsem_down_read_slowpath(), which is triggered if there are
> pending waiters on an attempt to read lock).
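
Thanks for spelling this out. Just to make sure I follow the i_dio_count
part, the pattern you are describing is the one truncate-style code uses
today, roughly like this (a sketch only; drain_dio() is a made-up name
and the generic i_rwsem helpers stand in for the fs-specific iolock):

#include <linux/fs.h>

/*
 * Sketch: i_dio_count on its own is not a submission barrier; it only
 * works because the caller first takes i_rwsem exclusive to stop new
 * DIO submissions, then waits for the in-flight ones to drain.
 */
static void drain_dio(struct inode *inode)
{
        inode_lock(inode);      /* i_rwsem exclusive: blocks new DIO submission */
        inode_dio_wait(inode);  /* wait for i_dio_count to drop to zero */

        /* ... do the work that must not overlap with DIO ... */

        inode_unlock(inode);
}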
> > > > I don't really have time to turn this hand waving into code, but maybe we
> > > > should think if it's worthwhile or if I'm missing something important.
> > >
> > > If people are OK with XFS moving to an exclusive buffered or DIO
> > > submission model, then I can find some time to work on
> > > converting the IO path locking to use a two-state shared lock in
> > > preparation for the batched folio stuff that will allow concurrent
> > > buffered writes...
> >
> > I won't object to getting the best of all worlds, but I have to say,
> > upfront, this sounds a bit like premature optimization, and for
> > a workload (mixed buffered/dio write)
>
> Say what? The dio/buffered exclusion change provides a different
> data coherency and correctness model that is required to then
> optimise the workload you care about - mixed buffered read/write.
>
> This change doesn't optimise either DIO or buffered IO, nor is a
> shared-exclusive BIO/DIO lock going to improve performance of
> such mixed IO workloads in any way. IOWs, it's pretty hard to call
> a locking model change like this "optimisation", let alone call it
> "premature".
>

I was questioning the need to optimize the mixed buffered read/write
*while the file is also open for O_DIRECT*.

I thought that the solution would be easier if we could take Christoph's
"Let's ignore DIRECT I/O for the first step" and make it testable.
Then, if the inode is open O_DIRECT, there is no change to locking.

> > that I don't think anybody
> > does in practice and nobody should care how it performs.
> > Am I wrong?
>
> Yes and no. Mixed buffered/direct IO workloads are more common than
> you think. e.g. many backup programs use direct IO and so inherently
> mix DIO with buffered IO for the majority of files the backup
> program touches.

It still feels like we can forgo the performance of mixed buffered
read/write *while the backup program reads the database file*, should
a backup program ever really read a large database file...

> However, all we care about in this case is data
> coherency, not performance, and this change should improve data
> coherency between DIO and buffered IO...
>

I did not understand that you were proposing an improvement over the
existing state of affairs. Yeah, I certainly have no objections to
fixing things that are wrong.

> > For all practical purposes, we could maintain a counter in the inode
> > not for submitted DIO, but for files opened O_DIRECT.
> >
> > The first O_DIRECT open could serialize with in-flight
> > buffered writes holding a shared iolock and then buffered writes
> > could take SH vs. EX iolock depending on folio state and on
> > i_dio_open_count.
> I don't see how this can be made to work w.r.t. sane data coherency
> behaviour. e.g. how does this model serialise a new DIO write that
> is submitted after a share-locked buffered write has just started
> and passed all the "i_dio_count == 0" checks that enable it to use
> shared locking? i.e. we now have potentially overlapping buffered
> writes and DIO writes being done concurrently because the buffered
> write may not have instantiated folios in the page cache yet....
>

A shared-locked buffered write can only start when there is no
O_DIRECT file open on the inode. The first open for O_DIRECT, which
increments i_dio_open_count, needs to take the exclusive iolock to
wait for all in-flight buffered writes and then release the iolock.
All DIO submitted via O_DIRECT fds will then be safe against
in-flight shared-locked buffered writes.
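
Something along these lines - a very rough and untested sketch of the
idea; the counter would live in the fs-private inode (here it is just
passed in), the generic i_rwsem helpers stand in for the XFS iolock,
and all the names are made up:

#include <linux/fs.h>
#include <linux/atomic.h>

/*
 * Called from ->open() for O_DIRECT opens; ->release() would do the
 * matching atomic_dec().
 */
static void dio_open_barrier(struct inode *inode, atomic_t *dio_open_count)
{
        if (atomic_inc_return(dio_open_count) == 1) {
                /*
                 * First O_DIRECT opener: take the lock exclusive once
                 * to drain any shared-locked buffered writes already
                 * in flight, then drop it.
                 */
                inode_lock(inode);
                inode_unlock(inode);
        }
}

/* Buffered write path: decide between shared and exclusive locking. */
static void buffered_write_lock(struct inode *inode, atomic_t *dio_open_count)
{
        inode_lock_shared(inode);
        /*
         * Check the counter only after we hold the shared lock: an
         * opener racing with us has either not incremented the counter
         * yet, in which case its exclusive lock acquisition will wait
         * for us to finish, or we see the counter here and fall back
         * to exclusive locking ourselves.
         */
        if (atomic_read(dio_open_count)) {
                inode_unlock_shared(inode);
                inode_lock(inode);
        }
}

The ordering is what makes it work: the opener increments the counter
before taking the exclusive lock, and the buffered write checks the
counter only after taking the shared lock, so one side always observes
the other.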
> > I would personally prefer a simple solution that is good enough
> > and has a higher likelihood of attracting the development, review
> > and testing resources that are needed to bring it to the finish line.
>
> Have you looked at how simple the bcachefs buffered/dio exclusion
> implementation is? The lock mechanism itself is about 50 lines of
> code, and there are only 4 or 5 places where the lock is actually
> needed. It doesn't get much simpler than that.
>
> And, quite frankly, the fact that the bcachefs solution also covers AIO
> DIO in flight (which i_rwsem based locking does not!) means it is a
> more robust solution than trying to rely on racy i_dio_count hacks
> and folio residency in the page cache...

I will be happy to see this improvement.

Thanks,
Amir.
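
P.S. For my own education I tried to boil the two-state shared lock
concept down to plain userspace terms. This is purely illustrative -
it is not the bcachefs code, the names are made up, and it ignores
fairness - but it shows the shape of the thing: any number of holders
of one IO type may share the lock, while the two types exclude each
other.

#include <pthread.h>

enum io_type { IO_BUFFERED, IO_DIO };

struct two_state_lock {
        pthread_mutex_t mtx;
        pthread_cond_t  cond;
        long            count;  /* number of current holders */
        enum io_type    type;   /* IO type of the current holders */
};

#define TWO_STATE_LOCK_INIT \
        { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, IO_BUFFERED }

static void ts_lock(struct two_state_lock *l, enum io_type type)
{
        pthread_mutex_lock(&l->mtx);
        /* Wait until the lock is free or already held by our IO type. */
        while (l->count > 0 && l->type != type)
                pthread_cond_wait(&l->cond, &l->mtx);
        l->type = type;
        l->count++;
        pthread_mutex_unlock(&l->mtx);
}

static void ts_unlock(struct two_state_lock *l)
{
        pthread_mutex_lock(&l->mtx);
        if (--l->count == 0)
                pthread_cond_broadcast(&l->cond);
        pthread_mutex_unlock(&l->mtx);
}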