On Wed, Jan 15, 2025 at 10:41 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote:
> > On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote:
> > > Sorry if this is out of left field as I haven't followed the discussion
> > > closely, but I presumed one of the reasons Darrick and Christoph raised
> > > the idea of using the folio batch thing I'm playing around with on zero
> > > range for buffered writes would be to acquire and lock all targeted
> > > folios up front. If so, would that help with what you're trying to
> > > achieve here? (If not, nothing to see here, move along.. ;).
> >
> > I mostly thought about acquiring, as locking doesn't really have much
> > batching effects. That being said, now that you got the idea in my mind
> > here's my early morning brainfart on it:
> >
> > Let's ignore DIRECT I/O for the first step. In that case lookup /
> > allocation and locking all folios for write before copying data will
> > remove the need for i_rwsem in the read and write path. In a way that
> > sounds perfect, and given that btrfs already does that (although in a
> > very convoluted way) we know it's possible.
>
> Yes, this seems like a sane, general approach to allowing concurrent
> buffered writes (and reads).
>
> > But direct I/O throws a big monkey wrench here as already mentioned by
> > others. Now one interesting thing some file systems have done is
> > to serialize buffered against direct I/O, either by waiting for one
> > to finish, or by simply forcing buffered I/O when direct I/O would
> > conflict.
>
> Right. We really don't want to downgrade to buffered IO if we can
> help it, though.
>
> > It's easy to detect outstanding direct I/O using i_dio_count
> > so buffered I/O could wait for that, and downgrading to buffered I/O
> > (potentially using the new uncached mode from Jens) if there are any
> > pages on the mapping after the invalidation also sounds pretty doable.
>
> It's much harder to sanely serialise DIO against buffered writes
> this way, because i_dio_count only forms a submission barrier in
> conjunction with the i_rwsem being held exclusively. e.g. ongoing
> DIO would result in the buffered write being indefinitely delayed.

Isn't this already the case today with EX vs. SH iolock?

I guess the answer depends on whether or not i_rwsem starves existing
writers in the face of ongoing new readers.

If my memory serves me right, this exact behavior of i_rwsem is what
sparked the original regression report when xfs moved from i_iolock
to i_rwsem [1].

[1] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@xxxxxxxxxxxxxx/

> I think the model and method that bcachefs uses is probably the best
> way to move forward - the "two-state exclusive shared" lock which it
> uses to do buffered vs direct exclusion is a simple, easy way to
> handle this problem. The same-state shared locking fast path is a
> single atomic cmpxchg operation, so it has negligible extra overhead
> compared to using a rwsem in the shared DIO fast path.
>
> The lock also has non-owner semantics, so DIO can take it during
> submission and then drop it during IO completion. This solves the
> problem we currently use the i_rwsem and
> inode_dio_{start,end/wait}() to solve (i.e. create a DIO submission
> barrier and waiting for all existing DIO to drain).
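
Just to make sure I understand the proposal, here is a hand-wavy
sketch of how I read the "two-state exclusive shared" idea. This is
NOT the bcachefs code, just an illustration, and it deliberately
ignores the fairness/starvation question I raise above:

/*
 * Hand-wavy sketch (not the bcachefs implementation):
 * v > 0 means v buffered I/O holders, v < 0 means -v direct I/O
 * holders, v == 0 means the lock is idle.  Same-state lockers share;
 * switching state has to wait for the count to drain back to zero.
 */
#include <linux/atomic.h>
#include <linux/wait.h>

struct two_state_lock {
	atomic_long_t		v;
	wait_queue_head_t	wait;
};

static bool two_state_trylock(struct two_state_lock *l, long dir)
{
	long old = atomic_long_read(&l->v);

	do {
		/* held in the other state - caller has to wait */
		if (dir > 0 ? old < 0 : old > 0)
			return false;
	} while (!atomic_long_try_cmpxchg(&l->v, &old, old + dir));
	return true;
}

/* dir is +1 for buffered I/O, -1 for direct I/O */
static void two_state_lock(struct two_state_lock *l, long dir)
{
	wait_event(l->wait, two_state_trylock(l, dir));
}

/* no owner tracking, so DIO completion can drop the submitter's hold */
static void two_state_unlock(struct two_state_lock *l, long dir)
{
	if (atomic_long_sub_return(dir, &l->v) == 0)
		wake_up_all(&l->wait);
}

If that is roughly the idea, then the starvation question is the same
one we have today: new same-state lockers can keep the count from
ever draining, so a real implementation would need some notion of
fairness once the other state starts waiting.
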
>
> IOWs, a two-state shared lock provides the mechanism to allow DIO
> to be done without holding the i_rwsem at all, as well as being able
> to elide two atomic operations per DIO to track in-flight DIOs.
>
> We'd get this whilst maintaining buffered/DIO coherency without
> adding any new overhead to the DIO path, and allow concurrent
> buffered reads and writes that have their atomicity defined by the
> batched folio locking strategy that Brian is working on...
>
> This only leaves DIO coherency issues with mmap() based IO as an
> issue, but that's a problem for a different day...
>
> > I don't really have time to turn this hand waving into, but maybe we
> > should think if it's worthwhile or if I'm missing something important.
>
> If people are OK with XFS moving to an exclusive buffered or DIO
> submission model, then I can find some time to work on converting
> the IO path locking to use a two-state shared lock in preparation
> for the batched folio stuff that will allow concurrent buffered
> writes...

I won't object to getting the best of all worlds, but I have to say
upfront that this sounds a bit like premature optimization for a
workload (mixed buffered/dio writes) that I don't think anybody runs
in practice, and whose performance nobody should care about.

Am I wrong?

For all practical purposes, we could maintain a counter in the inode
not for submitted DIO, but for files opened O_DIRECT.

The first O_DIRECT open could serialize with in-flight buffered
writes holding a shared iolock, and then buffered writes could take
the iolock SH vs. EX depending on folio state and on i_dio_open_count.

I would personally prefer a simple solution that is good enough and
has a better chance of getting the development, review and testing
resources needed to bring it to the finish line.

Thanks,
Amir.
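
P.S. to make the O_DIRECT open counter idea a bit more concrete, a
hand-wavy sketch. All of the names below are made up for
illustration (i_dio_open_count does not exist today); the real thing
would hang off the XFS inode and hook into ->open()/->release():

#include <linux/atomic.h>
#include <linux/fs.h>
#include <linux/rwsem.h>

struct foo_inode {
	struct rw_semaphore	iolock;			/* stand-in for the iolock */
	atomic_t		i_dio_open_count;	/* files currently open O_DIRECT */
};

static int foo_open(struct foo_inode *ip, struct file *filp)
{
	if (filp->f_flags & O_DIRECT) {
		/*
		 * The first O_DIRECT opener takes the iolock exclusive
		 * once, to drain buffered writers that decided to take
		 * it shared before the counter went up.
		 */
		if (atomic_inc_return(&ip->i_dio_open_count) == 1) {
			down_write(&ip->iolock);
			up_write(&ip->iolock);
		}
	}
	return 0;
}

static void foo_release(struct foo_inode *ip, struct file *filp)
{
	if (filp->f_flags & O_DIRECT)
		atomic_dec(&ip->i_dio_open_count);
}

static void foo_buffered_write(struct foo_inode *ip)
{
	bool excl;

retry:
	/* shared unless someone has the file open O_DIRECT */
	excl = atomic_read(&ip->i_dio_open_count) > 0;
	if (excl)
		down_write(&ip->iolock);
	else
		down_read(&ip->iolock);

	/* recheck: an O_DIRECT open may have raced with our decision */
	if (!excl && atomic_read(&ip->i_dio_open_count) > 0) {
		up_read(&ip->iolock);
		goto retry;
	}

	/* ... lock folios and copy the data in here ... */

	if (excl)
		up_write(&ip->iolock);
	else
		up_read(&ip->iolock);
}

Obviously the devil is in the details (folio state, rwsem fairness),
but something along these lines is what I had in mind.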