On Tue, 7 Jan 2025 13:13:17 +0100, Amir Goldstein <amir73il@xxxxxxxxx> wrote: > On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@xxxxxxx> wrote: > > > > > > > > On 2024/12/29 06:17, Dave Chinner wrote: > > > On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote: > > >> > > >> > > >> On 2024/12/27 05:50, Dave Chinner wrote: > > >>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote: > > >>>> From: Chi Zhiling <chizhiling@xxxxxxxxxx> > > >>>> > > >>>> Using an rwsem to protect file data ensures that we can always obtain a > > >>>> completed modification. But due to the lock, we need to wait for the > > >>>> write process to release the rwsem before we can read it, even if we are > > >>>> reading a different region of the file. This could take a lot of time > > >>>> when many processes need to write and read this file. > > >>>> > > >>>> On the other hand, The ext4 filesystem and others do not hold the lock > > >>>> during buffered reading, which make the ext4 have better performance in > > >>>> that case. Therefore, I think it will be fine if we remove the lock in > > >>>> xfs, as most applications can handle this situation. > > >>> > > >>> Nope. > > >>> > > >>> This means that XFS loses high level serialisation of incoming IO > > >>> against operations like truncate, fallocate, pnfs operations, etc. > > >>> > > >>> We've been through this multiple times before; the solution lies in > > >>> doing the work to make buffered writes use shared locking, not > > >>> removing shared locking from buffered reads. > > >> > > >> You mean using shared locking for buffered reads and writes, right? > > >> > > >> I think it's a great idea. In theory, write operations can be performed > > >> simultaneously if they write to different ranges. > > > > > > Even if they overlap, the folio locks will prevent concurrent writes > > > to the same range. > > > > > > Now that we have atomic write support as native functionality (i.e. > > > RWF_ATOMIC), we really should not have to care that much about > > > normal buffered IO being atomic. i.e. if the application wants > > > atomic writes, it can now specify that it wants atomic writes and so > > > we can relax the constraints we have on existing IO... > > > > Yes, I'm not particularly concerned about whether buffered I/O is > > atomic. I'm more concerned about the multithreading performance of > > buffered I/O. > > Hi Chi, > > Sorry for joining late, I was on vacation. > I am very happy that you have taken on this task and your timing is good, > because John Garry just posted his patches for large atomic writes [1]. > > I need to explain the relation to atomic buffered I/O, because it is not > easy to follow it from the old discussions and also some of the discussions > about the solution were held in-person during LSFMM2024. > > Naturally, your interest is improving multithreading performance of > buffered I/O, so was mine when I first posted this question [2]. > > The issue with atomicity of buffered I/O is the xfs has traditionally > provided atomicity of write vs. read (a.k.a no torn writes), which is > not required by POSIX standard (because POSIX was not written with > threads in mind) and is not respected by any other in-tree filesystem. > > It is obvious that the exclusive rw lock for buffered write hurts performance > in the case of mixed read/write on xfs, so the question was - if xfs provides > atomic semantics that portable applications cannot rely on, why bother > providing these atomic semantics at all? > > Dave's answer to this question was that there are some legacy applications > (database applications IIRC) on production systems that do rely on the fact > that xfs provides this semantics and on the prerequisite that they run on xfs. > > However, it was noted that: > 1. Those application do not require atomicity for any size of IO, they > typically work in I/O size that is larger than block size (e.g. 16K or 64K) > and they only require no torn writes for this I/O size > 2. Large folios and iomap can usually provide this semantics via folio lock, > but application has currently no way of knowing if the semantics are > provided or not > 3. The upcoming ability of application to opt-in for atomic writes larger > than fs block size [1] can be used to facilitate the applications that > want to legacy xfs semantics and avoid the need to enforce the legacy > semantics all the time for no good reason > > Disclaimer: this is not a standard way to deal with potentially breaking > legacy semantics, because old applications will not usually be rebuilt > to opt-in for the old semantics, but the use case in hand is probably > so specific, of a specific application that relies on xfs specific semantics > that there are currently no objections for trying this solution. > > [1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@xxxxxxxxxx/ > [2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@xxxxxxxxxxxxxx/ > > > > > Last week, it was mentioned that removing i_rwsem would have some > > impacts on truncate, fallocate, and PNFS operations. > > > > (I'm not familiar with pNFS, so please correct me if I'm wrong.) > > You are not wrong. pNFS uses a "layout lease", which requires > that the blockmap of the file will not be modified while the lease is held. > but I think that block that are not mapped (i.e. holes) can be mapped > while the lease is held. > > > > > My understanding is that the current i_rwsem is used to protect both > > the file's data and its size. Operations like truncate, fallocate, > > and PNFS use i_rwsem because they modify both the file's data and its > > size. So, I'm thinking whether it's possible to use i_rwsem to protect > > only the file's size, without protecting the file's data. > > > > It also protects the file's blockmap, for example, punch hole > does not change the size, but it unmaps blocks from the blockmap, > leading to races that could end up reading stale data from disk > if the lock wasn't taken. > > > So operations that modify the file's size need to be executed > > sequentially. For example, buffered writes to the EOF, fallocate > > operations without the "keep size" requirement, and truncate operations, > > etc, all need to hold an exclusive lock. > > > > Other operations require a shared lock because they only need to access > > the file's size without modifying it. > > > > As far as I understand, exclusive lock is not needed for non-extending > writes, because it is ok to map new blocks. > I guess the need for exclusive lock for extending writes is related to > update of file size, but not 100% sure. > Anyway, exclusive lock on extending write is widely common in other fs, > while exclusive lock for non-extending write is unique to xfs. I am sorry but I can't understand, because I found ext4_buffered_write_iter() take exclusive lock unconditionally. Did I miss some context of the discussion? Thank you very much. Jinliang Zheng > > > > > > >> So we should track all the ranges we are reading or writing, > > >> and check whether the new read or write operations can be performed > > >> concurrently with the current operations. > > > > > > That is all discussed in detail in the discussions I linked. > > > > Sorry, I overlooked some details from old discussion last time. > > It seems that you are not satisfied with the effectiveness of > > range locks. > > > > Correct. This solution was taken off the table. > > I hope my explanation was correct and clear. > If anything else is not clear, please feel free to ask. > > Thanks, > Amir.