On Wed, Jan 8, 2025 at 12:33 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>
> On Wed, Jan 8, 2025 at 8:43 AM Chi Zhiling <chizhiling@xxxxxxx> wrote:
> >
> > On 2025/1/7 20:13, Amir Goldstein wrote:
> > > On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@xxxxxxx> wrote:
> > >> On 2024/12/29 06:17, Dave Chinner wrote:
> > >>> On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote:
> > >>>> On 2024/12/27 05:50, Dave Chinner wrote:
> > >>>>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote:
> > >>>>>> From: Chi Zhiling <chizhiling@xxxxxxxxxx>
> > >>>>>>
> > >>>>>> Using an rwsem to protect file data ensures that we always read a
> > >>>>>> completed modification. But because of the lock, we need to wait for the
> > >>>>>> writing process to release the rwsem before we can read, even if we are
> > >>>>>> reading a different region of the file. This can take a lot of time
> > >>>>>> when many processes need to write to and read from this file.
> > >>>>>>
> > >>>>>> On the other hand, the ext4 filesystem and others do not hold the lock
> > >>>>>> during buffered reading, which gives ext4 better performance in
> > >>>>>> that case. Therefore, I think it will be fine if we remove the lock in
> > >>>>>> xfs, as most applications can handle this situation.
> > >>>>>
> > >>>>> Nope.
> > >>>>>
> > >>>>> This means that XFS loses high level serialisation of incoming IO
> > >>>>> against operations like truncate, fallocate, pNFS operations, etc.
> > >>>>>
> > >>>>> We've been through this multiple times before; the solution lies in
> > >>>>> doing the work to make buffered writes use shared locking, not in
> > >>>>> removing shared locking from buffered reads.
> > >>>>
> > >>>> You mean using shared locking for buffered reads and writes, right?
> > >>>>
> > >>>> I think it's a great idea. In theory, write operations can be performed
> > >>>> simultaneously if they write to different ranges.
> > >>>
> > >>> Even if they overlap, the folio locks will prevent concurrent writes
> > >>> to the same range.
> > >>>
> > >>> Now that we have atomic write support as native functionality (i.e.
> > >>> RWF_ATOMIC), we really should not have to care that much about
> > >>> normal buffered IO being atomic. i.e. if the application wants
> > >>> atomic writes, it can now specify that it wants atomic writes and so
> > >>> we can relax the constraints we have on existing IO...
> > >>
> > >> Yes, I'm not particularly concerned about whether buffered I/O is
> > >> atomic. I'm more concerned about the multithreading performance of
> > >> buffered I/O.
> > >
> > > Hi Chi,
> > >
> > > Sorry for joining late, I was on vacation.
> > > I am very happy that you have taken on this task, and your timing is good,
> > > because John Garry just posted his patches for large atomic writes [1].
> >
> > I'm glad to have you on board. :)
> >
> > I'm not sure if my understanding is correct, but it seems that our
> > discussion is about multithreading safety, while John's patch is about
> > providing power-failure safety, even though both mention atomicity.
> >
>
> You are right about the *motivation* of John's work.
> But as a by-product of his work, we get an API to opt in for
> atomic writes, and we can piggyback on this opt-in API to say
> that whoever wants the legacy behavior can use the new API
> to get both power-failure safety and multithreading safety.
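
For illustration, this is roughly what that opt-in looks like from
userspace - an untested sketch, with two caveats: as of v6.13 xfs honors
RWF_ATOMIC only for direct I/O (extending the opt-in to buffered I/O is
exactly what is discussed below), and the fallback define mirrors the
kernel uapi value in case libc headers do not carry it yet:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040	/* kernel uapi value */
#endif

int main(void)
{
	/* O_DIRECT wants an aligned buffer; 4K matches a 4K fs block */
	static char buf[4096] __attribute__((aligned(4096)));
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);

	if (fd < 0)
		return 1;
	/* Ask for one untorn (power-fail atomic) 4K write at offset 0 */
	if (pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) < 0)
		perror("pwritev2(RWF_ATOMIC)");
	close(fd);
	return 0;
}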
>
> > > I need to explain the relation to atomic buffered I/O, because it is not
> > > easy to follow from the old discussions, and also some of the discussions
> > > about the solution were held in person during LSFMM2024.
> > >
> > > Naturally, your interest is improving the multithreading performance of
> > > buffered I/O, as was mine when I first posted this question [2].
> > >
> > > The issue with atomicity of buffered I/O is that xfs has traditionally
> > > provided atomicity of write vs. read (a.k.a. no torn writes), which is
> > > not required by the POSIX standard (because POSIX was not written with
> > > threads in mind) and is not respected by any other in-tree filesystem.
> > >
> > > It is obvious that the exclusive rw lock for buffered writes hurts
> > > performance in the case of mixed read/write on xfs, so the question was:
> > > if xfs provides atomic semantics that portable applications cannot rely
> > > on, why bother providing these atomic semantics at all?
> >
> > Perhaps we can add an option that allows distributions or users to
> > decide whether they need these semantics. I would not hesitate to
> > turn off these semantics on my system when the time comes.
> >
>
> Yes, we can, but we do not want to do that unless we have to.
> 99% of the users do not want the legacy semantics, so it would
> be better if they got the best performance without having to opt in for it.
>
> > > Dave's answer to this question was that there are some legacy applications
> > > (database applications IIRC) on production systems that do rely on the fact
> > > that xfs provides these semantics, and on the prerequisite that they run
> > > on xfs.
> > >
> > > However, it was noted that:
> > > 1. Those applications do not require atomicity for any size of IO; they
> > > typically work with an I/O size that is larger than the block size (e.g.
> > > 16K or 64K) and they only require no torn writes for this I/O size
> > > 2. Large folios and iomap can usually provide these semantics via the
> > > folio lock, but an application currently has no way of knowing whether
> > > the semantics are provided or not
> >
> > To be honest, it would be best if the folio lock could provide such
> > semantics, as it would not cause any potential problems for the
> > application, and we would have hope of achieving concurrent writes.
> >
> > However, I am not sure if this is easy to implement and will not cause
> > other problems.
> >
> > > 3. The upcoming ability of applications to opt in for atomic writes
> > > larger than the fs block size [1] can be used to accommodate the
> > > applications that want the legacy xfs semantics, and to avoid the need
> > > to enforce the legacy semantics all the time for no good reason
> > >
> > > Disclaimer: this is not a standard way to deal with potentially breaking
> > > legacy semantics, because old applications will not usually be rebuilt
> > > to opt in for the old semantics, but the use case at hand is probably
> > > so specific, of a specific application that relies on xfs-specific
> > > semantics, that there are currently no objections to trying this solution.
> > >
> > > [1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@xxxxxxxxxx/
> > > [2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@xxxxxxxxxxxxxx/
> > >
> > >>
> > >> Last week, it was mentioned that removing i_rwsem would have some
> > >> impact on truncate, fallocate, and pNFS operations.
> > >>
> > >> (I'm not familiar with pNFS, so please correct me if I'm wrong.)
> > >
> > > You are not wrong.
pNFS uses a "layout lease", which requires
> > > that the blockmap of the file not be modified while the lease is held,
> > > but I think that blocks that are not mapped (i.e. holes) can be mapped
> > > while the lease is held.
> > >
> > >>
> > >> My understanding is that the current i_rwsem is used to protect both
> > >> the file's data and its size. Operations like truncate, fallocate,
> > >> and pNFS use i_rwsem because they modify both the file's data and its
> > >> size. So, I'm wondering whether it's possible to use i_rwsem to protect
> > >> only the file's size, without protecting the file's data.
> > >>
> > >
> > > It also protects the file's blockmap. For example, punch hole
> > > does not change the size, but it unmaps blocks from the blockmap,
> > > leading to races that could end up reading stale data from disk
> > > if the lock wasn't taken.
> > >
> > >> So operations that modify the file's size need to be executed
> > >> sequentially. For example, buffered writes to the EOF, fallocate
> > >> operations without the "keep size" requirement, truncate operations,
> > >> etc., all need to hold an exclusive lock.
> > >>
> > >> Other operations require only a shared lock because they only need to
> > >> access the file's size without modifying it.
> > >>
> > >
> > > As far as I understand, an exclusive lock is not needed for non-extending
> > > writes, because it is ok to map new blocks.
> > > I guess the need for an exclusive lock for extending writes is related to
> > > the update of the file size, but I am not 100% sure.
> > > Anyway, an exclusive lock on extending writes is common in other
> > > filesystems, while an exclusive lock for non-extending writes is unique
> > > to xfs.
> > >
> > >>>
> > >>>> So we should track all the ranges we are reading or writing,
> > >>>> and check whether the new read or write operations can be performed
> > >>>> concurrently with the current operations.
> > >>>
> > >>> That is all discussed in detail in the discussions I linked.
> > >>
> > >> Sorry, I overlooked some details from the old discussion last time.
> > >> It seems that you are not satisfied with the effectiveness of
> > >> range locks.
> > >>
> > >
> > > Correct. This solution was taken off the table.
> > >
> > > I hope my explanation was correct and clear.
> > > If anything else is not clear, please feel free to ask.
> > >
> >
> > I think your explanation is very clear.
>
> One more thing I should mention.
> You do not need to wait for the atomic large writes patches to land.
> There is nothing stopping you from implementing the suggested
> solution based on the xfs code already in master (v6.13-rc1),
> which has support for the RWF_ATOMIC flag for writes.
>
> It just means that the API will not be usable by applications that
> want to do IO larger than the block size, but the concurrent read/write
> performance of 4K IO could be improved already.
>
> It's possible that all you need to do is:
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index c488ae26b23d0..2542f15496488 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -777,9 +777,10 @@ xfs_file_buffered_write(
>         ssize_t ret;
>         bool cleared_space = false;
>         unsigned int iolock;
> +       bool atomic_write = iocb->ki_flags & IOCB_ATOMIC;
>
>  write_retry:
> -       iolock = XFS_IOLOCK_EXCL;
> +       iolock = atomic_write ? XFS_IOLOCK_SHARED : XFS_IOLOCK_EXCL;

It's the other way around of course :) - the last line should read:

        iolock = atomic_write ? XFS_IOLOCK_EXCL : XFS_IOLOCK_SHARED;

And I did not state the obvious - this mock patch is NOT TESTED!

Thanks,
Amir.
>         ret = xfs_ilock_iocb(iocb, iolock);
> --
>
> xfs_file_write_checks() afterwards already takes care of promoting
> XFS_IOLOCK_SHARED to XFS_IOLOCK_EXCL for extending writes.
>
> It is possible that XFS_IOLOCK_EXCL could be immediately demoted
> back to XFS_IOLOCK_SHARED for atomic writes, as is done in
> xfs_file_dio_write_aligned() (see the sketch below).
>
> TBH, I am not sure which blockdevs support 4K atomic writes that could
> be used to test this.
>
> John, can you share your test setup instructions for atomic writes?
>
> Thanks,
> Amir.
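
P.S. For reference, the demote pattern in xfs_file_dio_write_aligned()
(xfs code as of v6.13) that a buffered-write variant could mirror looks
like this:

	ret = xfs_file_write_checks(iocb, from, &iolock);
	if (ret)
		goto out_unlock;

	/*
	 * We don't need to hold the IOLOCK exclusively across the IO, so
	 * demote the iolock back to shared if we had to take the exclusive
	 * lock in xfs_file_write_checks() for other reasons.
	 */
	if (iolock == XFS_IOLOCK_EXCL) {
		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
		iolock = XFS_IOLOCK_SHARED;
	}

Doing the same after xfs_file_write_checks() in xfs_file_buffered_write()
would drop the exclusive lock as soon as the EOF checks are done - same
caveat as above, this sketch is NOT TESTED.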