On Thu, Dec 7, 2023 at 8:38 PM Bernd Schubert <bernd.schubert@xxxxxxxxxxx> wrote: > > > > On 12/7/23 08:39, Amir Goldstein wrote: > > On Thu, Dec 7, 2023 at 1:28 AM Bernd Schubert > > <bernd.schubert@xxxxxxxxxxx> wrote: > >> > >> > >> > >> On 12/6/23 09:25, Amir Goldstein wrote: > >>>>>> Is it actually important for FUSE_DIRECT_IO_ALLOW_MMAP fs > >>>>>> (e.g. virtiofsd) to support FOPEN_PARALLEL_DIRECT_WRITES? > >>>>>> I guess not otherwise, the combination would have been tested. > >>>>> > >>>>> I'm not sure how many people are aware of these different flags/features. > >>>>> I had just finalized the backport of the related patches to RHEL8 on > >>>>> Friday, as we (or our customers) need both for different jobs. > >>>>> > >>>>>> > >>>>>> FOPEN_PARALLEL_DIRECT_WRITES is typically important for > >>>>>> network fs and FUSE_DIRECT_IO_ALLOW_MMAP is typically not > >>>>>> for network fs. Right? > >>>>> > >>>>> We kind of have these use cases for our network file systems > >>>>> > >>>>> FOPEN_PARALLEL_DIRECT_WRITES: > >>>>> - Traditional HPC, large files, parallel IO > >>>>> - Large file used on local node as container for many small files > >>>>> > >>>>> FUSE_DIRECT_IO_ALLOW_MMAP: > >>>>> - compilation through gcc (not so important, just not nice when it > >>>>> does not work) > >>>>> - rather recent: python libraries using mmap _reads_. As it is read > >>>>> only no issue of consistency. > >>>>> > >>>>> > >>>>> These jobs do not intermix - no issue as in generic/095. If such > >>>>> applications really exist, I have no issue with a serialization penalty. > >>>>> Just disabling FOPEN_PARALLEL_DIRECT_WRITES because other > >>>>> nodes/applications need FUSE_DIRECT_IO_ALLOW_MMAP is not so nice. > >>>>> > >>>>> Final goal is also to have FOPEN_PARALLEL_DIRECT_WRITES to work on plain > >>>>> O_DIRECT and not only for FUSE_DIRECT_IO - I need to update this branch > >>>>> and post the next version > >>>>> https://github.com/bsbernd/linux/commits/fuse-dio-v4 > >>>>> > >>>>> > >>>>> In the mean time I have another idea how to solve > >>>>> FOPEN_PARALLEL_DIRECT_WRITES + FUSE_DIRECT_IO_ALLOW_MMAP > >>>> > >>>> Please find attached what I had in my mind. With that generic/095 is not > >>>> crashing for me anymore. I just finished the initial coding - it still > >>>> needs a bit cleanup and maybe a few comments. > >>>> > >>> > >>> Nice. I like the FUSE_I_CACHE_WRITES state. > >>> For FUSE_PASSTHROUGH I will need to track if inode is open/mapped > >>> in caching mode, so FUSE_I_CACHE_WRITES can be cleared on release > >>> of the last open file of the inode. > >>> > >>> I did not understand some of the complexity here: > >>> > >>>> /* The inode ever got page writes and we do not know for sure > >>>> * in the DIO path if these are pending - shared lock not possible */ > >>>> spin_lock(&fi->lock); > >>>> if (!test_bit(FUSE_I_CACHE_WRITES, &fi->state)) { > >>>> if (!(*cnt_increased)) { > >>> > >>> How can *cnt_increased be true here? > >> > >> I think you missed the 2nd entry into this function, when the shared > >> lock was already taken? > > > > Yeh, I did. > > > >> I have changed the code now to have all > >> complexity in this function (test, lock, retest with lock, release, > >> wakeup). I hope that will make it easier to see the intention of the > >> code. Will post the new patches in the morning. > >> > > > > Sounds good. Current version was a bit hard to follow. > > > >> > >>> > >>>> fi->shared_lock_direct_io_ctr++; > >>>> *cnt_increased = true; > >>>> } > >>>> excl_lock = false; > >>> > >>> Seems like in every outcome of this function > >>> *cnt_increased = !excl_lock > >>> so there is not need for out arg cnt_increased > >> > >> If excl_lock would be used as input - yeah, would have worked as well. > >> Or a parameter like "retest-under-lock". Code is changed now to avoid > >> going in and out. > >> > >>> > >>>> } > >>>> spin_unlock(&fi->lock); > >>>> > >>>> out: > >>>> if (excl_lock && *cnt_increased) { > >>>> bool wake = false; > >>>> spin_lock(&fi->lock); > >>>> if (--fi->shared_lock_direct_io_ctr == 0) > >>>> wake = true; > >>>> spin_unlock(&fi->lock); > >>>> if (wake) > >>>> wake_up(&fi->direct_io_waitq); > >>>> } > >>> > >>> I don't see how this wake_up code is reachable. > >>> > >>> TBH, I don't fully understand the expected result. > >>> Surely, the behavior of dio mixed with mmap is undefined. Right? > >>> IIUC, your patch does not prevent dirtying page cache while dio is in > >>> flight. It only prevents writeback while dio is in flight, which is the same > >>> behavior as with exclusive inode lock. Right? > >> > >> Yeah, thanks. I will add it in the patch description. > >> > >> And there was actually an issue with the patch, as cache flushing needs > >> to be initiated before doing the lock decision, fixed now. > >> > > > > I thought there was, because of the wait in fuse_send_writepage() > > but wasn't sure if I was following the flow correctly. > > > >>> > >>> Maybe this interaction is spelled out somewhere else, but if not > >>> better spell it out for people like me that are new to this code. > >> > >> Sure, thanks a lot for your helpful comments! > >> > > > > Just to be clear, this patch looks like a good improvement and > > is mostly independent of the "inode caching mode" and > > FOPEN_CACHE_MMAP idea that I suggested. > > > > The only thing that my idea changes is replacing the > > FUSE_I_CACHE_WRITES state with a FUSE_I_CACHE_IO_MODE > > state, which is set earlier than FUSE_I_CACHE_WRITES > > on caching file open or first direct_io mmap and unlike > > FUSE_I_CACHE_WRITES, it is cleared on the last file close. > > > > FUSE_I_CACHE_WRITES means that caching writes happened. > > FUSE_I_CACHE_IO_MODE means the caching writes and reads > > may happen. > > > > FOPEN_PARALLEL_DIRECT_WRITES obviously shouldn't care > > about "caching reads may happen", but IMO that is a small trade off > > to make for maintaining the same state for > > "do not allow parallel dio" and "do not allow passthrough open". > > I think the attached patches should do, it now also unsets IMO, your patch is still more complicated than it should be. There is no need for the complicated retest state machine. If you split the helpers to: bool exclusive_lock fuse_dio_wr_needs_exclusive_lock(); ... fuse_dio_lock_inode(iocb, &exclusive); ... fuse_dio_unlock_inode(iocb, &exclusive); Then you only need to test FUSE_I_CACHE_IO_MODE in fuse_dio_wr_needs_exclusive_lock() and you only need to increment shared_lock_direct_io_ctr after taking shared lock and re-testing FUSE_I_CACHE_IO_MODE. > FUSE_I_CACHE_IO_MODE. Setting the flag actually has to be done from > fuse_file_mmap (and not from fuse_send_writepage) to avoid a dead stall, > but that aligns with passthrough anyway? Yes. I see that shared_lock_direct_io_ctr is checked without lock or barriers in and the wait_event() should be interruptible. I am also not sure if it breaks any locking order for mmap because the task that is going to wake it up is holding the shared inode lock... While looking at this code, the invalidate_inode_pages2() looks suspicious. If inode is already in FUSE_I_CACHE_IO_MODE when performing another mmap, doesn't that have potential for data loss? (even before your patch I mean) > Amir, right now it only sets > FUSE_I_CACHE_IO_MODE for VM_MAYWRITE. Maybe you could add a condition > for passthrough there? > We could add a condition, but I don't think that we should. I think we should refrain from different behavior when it is not justified. I think it is not justified to allow parallel dio if any file is open in caching mode on the inode and any mmap (private or shared) exists on the inode. That means that FUSE_I_CACHE_IO_MODE should be set on any mmap, and already on open for non direct_io files. Mixing caching and direct io on the same inode is hard as it is already and there is no need to add complexity by allowing parallel dio in that case. IMO it wins us nothing. The FUSE_I_CACHE_IO_MODE could be cleared on last file close (as your patch did) but it could be cleared earlier if instead of tracking refcount of open file, we track refcount of files open in caching mode or mmaped, which is what the FOPEN_MMAP_CACHE flag I suggested is for. Not sure this is a big win over refount of open files, which is simpler. The use case is a db file which is open with concurrent dio writers and some 3rd party app decides that it wants to mmap this file for some other reason (indexer, virus scan, whatnot) and will taint the inode with FUSE_I_CACHE_IO_MODE and degrade db performance until db closes the file. > @Miklos, could please tell me how to move forward? I definitely need to > rebase to fuse-next, but my question is if this patch here should > replace Amirs fix (and get back ported) or if we should apply it on top > of Amirs patch and so let that simple fix get back ported? Given this is > all features and new flags - I'm all for for the simple fix. > If you agree on the general approach, I can put this on top of my dio > consolidate branch and rebase the rest of the patches on top of it. That > part will get a bit more complicated, as we will also need to handle > plain O_DIRECT. > I was planning to post a patch for FUSE_I_CACHE_IO_MODE myself, but feel free to work on your version and we could decide which parts to take from which patch at the end. Thanks, Amir.