On Thu, Dec 7, 2023 at 1:28 AM Bernd Schubert <bernd.schubert@xxxxxxxxxxx> wrote:
>
>
>
> On 12/6/23 09:25, Amir Goldstein wrote:
> >>>> Is it actually important for a FUSE_DIRECT_IO_ALLOW_MMAP fs
> >>>> (e.g. virtiofsd) to support FOPEN_PARALLEL_DIRECT_WRITES?
> >>>> I guess not, otherwise the combination would have been tested.
> >>>
> >>> I'm not sure how many people are aware of these different flags/features.
> >>> I had just finalized the backport of the related patches to RHEL8 on
> >>> Friday, as we (or our customers) need both for different jobs.
> >>>
> >>>>
> >>>> FOPEN_PARALLEL_DIRECT_WRITES is typically important for
> >>>> network fs and FUSE_DIRECT_IO_ALLOW_MMAP is typically not
> >>>> for network fs. Right?
> >>>
> >>> We kind of have these use cases for our network file systems:
> >>>
> >>> FOPEN_PARALLEL_DIRECT_WRITES:
> >>> - Traditional HPC, large files, parallel IO
> >>> - Large file used on the local node as a container for many small files
> >>>
> >>> FUSE_DIRECT_IO_ALLOW_MMAP:
> >>> - compilation through gcc (not so important, just not nice when it
> >>>   does not work)
> >>> - rather recent: python libraries using mmap _reads_. As it is
> >>>   read-only, there is no consistency issue.
> >>>
> >>>
> >>> These jobs do not intermix - no issue as in generic/095. If such
> >>> applications really exist, I have no issue with a serialization penalty.
> >>> Just disabling FOPEN_PARALLEL_DIRECT_WRITES because other
> >>> nodes/applications need FUSE_DIRECT_IO_ALLOW_MMAP is not so nice.
> >>>
> >>> The final goal is also to have FOPEN_PARALLEL_DIRECT_WRITES work with
> >>> plain O_DIRECT and not only for FUSE_DIRECT_IO - I need to update this
> >>> branch and post the next version:
> >>> https://github.com/bsbernd/linux/commits/fuse-dio-v4
> >>>
> >>>
> >>> In the meantime I have another idea how to solve
> >>> FOPEN_PARALLEL_DIRECT_WRITES + FUSE_DIRECT_IO_ALLOW_MMAP
> >>
> >> Please find attached what I had in mind. With that, generic/095 is not
> >> crashing for me anymore. I just finished the initial coding - it still
> >> needs a bit of cleanup and maybe a few comments.
> >>
> >
> > Nice. I like the FUSE_I_CACHE_WRITES state.
> > For FUSE_PASSTHROUGH I will need to track if the inode is open/mapped
> > in caching mode, so FUSE_I_CACHE_WRITES can be cleared on release
> > of the last open file of the inode.
> >
> > I did not understand some of the complexity here:
> >
> >> /* The inode ever got page writes and we do not know for sure
> >>  * in the DIO path if these are pending - shared lock not possible */
> >> spin_lock(&fi->lock);
> >> if (!test_bit(FUSE_I_CACHE_WRITES, &fi->state)) {
> >>         if (!(*cnt_increased)) {
> >
> > How can *cnt_increased be true here?
>
> I think you missed the 2nd entry into this function, when the shared
> lock was already taken?

Yeah, I did.

> I have changed the code now to have all
> complexity in this function (test, lock, retest with lock, release,
> wakeup). I hope that will make it easier to see the intention of the
> code. Will post the new patches in the morning.
>

Sounds good. The current version was a bit hard to follow.

>
> >
> >>                 fi->shared_lock_direct_io_ctr++;
> >>                 *cnt_increased = true;
> >>         }
> >>         excl_lock = false;
> >
> > Seems like in every outcome of this function
> > *cnt_increased = !excl_lock
> > so there is no need for the out arg cnt_increased.
>
> If excl_lock were used as input - yeah, that would have worked as well.
> Or a parameter like "retest-under-lock". The code is changed now to avoid
> going in and out.
>
> >
> >> }
> >> spin_unlock(&fi->lock);
> >>
> >> out:
> >> if (excl_lock && *cnt_increased) {
> >>         bool wake = false;
> >>         spin_lock(&fi->lock);
> >>         if (--fi->shared_lock_direct_io_ctr == 0)
> >>                 wake = true;
> >>         spin_unlock(&fi->lock);
> >>         if (wake)
> >>                 wake_up(&fi->direct_io_waitq);
> >> }
> >
> > I don't see how this wake_up code is reachable.
> >
> > TBH, I don't fully understand the expected result.
> > Surely, the behavior of dio mixed with mmap is undefined. Right?
> > IIUC, your patch does not prevent dirtying the page cache while dio is in
> > flight. It only prevents writeback while dio is in flight, which is the
> > same behavior as with the exclusive inode lock. Right?
>
> Yeah, thanks. I will add it to the patch description.
>
> And there was actually an issue with the patch, as cache flushing needs
> to be initiated before making the lock decision - fixed now.
>

I thought there was, because of the wait in fuse_send_writepage(),
but I wasn't sure if I was following the flow correctly.

> >
> > Maybe this interaction is spelled out somewhere else, but if not
> > better spell it out for people like me who are new to this code.
>
> Sure, thanks a lot for your helpful comments!
>

Just to be clear, this patch looks like a good improvement and is mostly
independent of the "inode caching mode" and FOPEN_CACHE_MMAP idea
that I suggested.

The only thing that my idea changes is replacing the FUSE_I_CACHE_WRITES
state with a FUSE_I_CACHE_IO_MODE state, which is set earlier than
FUSE_I_CACHE_WRITES (on caching file open or on the first direct_io mmap)
and, unlike FUSE_I_CACHE_WRITES, is cleared on the last file close.

FUSE_I_CACHE_WRITES means that caching writes happened.
FUSE_I_CACHE_IO_MODE means that caching writes and reads may happen.

FOPEN_PARALLEL_DIRECT_WRITES obviously shouldn't care about
"caching reads may happen", but IMO that is a small trade-off to make
for maintaining the same state for "do not allow parallel dio" and
"do not allow passthrough open".

Thanks,
Amir.
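
P.S. For anyone trying to follow along without the full patch, below is a
rough, untested userspace model of the locking scheme as I read it from the
snippets above. It is only meant to illustrate the idea, not the kernel code:
the struct and function names (model_inode, dio_needs_exclusive_lock(), ...)
are made up here, and the pthread mutex/condvar merely stand in for fi->lock
and fi->direct_io_waitq.

#include <pthread.h>
#include <stdbool.h>

struct model_inode {
	pthread_mutex_t lock;      /* stands in for fi->lock */
	pthread_cond_t dio_waitq;  /* stands in for fi->direct_io_waitq */
	bool cache_writes;         /* stands in for FUSE_I_CACHE_WRITES */
	int shared_dio_cnt;        /* stands in for shared_lock_direct_io_ctr */
};

/*
 * Direct-IO write path: the inode lock may be taken shared only if the
 * inode never saw page-cache writes, and every shared dio writer is
 * counted while it is in flight.  Returns true if the caller must take
 * the exclusive lock instead.
 */
bool dio_needs_exclusive_lock(struct model_inode *mi)
{
	bool excl;

	pthread_mutex_lock(&mi->lock);
	excl = mi->cache_writes;
	if (!excl)
		mi->shared_dio_cnt++;
	pthread_mutex_unlock(&mi->lock);
	return excl;
}

/* Counterpart for the shared case: the last shared dio writer wakes waiters. */
void dio_shared_done(struct model_inode *mi)
{
	bool wake;

	pthread_mutex_lock(&mi->lock);
	wake = (--mi->shared_dio_cnt == 0);
	pthread_mutex_unlock(&mi->lock);
	if (wake)
		pthread_cond_broadcast(&mi->dio_waitq);
}

/* Page-cache write path: from now on dio writers fall back to exclusive. */
void mark_cache_writes(struct model_inode *mi)
{
	pthread_mutex_lock(&mi->lock);
	mi->cache_writes = true;
	pthread_mutex_unlock(&mi->lock);
}

/*
 * Writeback path: wait until no shared dio writer is in flight, which is
 * where I assume the wait in fuse_send_writepage() comes in.
 */
void writeback_wait_for_shared_dio(struct model_inode *mi)
{
	pthread_mutex_lock(&mi->lock);
	while (mi->shared_dio_cnt > 0)
		pthread_cond_wait(&mi->dio_waitq, &mi->lock);
	pthread_mutex_unlock(&mi->lock);
}

The real patch of course also has to handle the re-entry / retest-under-lock
case that Bernd described above; the model skips that to keep the basic
shared-count-plus-wakeup idea visible.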
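
P.P.S. And, to make the FUSE_I_CACHE_IO_MODE life cycle I described above a
bit more concrete, here is an equally rough sketch in the same userspace
style. All names are invented for illustration; the only point is that the
bit is set on a caching open or on an mmap of a direct_io file, cleared when
the last open file of the inode goes away, and that the same bit can gate
both parallel dio and passthrough open.

#include <pthread.h>
#include <stdbool.h>

struct inode_io_state {
	pthread_mutex_t lock;
	int nr_open_files;   /* all open files of this inode */
	bool cache_io_mode;  /* stands in for FUSE_I_CACHE_IO_MODE */
};

void file_open(struct inode_io_state *st, bool caching_open)
{
	pthread_mutex_lock(&st->lock);
	st->nr_open_files++;
	if (caching_open)
		st->cache_io_mode = true;  /* caching reads/writes may happen */
	pthread_mutex_unlock(&st->lock);
}

/* mmap of a direct_io file also switches the inode into cache-io mode. */
void direct_io_mmap(struct inode_io_state *st)
{
	pthread_mutex_lock(&st->lock);
	st->cache_io_mode = true;
	pthread_mutex_unlock(&st->lock);
}

/* Unlike FUSE_I_CACHE_WRITES, the bit is cleared on the last file close. */
void file_release(struct inode_io_state *st)
{
	pthread_mutex_lock(&st->lock);
	if (--st->nr_open_files == 0)
		st->cache_io_mode = false;
	pthread_mutex_unlock(&st->lock);
}

/* Both decisions key off the same per-inode bit; that is the whole point. */
bool parallel_dio_allowed(struct inode_io_state *st)
{
	pthread_mutex_lock(&st->lock);
	bool allowed = !st->cache_io_mode;
	pthread_mutex_unlock(&st->lock);
	return allowed;
}

bool passthrough_open_allowed(struct inode_io_state *st)
{
	return parallel_dio_allowed(st);
}

The nr_open_files counter is only there to model "cleared on release of the
last open file of the inode" from the discussion above.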