On Fri, Dec 8, 2023 at 9:50 PM Bernd Schubert <bernd.schubert@xxxxxxxxxxx> wrote: > > > > On 12/8/23 09:39, Amir Goldstein wrote: > > On Thu, Dec 7, 2023 at 8:38 PM Bernd Schubert > > <bernd.schubert@xxxxxxxxxxx> wrote: > >> > >> > >> > >> On 12/7/23 08:39, Amir Goldstein wrote: > >>> On Thu, Dec 7, 2023 at 1:28 AM Bernd Schubert > >>> <bernd.schubert@xxxxxxxxxxx> wrote: > >>>> > >>>> > >>>> > >>>> On 12/6/23 09:25, Amir Goldstein wrote: > >>>>>>>> Is it actually important for FUSE_DIRECT_IO_ALLOW_MMAP fs > >>>>>>>> (e.g. virtiofsd) to support FOPEN_PARALLEL_DIRECT_WRITES? > >>>>>>>> I guess not otherwise, the combination would have been tested. > >>>>>>> > >>>>>>> I'm not sure how many people are aware of these different flags/features. > >>>>>>> I had just finalized the backport of the related patches to RHEL8 on > >>>>>>> Friday, as we (or our customers) need both for different jobs. > >>>>>>> > >>>>>>>> > >>>>>>>> FOPEN_PARALLEL_DIRECT_WRITES is typically important for > >>>>>>>> network fs and FUSE_DIRECT_IO_ALLOW_MMAP is typically not > >>>>>>>> for network fs. Right? > >>>>>>> > >>>>>>> We kind of have these use cases for our network file systems > >>>>>>> > >>>>>>> FOPEN_PARALLEL_DIRECT_WRITES: > >>>>>>> - Traditional HPC, large files, parallel IO > >>>>>>> - Large file used on local node as container for many small files > >>>>>>> > >>>>>>> FUSE_DIRECT_IO_ALLOW_MMAP: > >>>>>>> - compilation through gcc (not so important, just not nice when it > >>>>>>> does not work) > >>>>>>> - rather recent: python libraries using mmap _reads_. As it is read > >>>>>>> only no issue of consistency. > >>>>>>> > >>>>>>> > >>>>>>> These jobs do not intermix - no issue as in generic/095. If such > >>>>>>> applications really exist, I have no issue with a serialization penalty. > >>>>>>> Just disabling FOPEN_PARALLEL_DIRECT_WRITES because other > >>>>>>> nodes/applications need FUSE_DIRECT_IO_ALLOW_MMAP is not so nice. > >>>>>>> > >>>>>>> Final goal is also to have FOPEN_PARALLEL_DIRECT_WRITES to work on plain > >>>>>>> O_DIRECT and not only for FUSE_DIRECT_IO - I need to update this branch > >>>>>>> and post the next version > >>>>>>> https://github.com/bsbernd/linux/commits/fuse-dio-v4 > >>>>>>> > >>>>>>> > >>>>>>> In the mean time I have another idea how to solve > >>>>>>> FOPEN_PARALLEL_DIRECT_WRITES + FUSE_DIRECT_IO_ALLOW_MMAP > >>>>>> > >>>>>> Please find attached what I had in my mind. With that generic/095 is not > >>>>>> crashing for me anymore. I just finished the initial coding - it still > >>>>>> needs a bit cleanup and maybe a few comments. > >>>>>> > >>>>> > >>>>> Nice. I like the FUSE_I_CACHE_WRITES state. > >>>>> For FUSE_PASSTHROUGH I will need to track if inode is open/mapped > >>>>> in caching mode, so FUSE_I_CACHE_WRITES can be cleared on release > >>>>> of the last open file of the inode. > >>>>> > >>>>> I did not understand some of the complexity here: > >>>>> > >>>>>> /* The inode ever got page writes and we do not know for sure > >>>>>> * in the DIO path if these are pending - shared lock not possible */ > >>>>>> spin_lock(&fi->lock); > >>>>>> if (!test_bit(FUSE_I_CACHE_WRITES, &fi->state)) { > >>>>>> if (!(*cnt_increased)) { > >>>>> > >>>>> How can *cnt_increased be true here? > >>>> > >>>> I think you missed the 2nd entry into this function, when the shared > >>>> lock was already taken? > >>> > >>> Yeh, I did. > >>> > >>>> I have changed the code now to have all > >>>> complexity in this function (test, lock, retest with lock, release, > >>>> wakeup). I hope that will make it easier to see the intention of the > >>>> code. Will post the new patches in the morning. > >>>> > >>> > >>> Sounds good. Current version was a bit hard to follow. > >>> > >>>> > >>>>> > >>>>>> fi->shared_lock_direct_io_ctr++; > >>>>>> *cnt_increased = true; > >>>>>> } > >>>>>> excl_lock = false; > >>>>> > >>>>> Seems like in every outcome of this function > >>>>> *cnt_increased = !excl_lock > >>>>> so there is not need for out arg cnt_increased > >>>> > >>>> If excl_lock would be used as input - yeah, would have worked as well. > >>>> Or a parameter like "retest-under-lock". Code is changed now to avoid > >>>> going in and out. > >>>> > >>>>> > >>>>>> } > >>>>>> spin_unlock(&fi->lock); > >>>>>> > >>>>>> out: > >>>>>> if (excl_lock && *cnt_increased) { > >>>>>> bool wake = false; > >>>>>> spin_lock(&fi->lock); > >>>>>> if (--fi->shared_lock_direct_io_ctr == 0) > >>>>>> wake = true; > >>>>>> spin_unlock(&fi->lock); > >>>>>> if (wake) > >>>>>> wake_up(&fi->direct_io_waitq); > >>>>>> } > >>>>> > >>>>> I don't see how this wake_up code is reachable. > >>>>> > >>>>> TBH, I don't fully understand the expected result. > >>>>> Surely, the behavior of dio mixed with mmap is undefined. Right? > >>>>> IIUC, your patch does not prevent dirtying page cache while dio is in > >>>>> flight. It only prevents writeback while dio is in flight, which is the same > >>>>> behavior as with exclusive inode lock. Right? > >>>> > >>>> Yeah, thanks. I will add it in the patch description. > >>>> > >>>> And there was actually an issue with the patch, as cache flushing needs > >>>> to be initiated before doing the lock decision, fixed now. > >>>> > >>> > >>> I thought there was, because of the wait in fuse_send_writepage() > >>> but wasn't sure if I was following the flow correctly. > >>> > >>>>> > >>>>> Maybe this interaction is spelled out somewhere else, but if not > >>>>> better spell it out for people like me that are new to this code. > >>>> > >>>> Sure, thanks a lot for your helpful comments! > >>>> > >>> > >>> Just to be clear, this patch looks like a good improvement and > >>> is mostly independent of the "inode caching mode" and > >>> FOPEN_CACHE_MMAP idea that I suggested. > >>> > >>> The only thing that my idea changes is replacing the > >>> FUSE_I_CACHE_WRITES state with a FUSE_I_CACHE_IO_MODE > >>> state, which is set earlier than FUSE_I_CACHE_WRITES > >>> on caching file open or first direct_io mmap and unlike > >>> FUSE_I_CACHE_WRITES, it is cleared on the last file close. > >>> > >>> FUSE_I_CACHE_WRITES means that caching writes happened. > >>> FUSE_I_CACHE_IO_MODE means the caching writes and reads > >>> may happen. > >>> > >>> FOPEN_PARALLEL_DIRECT_WRITES obviously shouldn't care > >>> about "caching reads may happen", but IMO that is a small trade off > >>> to make for maintaining the same state for > >>> "do not allow parallel dio" and "do not allow passthrough open". > >> > >> I think the attached patches should do, it now also unsets > > > > IMO, your patch is still more complicated than it should be. > > There is no need for the complicated retest state machine. > > If you split the helpers to: > > > > bool exclusive_lock fuse_dio_wr_needs_exclusive_lock(); > > ... > > fuse_dio_lock_inode(iocb, &exclusive); > > ... > > fuse_dio_unlock_inode(iocb, &exclusive); > > > > Then you only need to test FUSE_I_CACHE_IO_MODE in > > fuse_dio_wr_needs_exclusive_lock() > > and you only need to increment shared_lock_direct_io_ctr > > after taking shared lock and re-testing FUSE_I_CACHE_IO_MODE. > > Hmm, I'm not sure. > > I changed fuse_file_mmap() to call this function > > /* > * direct-io with shared locks cannot handle page cache io - set an inode > * flag to disable shared locks and wait until remaining threads are done > */ > static void fuse_file_mmap_handle_dio_writers(struct inode *inode) > { > struct fuse_inode *fi = get_fuse_inode(inode); > > spin_lock(&fi->lock); > set_bit(FUSE_I_CACHE_IO_MODE, &fi->state); > while (fi->shared_lock_direct_io_ctr > 0) { > spin_unlock(&fi->lock); > wait_event_interruptible(fi->direct_io_waitq, > fi->shared_lock_direct_io_ctr == 0); > spin_lock(&fi->lock); > } > spin_unlock(&fi->lock); > } > > > Before we had indeed a race. Idea for fuse_file_mmap_handle_dio_writers() > and fuse_dio_lock_inode() is to either have FUSE_I_CACHE_IO_MODE set, > or fi->shared_lock_direct_io_ctr is greater 0, but that requires that > FUSE_I_CACHE_IO_MODE is checked for when fi->lock is taken. > > > I'm going to think about over the weekend if your suggestion > to increase fi->shared_lock_direct_io_ctr only after taking the shared > lock is possible. Right now I don't see how to do that. > > > > > >> FUSE_I_CACHE_IO_MODE. Setting the flag actually has to be done from > >> fuse_file_mmap (and not from fuse_send_writepage) to avoid a dead stall, > >> but that aligns with passthrough anyway? > > > > Yes. > > > > I see that shared_lock_direct_io_ctr is checked without lock or barriers > > in and the wait_event() should be interruptible. > > Thanks, fixed with the function above. > > > I am also not sure if it breaks any locking order for mmap because > > the task that is going to wake it up is holding the shared inode lock... > > The waitq has its own lock. We have > > fuse_file_mmap - called under some mmap lock, waitq lock > > fuse_dio_lock_inode: no lock taken before calling wakeup > > fuse_direct_write_iter: wakeup after release of all locks > > So I don't think we have a locker issue (lockdep also doesn't annotate > anything). I don't think that lockdep can understand this dependency. > What we definitely cannot do it to take the inode i_rwsem lock in fuse_file_mmap > It's complicated. I need to look at the whole thing again. > > > > While looking at this code, the invalidate_inode_pages2() looks suspicious. > > If inode is already in FUSE_I_CACHE_IO_MODE when performing > > another mmap, doesn't that have potential for data loss? > > (even before your patch I mean) > > > >> Amir, right now it only sets > >> FUSE_I_CACHE_IO_MODE for VM_MAYWRITE. Maybe you could add a condition > >> for passthrough there? > >> > > > > We could add a condition, but I don't think that we should. > > I think we should refrain from different behavior when it is not justified. > > I think it is not justified to allow parallel dio if any file is open in > > caching mode on the inode and any mmap (private or shared) > > exists on the inode. > > > > That means that FUSE_I_CACHE_IO_MODE should be set on > > any mmap, and already on open for non direct_io files. > > Ok, I can change and add that. Doing it in open is definitely needed > for O_DIRECT (in my other dio branch). > Good, the more common code the better. > > > > Mixing caching and direct io on the same inode is hard as it is > > already and there is no need to add complexity by allowing > > parallel dio in that case. IMO it wins us nothing. > > So the slight issue I see are people like me, who check the content > of a file during a long running computation. Like an HPC application > is doing some long term runs. Then in the middle of > the run the user wants to see the current content of the file and > reads it - if that is done through mmap (and from a node that runs > the application), parallel DIO is disabled with the current patch > until the file is closed - I see the use case to check for writes. > That's what I thought. > > > > > The FUSE_I_CACHE_IO_MODE could be cleared on last file > > close (as your patch did) but it could be cleared earlier if > > instead of tracking refcount of open file, we track refcount of > > files open in caching mode or mmaped, which is what the > > FOPEN_MMAP_CACHE flag I suggested is for. > > But how does open() know that a file/fd is used for mmap? > Because what I tried to suggest is a trick/hack: first mmap on direct_io file sets FOPEN_MMAP_CACHE on the file and bumps the cached_opens on the inode as if file was opened in caching mode or in FOPEN_MMAP_CACHE mode. When the file that was used for mmap is closed and all the rest of the open files have only ever been used for direct_io, then inode exists the caching io mode. Using an FOPEN flag for that is kind of a hack. We could add an internal file state bits for that as well, but my thinking was that FOPEN_MMAP_CACHE could really be set by the server to mean per-file ALLOW_MMAP instead of the per-filesystem ALLOW_MMAP. Not sure if that will be useful. Sorry for the hand waving. I was trying to send out a demo patch that explains it better, but got caught up with other things. Thanks, Amir.