On 12/5/23 15:01, Bernd Schubert wrote:
On 12/5/23 08:00, Amir Goldstein wrote:
On Tue, Dec 5, 2023 at 1:42 AM Bernd Schubert
<bernd.schubert@xxxxxxxxxxx> wrote:
On 12/4/23 11:04, Bernd Schubert wrote:
On 12/4/23 10:27, Miklos Szeredi wrote:
On Mon, 4 Dec 2023 at 07:50, Amir Goldstein <amir73il@xxxxxxxxx>
wrote:
On Mon, Dec 4, 2023 at 1:00 AM Bernd Schubert
<bernd.schubert@xxxxxxxxxxx> wrote:
Hi Amir,
On 12/3/23 12:20, Amir Goldstein wrote:
On Sat, Dec 2, 2023 at 5:06 PM Amir Goldstein <amir73il@xxxxxxxxx>
wrote:
On Mon, Nov 6, 2023 at 4:08 PM Bernd Schubert
<bernd.schubert@xxxxxxxxxxx> wrote:
Hi Miklos,
On 9/20/23 10:15, Miklos Szeredi wrote:
On Wed, 20 Sept 2023 at 04:41, Tyler Fanelli
<tfanelli@xxxxxxxxxx> wrote:
At the moment, FUSE_INIT's DIRECT_IO_RELAX flag only serves the purpose
of allowing shared mmap of files opened/created with DIRECT_IO enabled.
However, it leaves open the possibility of further relaxing the
DIRECT_IO restrictions (and in effect, the cache coherency guarantees
of DIRECT_IO) in the future.
The DIRECT_IO_ALLOW_MMAP flag leaves no ambiguity of its purpose. It
only serves to allow shared mmap of DIRECT_IO files, while still
bypassing the cache on regular reads and writes. The shared mmap is the
only loosening of the cache policy that can take place with the flag.
This removes some ambiguity and introduces a more stable flag to be
used in FUSE_INIT. Furthermore, we can document that to allow shared
mmap'ing of DIRECT_IO files, a user must enable DIRECT_IO_ALLOW_MMAP.
Tyler Fanelli (2):
fs/fuse: Rename DIRECT_IO_RELAX to DIRECT_IO_ALLOW_MMAP
docs/fuse-io: Document the usage of DIRECT_IO_ALLOW_MMAP
Looks good.
Applied, thanks. Will send the PR during this merge window, since the
rename could break stuff if already released.
I'm just porting this feature back to our internal fuse module and it
looks like these rename patches have been forgotten?
Hi Miklos, Bernd,
I was looking at the DIRECT_IO_ALLOW_MMAP code and specifically at
commit b5a2a3a0b776 ("fuse: write back dirty pages before direct write
in direct_io_relax mode") and I was wondering - isn't dirty pages
writeback needed *before* invalidate_inode_pages2() in fuse_file_mmap()
for the direct_io_allow_mmap case?
For FUSE_PASSTHROUGH, I am going to need to call fuse_vma_close() for
munmap of files also in direct-io mode [1], so I was considering
installing fuse_file_vm_ops for the FOPEN_DIRECT_IO case, same as the
caching case, and regardless of direct_io_allow_mmap.
I was asking myself if there was a good reason why fuse_page_mkwrite()/
fuse_wait_on_page_writeback()/fuse_vma_close()/write_inode_now()
should NOT be called for the FOPEN_DIRECT_IO case regardless of
direct_io_allow_mmap?
Before trying to make changes to fuse_file_mmap() I tried to test
DIRECT_IO_RELAX - I enabled it in libfuse and ran fstests with
passthrough_hp --direct-io.
The test generic/095 - "Concurrent mixed I/O (buffer I/O, aiodio, mmap,
splice) on the same files" - blew up, hitting BUG_ON(fi->writectr < 0)
in fuse_set_nowrite().
I am wondering how this code was tested?
I could not figure out the problem and how to fix it. Please suggest a
fix and let me know which adjustments are needed if I want to use
fuse_file_vm_ops for all mmap modes.
So fuse_set_nowrite() tests for inode_is_locked(), but that also
succeeds for a shared lock. It's getting late here (and I might be
missing something), but I think we have an issue with
FOPEN_PARALLEL_DIRECT_WRITES. Assuming there were plain O_DIRECT and
mmap, the same issue might be triggered? Hmm, well, so far plain
O_DIRECT does not support FOPEN_PARALLEL_DIRECT_WRITES yet - the
patches for that are still pending.
Your analysis seems to be correct.
Attached patch fixes the problem and should be backported to 6.6.y.
Miklos,
I prepared the patch on top of master and not on top of the rename to
FUSE_DIRECT_IO_ALLOW_MMAP in for-next, for ease of backport to 6.6.y.
Although if you are planning to send the flag rename to v6.7 as a fix,
you may prefer to apply the fix after the rename and request to
backport the flag rename along with the fix to 6.6.y.
I've done that. Thanks for the fix and testing.
Hi Amir, hi Miklos,
could you please hold on a bit before sending the patch upstream?
I think we can just test for fuse_range_is_writeback() in
fuse_direct_write_iter(). I will have a patch in a few minutes.
Hmm, that actually doesn't work, as we would need to hold the inode
lock in the page write functions.
I then tried to do it per inode and only when the inode gets cached
writes or mmap - this triggers a lockdep lock order warning, because
fuse_file_mmap() is called with mm->mmap_lock held and would take the
inode lock, but through
fuse_direct_io()/iov_iter_get_pages2()/__iov_iter_get_pages_alloc()
these locks are taken the other way around.
So right now I don't see a way out - we need to go with Amir's patch
first.
Is it actually important for a FUSE_DIRECT_IO_ALLOW_MMAP fs
(e.g. virtiofsd) to support FOPEN_PARALLEL_DIRECT_WRITES?
I guess not, otherwise the combination would have been tested.
I'm not sure how many people are aware of these different flags/features.
I had just finalized the backport of the related patches to RHEL8 on
Friday, as we (or our customers) need both for different jobs.
FOPEN_PARALLEL_DIRECT_WRITES is typically important for
network fs and FUSE_DIRECT_IO_ALLOW_MMAP is typically not
for network fs. Right?
We kind of have these use cases for our network file systems:
FOPEN_PARALLEL_DIRECT_WRITES:
- Traditional HPC, large files, parallel IO
- Large file used on local node as container for many small files
FUSE_DIRECT_IO_ALLOW_MMAP:
- compilation through gcc (not so important, just not nice when it
does not work)
- rather recent: python libraries using mmap _reads_. As it is
read-only, there is no consistency issue.
These jobs do not intermix - no issue as in generic/095. If such
applications really exist, I have no issue with a serialization penalty.
Just disabling FOPEN_PARALLEL_DIRECT_WRITES because other
nodes/applications need FUSE_DIRECT_IO_ALLOW_MMAP is not so nice.
The final goal is also to have FOPEN_PARALLEL_DIRECT_WRITES work with
plain O_DIRECT and not only for FUSE_DIRECT_IO - I need to update this
branch and post the next version:
https://github.com/bsbernd/linux/commits/fuse-dio-v4
In the meantime I have another idea how to solve
FOPEN_PARALLEL_DIRECT_WRITES + FUSE_DIRECT_IO_ALLOW_MMAP.
Please find attached what I had in mind. With that, generic/095 is not
crashing for me anymore. I just finished the initial coding - it still
needs a bit of cleanup and maybe a few comments.
Thanks,
Bernd
fuse: Create helper function if DIO write needs exclusive lock
From: Bernd Schubert <bschubert@xxxxxxx>
This is just a preparation for follow up patches.
Cc: Hao Xu <howeyxu@xxxxxxxxxxx>
Cc: Miklos Szeredi <miklos@xxxxxxxxxx>
Cc: Dharmendra Singh <dsingh@xxxxxxx>
Cc: Amir Goldstein <amir73il@xxxxxxxxx>
Signed-off-by: Bernd Schubert <bschubert@xxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
Preparation for Fixes: 153524053bbb ("fuse: allow non-extending parallel direct writes on the same file")
---
fs/fuse/file.c | 53 ++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 36 insertions(+), 17 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1cdb6327511e..60d4e1e50843 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1298,6 +1298,38 @@ static ssize_t fuse_perform_write(struct kiocb *iocb, struct iov_iter *ii)
return res;
}
+static bool fuse_io_past_eof(struct kiocb *iocb, struct iov_iter *iter)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ return iocb->ki_pos + iov_iter_count(iter) > i_size_read(inode);
+}
+
+/*
+ * @return true if an exclusive lock for direct IO writes is needed
+ */
+static bool fuse_dio_wr_exclusive_lock(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct file *file = iocb->ki_filp;
+ struct fuse_file *ff = file->private_data;
+
+ /* server side has to advise that it supports parallel dio writes */
+ if (!(ff->open_flags & FOPEN_PARALLEL_DIRECT_WRITES))
+ return true;
+
+ /* append will need to know the eventual eof - always needs an
+ * exclusive lock
+ */
+ if (iocb->ki_flags & IOCB_APPEND)
+ return true;
+
+ /* parallel dio beyond eof is at least for now not supported */
+ if (fuse_io_past_eof(iocb, from))
+ return true;
+
+ return false;
+}
+
static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct file *file = iocb->ki_filp;
@@ -1557,25 +1589,12 @@ static ssize_t fuse_direct_read_iter(struct kiocb *iocb, struct iov_iter *to)
return res;
}
-static bool fuse_direct_write_extending_i_size(struct kiocb *iocb,
- struct iov_iter *iter)
-{
- struct inode *inode = file_inode(iocb->ki_filp);
-
- return iocb->ki_pos + iov_iter_count(iter) > i_size_read(inode);
-}
-
static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct inode *inode = file_inode(iocb->ki_filp);
- struct file *file = iocb->ki_filp;
- struct fuse_file *ff = file->private_data;
struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(iocb);
ssize_t res;
- bool exclusive_lock =
- !(ff->open_flags & FOPEN_PARALLEL_DIRECT_WRITES) ||
- iocb->ki_flags & IOCB_APPEND ||
- fuse_direct_write_extending_i_size(iocb, from);
+ bool exclusive_lock = fuse_dio_wr_exclusive_lock(iocb, from);
/*
* Take exclusive lock if
@@ -1588,10 +1607,10 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
else {
inode_lock_shared(inode);
- /* A race with truncate might have come up as the decision for
- * the lock type was done without holding the lock, check again.
+ /*
+ * Previous check was without any lock and might have raced.
*/
- if (fuse_direct_write_extending_i_size(iocb, from)) {
+ if (fuse_dio_wr_exclusive_lock(iocb, from)) {
inode_unlock_shared(inode);
inode_lock(inode);
exclusive_lock = true;
fuse: Test for page cache writes in the shared lock DIO decision
From: Bernd Schubert <bschubert@xxxxxxx>
xfstest generic/095 triggers BUG_ON(fi->writectr < 0) in
fuse_set_nowrite().
This happens with a shared lock for FOPEN_DIRECT_IO when mmap writes
happen in parallel (FUSE_DIRECT_IO_RELAX is set).
The reason is that multiple DIO writers see that the inode has pending
page IO writes and try to set FUSE_NOWRITE, but this code path requires
serialization. Ideal would be to let fuse_dio_wr_exclusive_lock()
detect if there are outstanding writes, but that would require holding
an inode lock in the related page/folio write paths. Another solution
would be to disable the shared inode lock for FOPEN_DIRECT_IO when
FUSE_DIRECT_IO_RELAX is set, but typically the userspace/server side
will set these flags for all inodes (or not at all). Hence,
FUSE_DIRECT_IO_RELAX would entirely disable the shared lock and impose
serialization even for inodes that never see any page IO.
The solution here stores a flag in the fuse inode once page writes have
happened to it, and only then enforces the exclusive lock.
Cc: Hao Xu <howeyxu@xxxxxxxxxxx>
Cc: Miklos Szeredi <miklos@xxxxxxxxxx>
Cc: Dharmendra Singh <dsingh@xxxxxxx>
Cc: Amir Goldstein <amir73il@xxxxxxxxx>
Signed-off-by: Bernd Schubert <bschubert@xxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
Fixes: 153524053bbb ("fuse: allow non-extending parallel direct writes on the same file")
---
fs/fuse/dir.c | 1 +
fs/fuse/file.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++------
fs/fuse/fuse_i.h | 9 +++++++
3 files changed, 74 insertions(+), 8 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index d19cbf34c634..09aaaa31ae28 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1751,6 +1751,7 @@ void fuse_set_nowrite(struct inode *inode)
struct fuse_inode *fi = get_fuse_inode(inode);
BUG_ON(!inode_is_locked(inode));
+ lockdep_assert_held_write(&inode->i_rwsem);
spin_lock(&fi->lock);
BUG_ON(fi->writectr < 0);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 60d4e1e50843..9959eafca0a0 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1308,26 +1308,59 @@ static bool fuse_io_past_eof(struct kiocb *iocb, struct iov_iter *iter)
/*
* @return true if an exclusive lock for direct IO writes is needed
*/
-static bool fuse_dio_wr_exclusive_lock(struct kiocb *iocb, struct iov_iter *from)
+static bool fuse_dio_wr_exclusive_lock(struct kiocb *iocb, struct iov_iter *from,
+ bool *cnt_increased)
{
struct file *file = iocb->ki_filp;
+ struct inode *inode = file_inode(iocb->ki_filp);
+ struct fuse_inode *fi = get_fuse_inode(inode);
struct fuse_file *ff = file->private_data;
+ struct fuse_conn *fc = ff->fm->fc;
+ bool excl_lock = true;
/* server side has to advise that it supports parallel dio writes */
if (!(ff->open_flags & FOPEN_PARALLEL_DIRECT_WRITES))
- return true;
+ goto out;
/* append will need to know the eventual eof - always needs an
* exclusive lock
*/
if (iocb->ki_flags & IOCB_APPEND)
- return true;
+ goto out;
/* parallel dio beyond eof is at least for now not supported */
if (fuse_io_past_eof(iocb, from))
- return true;
+ goto out;
- return false;
+ /* no need to optimize async requests */
+ if (!is_sync_kiocb(iocb) && iocb->ki_flags & IOCB_DIRECT &&
+ fc->async_dio)
+ goto out;
+
+ /* The inode ever got page writes and we do not know for sure
+ * in the DIO path if these are pending - shared lock not possible */
+ spin_lock(&fi->lock);
+ if (!test_bit(FUSE_I_CACHE_WRITES, &fi->state)) {
+ if (!(*cnt_increased)) {
+ fi->shared_lock_direct_io_ctr++;
+ *cnt_increased = true;
+ }
+ excl_lock = false;
+ }
+ spin_unlock(&fi->lock);
+
+out:
+ if (excl_lock && *cnt_increased) {
+ bool wake = false;
+ spin_lock(&fi->lock);
+ if (--fi->shared_lock_direct_io_ctr == 0)
+ wake = true;
+ spin_unlock(&fi->lock);
+ if (wake)
+ wake_up(&fi->direct_io_waitq);
+ }
+
+ return excl_lock;
}
static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
@@ -1549,6 +1582,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
break;
}
}
+
if (ia)
fuse_io_free(ia);
if (res > 0)
@@ -1592,9 +1626,12 @@ static ssize_t fuse_direct_read_iter(struct kiocb *iocb, struct iov_iter *to)
static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct inode *inode = file_inode(iocb->ki_filp);
+ struct fuse_inode *fi = get_fuse_inode(inode);
struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(iocb);
ssize_t res;
- bool exclusive_lock = fuse_dio_wr_exclusive_lock(iocb, from);
+ bool shared_lock_cnt_inc = false;
+ bool exclusive_lock = fuse_dio_wr_exclusive_lock(iocb, from,
+ &shared_lock_cnt_inc);
/*
* Take exclusive lock if
@@ -1610,7 +1647,8 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
/*
* Previous check was without any lock and might have raced.
*/
- if (fuse_dio_wr_exclusive_lock(iocb, from)) {
+ if (fuse_dio_wr_exclusive_lock(iocb, from,
+ &shared_lock_cnt_inc)) {
inode_unlock_shared(inode);
inode_lock(inode);
exclusive_lock = true;
@@ -1629,8 +1667,17 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
}
if (exclusive_lock)
inode_unlock(inode);
- else
+ else {
+ bool wake = false;
+
inode_unlock_shared(inode);
+ spin_lock(&fi->lock);
+ if (--fi->shared_lock_direct_io_ctr == 0)
+ wake = true;
+ spin_unlock(&fi->lock);
+ if (wake)
+ wake_up(&fi->direct_io_waitq);
+ }
return res;
}
@@ -1719,6 +1766,13 @@ __acquires(fi->lock)
__u64 data_size = wpa->ia.ap.num_pages * PAGE_SIZE;
int err;
+ if (!test_bit(FUSE_I_CACHE_WRITES, &fi->state)) {
+ set_bit(FUSE_I_CACHE_WRITES, &fi->state);
+ spin_unlock(&fi->lock);
+ wait_event(fi->direct_io_waitq, fi->shared_lock_direct_io_ctr == 0);
+ spin_lock(&fi->lock);
+ }
+
fi->writectr++;
if (inarg->offset + data_size <= size) {
inarg->size = data_size;
@@ -3261,7 +3315,9 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
INIT_LIST_HEAD(&fi->write_files);
INIT_LIST_HEAD(&fi->queued_writes);
fi->writectr = 0;
+ fi->shared_lock_direct_io_ctr = 0;
init_waitqueue_head(&fi->page_waitq);
+ init_waitqueue_head(&fi->direct_io_waitq);
fi->writepages = RB_ROOT;
if (IS_ENABLED(CONFIG_FUSE_DAX))
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 6e6e721f421b..febb1f5cd53f 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -110,11 +110,17 @@ struct fuse_inode {
* (FUSE_NOWRITE) means more writes are blocked */
int writectr;
+ /* counter of tasks with shared lock direct-io writes */
+ int shared_lock_direct_io_ctr;
+
/* Waitq for writepage completion */
wait_queue_head_t page_waitq;
/* List of writepage requestst (pending or sent) */
struct rb_root writepages;
+
+ /* waitq for direct-io completion */
+ wait_queue_head_t direct_io_waitq;
};
/* readdir cache (directory only) */
@@ -172,6 +178,9 @@ enum {
FUSE_I_BAD,
/* Has btime */
FUSE_I_BTIME,
+ /* Has paged writes */
+ FUSE_I_CACHE_WRITES,
+
};
struct fuse_conn;