On 7/26/23 19:07, Jiachen Zhang wrote:
On 2023/7/26 00:57, Hao Xu wrote:
On 7/25/23 21:00, Bernd Schubert wrote:
On 7/25/23 12:11, Hao Xu wrote:
On 7/21/23 19:56, Bernd Schubert wrote:
On July 21, 2023 1:27:26 PM GMT+02:00, Hao Xu <hao.xu@xxxxxxxxx>
wrote:
On 7/21/23 14:35, Jiachen Zhang wrote:
On 2023/6/30 17:46, Hao Xu wrote:
From: Hao Xu <howeyxu@xxxxxxxxxxx>
In direct_io_relax mode, there can be shared mmaped files and
thus dirty
pages in its page cache. Therefore those dirty pages should be
written
back to backend before direct write to avoid data loss.
Signed-off-by: Hao Xu <howeyxu@xxxxxxxxxxx>
---
fs/fuse/file.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 176f719f8fc8..7c9167c62bf6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1485,6 +1485,13 @@ ssize_t fuse_direct_io(struct
fuse_io_priv *io, struct iov_iter *iter,
if (!ia)
return -ENOMEM;
+ if (fopen_direct_write && fc->direct_io_relax) {
Hi,
Seems this patchset flushes and invalidates the page cache before
doing the direct-io writes, which avoids data loss caused by flushing
staled data to FUSE daemon. And I tested it works well.
But there is also another side of the same problem we should consider.
If a file is modified through its page cache (shared mmapped regions,
or non-FOPEN_DIRECT_IO files), the following direct-io reads may
bypass the new data in dirty page cache and read the staled data from
FUSE daemon. I think this is also a problem that should be fixed. It
could be fixed by uncondictionally calling
filemap_write_and_wait_range() before direct-io read.
Yea, I think this is true, I'll fix it in v2. Thanks Jiachen.
+ res = filemap_write_and_wait_range(mapping, pos, pos +
count - 1);
+ if (res) {
+ fuse_io_free(ia);
+ return res;
+ }
+ }
if (!cuse && fuse_range_is_writeback(inode, idx_from,
idx_to)) {
if (!write)
inode_lock(inode);
Tested-by: Jiachen Zhang <zhangjiachen.jaycee@xxxxxxxxxxxxx>
Looks good to me.
By the way, the behaviour would be a first FUSE_WRITE flushing
the page cache, followed by a second FUSE_WRITE doing the direct
IO. In the future, further optimization could be first write
into the page cache and then flush the dirty page to the FUSE
daemon.
I think this makes sense, cannot think of any issue in it for
now, so
I'll do that change and send next version, super thanks, Jiachen!
Thanks,
Hao
Thanks,
Jiachen
On my phone, sorry if mail formatting is not optimal.
Do I understand it right? You want DIO code path copy into pages
and then flush/invalidate these pages? That would be punish DIO
for for the unlikely case there are also dirty pages (discouraged
IO pattern).
Hi Bernd,
I think I don't get what you said, why it is punishment and why
it's discouraged IO pattern?
On my first eyes seeing Jiachen's idea, I was thinking "that sounds
disobeying direct write semantics" because usually direct write is
"flush dirty page -> invalidate page -> write data through to backend"
not "write data to page -> flush dirty page/(writeback data)"
The latter in worst case write data both to page cache and backend
while the former just write to backend and load it to the page cache
when buffered reading. But seems there is no such "standard way" which
says we should implement direct IO in that way.
Hi Hao,
sorry for being brief last week, I was on vacation and
reading/writing some mails on my phone.
With 'punishment' I mean memory copies to the page cache - memory
copies are expensive and DIO should avoid it.
Right now your patch adds filemap_write_and_wait_range(), but we do
not know if it did work (i.e. if pages had to be flushed). So unless
you find a way to get that information, copy to page cache would be
unconditionally - overhead of memory copy even if there are no dirty
pages.
Ah, looks I understood what you mean in my last email reply. Yes,
just like what I said in last email:
[1] flush dirty page --> invalidate page --> write data to backend
This is what we do for direct write right now in kernel, I call
this policy "write-through", since it doesn't care much about the cache.
[2] write data to page cache --> flush dirty page in suitable time
This is "write-back" policy, used by buffered write. Here in
this patch's case, we flush pages synchronously, so it still can be
called direct-write.
Surely, in the worst case, the page is clean, then [2] has one extra
memory copy than [1]. But like what I pointed out, for [2], next time
buffered
read happens, the page is in latest state, so no I/O needed, while
for [1], it has to load data from backend to page cache.
Write-through, write-back and direct-io are also exlained in the
kernel documentation [*], of which write-through and write-back are
cache modes. According to the document, the pattern [2] is similar to
the FUSE write-back mode, but the pattern [1] is different from the
FUSE write-through mode. The FUSE write-through mode obeys the 'write
data to page cache --> flush dirty page synchronously' (let us call it
pattern [3]), which keeps the clean cache in-core after flushing.
To improve performance while keeping the direct-io semantics, my
thoughts was in the future, maybe we can fallback to the pattern [3]
if the target page is in-core, otherwise keep the original direct-io
pattern without reading from whole pages from FUSE daemon.
[*] https://www.kernel.org/doc/Documentation/filesystems/fuse-io.txt
Thanks,
Jiachen
With 'discouraged' I mean mix of page cache and direct-io. Typically
one should only do either of both (page cache or DIO), but not a mix
of them. For example see your patch, it flushes the page cache, but
without a lock - races are possible. Copying to the page cache might
be a solution, but it has the overhead above.
For race, we held inode lock there, do I miss anything?
Thanks,
Bernd
I now think it's good to keep the pattern same as other filesystems
which is [1] to avoid possible performance issues in the future,
thanks Bernd.
Hao