Hi Nikolaus,
On 08/05/2017 01:45 PM, Nikolaus Rath wrote:
On Aug 04 2017, Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
On Fri, Aug 4, 2017 at 9:10 PM, Nikolaus Rath <Nikolaus@xxxxxxxx> wrote:
Hello,
I am confused about how O_APPEND is supposed to interact with the
writeback cache.
As far as I can tell, the O_APPEND flag is currently passed to the
filesystem process, so my expectation is that the filesystem process is
responsible for ignoring any offset in write requests and instead writing
at the current end of the file[1].
However, with writeback cache enabled the filesystem process cannot tell
which data is "new" (came from userspace and should be appended) and
which data is old and just made a round-trip to the kernel. So it seems
to me that the filesystem process should probably leave the handling of
O_APPEND to the kernel. But then, shouldn't the kernel filter out this
flag when sending the open request?
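For the case without writeback cache described above (ignore the offset
in the write request and write at the current end of the file), a minimal
sketch could look like the following. It assumes a filesystem that is
backed by a local file descriptor; write_honoring_append() is a
hypothetical helper, not a libfuse function, and the lseek()/pwrite()
pair is not atomic with respect to other writers.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of libfuse): write `len` bytes from `buf`
 * to the backing file `fd`. `open_flags` are the flags the filesystem
 * saw in the open request. If O_APPEND is set, the requested offset is
 * ignored and the data goes to the current end of the file.
 *
 * Returns the number of bytes written, or -1 with errno set.
 * Note: lseek() followed by pwrite() is not atomic; a real filesystem
 * would need its own locking here.
 */
static ssize_t write_honoring_append(int fd, const void *buf, size_t len,
                                     off_t off, int open_flags)
{
    if (open_flags & O_APPEND) {
        off_t end = lseek(fd, 0, SEEK_END);
        if (end == (off_t)-1)
            return -1;
        off = end;              /* ignore the offset from the request */
    }
    return pwrite(fd, buf, len, off);
}
```

This only makes sense when every write request carries new data from
userspace, which is exactly what no longer holds once the writeback
cache is enabled.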
Indeed, when writing back the cache the kernel should definitely not
set O_APPEND.
Well, 4.9 certainly does it though. Should I try to make a patch, or are
you or Maxim going to do that shortly anyway?
Do you think it makes sense to filter out O_APPEND in libfuse as well
(to work around the issue for present day kernels)?
I think it's up to the filesystem how to handle O_APPEND. The kernel
shouldn't filter it out.
On the other hand, when the kernel handles O_APPEND, then it is no
longer atomic (think of a network fuse filesystem).
Yes, a network filesystem generally needs to handle consistency of
caches across nodes, and O_APPEND is no exception (i.e. you cannot have
two nodes writing with O_APPEND to their caches at the same time,
because that will not work).
This poses a bit of a problem though. So a network filesystem either
cannot use writeback caching or O_APPEND will (silently) not work.
With the current behavior (O_APPEND being passed to open() when
writeback is enabled) the filesystem would at least have a chance to
return an error, i.e. instead of a silent failure there would be a noisy
error. With that in mind, maybe the current behavior isn't so bad? We'd
just have to document that if writeback cache is enabled and O_APPEND
is received, the filesystem has to decide if it is fine with the kernel
handling O_APPEND (and in that case ignore the flag for subsequent
writes) or return an error.
Yes, I agree. For some filesystems O_APPEND is problematic, for others
not. Let them decide.
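The decision discussed above (accept kernel-side O_APPEND and ignore the
flag for subsequent writes, or return an error from open) might be
sketched as follows. All names here are hypothetical and not part of
libfuse; the choice of ENOTSUP as the error code is also just an
assumption for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical per-filesystem policy; not a libfuse structure. */
struct append_policy {
    bool writeback_cache;   /* writeback cache is enabled for this mount */
    bool kernel_append_ok;  /* fs accepts non-atomic, kernel-side O_APPEND */
};

/*
 * Inspect the flags of an open request.
 * Returns 0 if the open may proceed, or a negative errno value.
 * On success, *fs_handles_append says whether the filesystem itself
 * must redirect writes to the end of the file.
 */
static int check_open_flags(const struct append_policy *p, int flags,
                            bool *fs_handles_append)
{
    *fs_handles_append = false;

    if (!(flags & O_APPEND))
        return 0;                   /* nothing special to do */

    if (!p->writeback_cache) {
        *fs_handles_append = true;  /* every write is new data */
        return 0;
    }

    /* Writeback cache: the fs cannot tell new data from written-back data. */
    if (p->kernel_append_ok)
        return 0;                   /* leave O_APPEND to the kernel */

    return -ENOTSUP;                /* noisy error instead of silent failure */
}
```

A local filesystem would typically set kernel_append_ok, while a network
filesystem that needs atomic appends across nodes would leave it unset
and fail the open.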
Thanks,
Maxim