Re: fsnotify pre-modify VFS hooks (Was: fanotify and LSM path hooks)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Jun 30, 2020 at 12:20 PM Jan Kara <jack@xxxxxxx> wrote:
>
> On Fri 26-06-20 14:06:37, Amir Goldstein wrote:
> > On Wed, Apr 17, 2019 at 2:30 PM Jan Kara <jack@xxxxxxx> wrote:
> > >
> > > On Tue 16-04-19 21:24:44, Amir Goldstein wrote:
> > > > > I'm not so sure about directory pre-modification hooks. Given the amount of
> > > > > problems we face with applications using fanotify permission events and
> > > > > deadlocking the system, I'm not very fond of expanding that API... AFAIU
> > > > > you want to use such hooks for recording (and persisting) that some change
> > > > > is going to happen and provide crash-consistency guarantees for such
> > > > > journal?
> > > > >
> > > >
> > > > That's the general idea.
> > > > I have two use cases for pre-modification hooks:
> > > > 1. VFS level snapshots
> > > > 2. persistent change tracking
> > > >
> > > > TBH, I did not consider implementing any of the above in userspace,
> > > > so I do not have a specific interest in extending the fanotify API.
> > > > I am actually interested in pre-modify fsnotify hooks (not fanotify),
> > > > that a snapshot or change tracking subsystem can register with.
> > > > An in-kernel fsnotify event handler can set a flag in current task
> > > > struct to circumvent system deadlocks on nested filesystem access.
> > >
> > > OK, I'm not opposed to fsnotify pre-modify hooks as such. As long as
> > > handlers stay within the kernel, I'm fine with that. After all this is what
> > > LSMs are already doing. Just exposing this to userspace for arbitration is
> > > what I have a problem with.
> > >
> >
> > Short update on that.
> >
> > I decided to ditch the LSM hooks approach because I realized that for
> > the purpose of persistent change tracking, the pre-modify hooks need
> > to be called before the caller is taking filesystem locks.
> >
> > So I added hooks inside mnt_want_write and file_start_write wrappers:
> > https://github.com/amir73il/linux/commits/fsnotify_pre_modify
>
> FWIW I've glanced through the series. I like the choice of mnt_want_write()
> and file_start_write() as a place to generate the event. I somewhat dislike

Thanks. I was looking for this initial feedback to know if direction in sane.

> the number of variants you have to introduce and then pass NULL in some
> places because you don't have the info available and then it's not
> immediately clear what semantics the event consumers can expect... That
> would be good to define and then verify in the code.
>

I am not sure I understand what you mean.
Did you mean that mnt_want_write_at() mnt_want_write_path() should be
actual functions instead of inline wrappers or something else?

> Also given you have the requirement "no fs locks on event generation", I'm
> not sure how reliable this can be. If you don't hold fs locks when
> generating event, cannot it happen that actually modified object is
> different from the reported one because we raced with some other fs
> operations? And can we prove that? So what exactly is the usecase and
> guarantees the event needs to provide?
>

That's a good question. Answer is not trivial.
The use case is "persistent change tracking snapshot".
"snapshot" because it tracks ALL changes since a point in time -
there is no concept of "consuming" events.
It is important to note that this is complementary to real time fs events.
A user library may combine the two mechanisms to a stream of changes
(either recorded or live), but that is out of scope for this effort.
Also, userspace would likely create periodic snapshots, so that e.g.
current snapshot records changes, while previous snapshot recorded
changes are being scanned.

The concept is to record every dir fid *before* an immediate child or directory
metadata itself may change, so that after a crash, all recorded dir fids
may be scanned to search for possibly missed changes.

The dir fid is stable, so races are not an issue in that respect.
When name is recorded, change tracking never examines the object at that
name, it just records the fact that there has been a change at [dir fid;name].
This is mostly needed to track creates.

Other than that, races should be handled by the backend itself, so proof is
pending the completion of the backend POC, but in hand waving:
- All name changes in the filesystem call the backend before the change
  (because backend marks sb) and backend is responsible for locking
against races
- My current implementation uses overlayfs upper/index as the change
  track storage, which has the benefit that the test "is change recorded"
  is implemented by decode_fh and/or name lookup, so it is already very much
  optimized by inode and dentry cache and shouldn't need any locking for
  most  pre_modify calls
- It is also not a coincidence that overlayfs changes to upper fs do not
  trigger pre_modify hooks because that prevents the feedback loop.
  I wrote in commit message that "is consistent with the fact that overlayfs
  sets the FMODE_NONOTIFY flag on underlying open files" - that is needed
  because the path in underlying files is "fake" (open_with_fake_path()).

If any of this hand waving sounds terribly wrong please let me know.
Otherwise I will report back after POC is complete with a example backend.

Thanks,
Amir.



[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux