Re: thoughts about fanotify and HSM

Amir Goldstein <amir73il@xxxxxxxxx> · Thu, 17 Nov 2022 14:38:51 +0200

> > > > The checkpoint would then do:
> > > > start gathering changes for both T and T+1
> > > > clear ignore marks
> > > > synchronize_srcu()
> > > > stop gathering changes for T and report them
> > > >
> > > > And in this case we would not need POST_WRITE as an event.
> > > >
> > >
> > > Why then give up on the POST_WRITE events idea?
> > > Don't you think it could work?
> >
> > So as we are discussing, the POST_WRITE event is not useful when we want to
> > handle crash safety. And if we have some other mechanism (like SRCU) which
> > is able to guarantee crash safety, then what is the benefit of POST_WRITE?
> > I'm not against POST_WRITE, I just don't see much value in it if we have
> > another mechanism to deal with events straddling checkpoint.
> >
>
> Not sure I follow.
>
> I think that crash safety can be achieved also with PRE/POST_WRITE:
> - PRE_WRITE records an intent to write in persistent snapshot T
>   and add to in-memory map of in-progress writes of period T
> - When "checkpoint T" starts, new PRE_WRITES are recorded in both
>   T and T+1 persistent snapshots, but event is added only to
>   in-memory map of in-progress writes of period T+1
> - "checkpoint T" ends when all in-progress writes of T are completed
>
> The trick with alternating snapshots "handover" is this
> (perhaps I never explained it and I need to elaborate on the wiki [1]):
>
> [1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Modified_files_query
>
> The changed files query results need to include recorded changes in both
> "finalizing" snapshot T and the new snapshot T+1 that was started in
> the beginning of the query.
>
> Snapshot T MUST NOT be discarded until checkpoint/handover
> is complete AND the query results that contain changes recorded
> in T and T+1 snapshots have been consumed.
>
> When the consumer ACKs that the query results have been safely stored
> or acted upon (I called this operation "bless" snapshot T+1) then and
> only then can snapshot T be discarded.
>
> After snapshot T is discarded a new query will start snapshot T+2.
> A changed files query result includes the id of the last blessed snapshot.
>
> I think this is more or less equivalent to the SRCU that you suggested,
> but all the work is done in userspace at application level.
>
> If you see any problem with this scheme or don't understand it
> please let me know and I will try to explain better.
>
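
To make the handover scheme quoted above a bit more concrete, here is
a rough userspace sketch in Python. It is only a toy model under my
own assumptions: the class and method names (ChangeTracker,
start_query, results, bless) are made up for illustration, the
"persistent" snapshots are plain in-memory sets, and PRE/POST_WRITE
are ordinary method calls rather than fanotify events.

```python
class ChangeTracker:
    """Toy model of the alternating-snapshot handover (names are hypothetical)."""

    def __init__(self):
        self.current = 0               # snapshot id that new writes belong to
        self.finalizing = None         # snapshot T while "checkpoint T" is running
        self.snapshots = {0: set()}    # stand-in for persistent snapshot storage
        self.in_progress = {0: set()}  # in-memory in-progress writes per period
        self.next_write_id = 0

    def pre_write(self, path):
        """PRE_WRITE: record the intent before the write may proceed."""
        self.snapshots[self.current].add(path)
        if self.finalizing is not None:
            # During "checkpoint T" the intent goes to both T and T+1 ...
            self.snapshots[self.finalizing].add(path)
        write_id = self.next_write_id
        self.next_write_id += 1
        # ... but the in-progress write is tracked only in the newest period.
        self.in_progress[self.current].add(write_id)
        return (self.current, write_id)

    def post_write(self, token):
        """POST_WRITE: the write identified by token has completed."""
        period, write_id = token
        self.in_progress[period].discard(write_id)

    def start_query(self):
        """A changed files query starts snapshot T+1 and "checkpoint T"."""
        assert self.finalizing is None, "previous snapshot not blessed yet"
        self.finalizing = self.current
        self.current += 1
        self.snapshots[self.current] = set()
        self.in_progress[self.current] = set()

    def checkpoint_complete(self):
        """True once all in-progress writes of period T have completed."""
        return self.finalizing is not None and not self.in_progress[self.finalizing]

    def results(self):
        """Query results: changes recorded in both T and T+1."""
        assert self.checkpoint_complete()
        return self.snapshots[self.finalizing] | self.snapshots[self.current]

    def bless(self):
        """Consumer ACK: only now may snapshot T be discarded."""
        assert self.checkpoint_complete()
        del self.snapshots[self.finalizing]
        del self.in_progress[self.finalizing]
        self.finalizing = None
```

A write that starts before start_query() and finishes after it keeps
checkpoint T open until its post_write() arrives, and its path shows
up in results() either way, which is the property the handover is
supposed to provide.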

Hmm, I guess "crash safety" is not well defined.
You and I were talking about "system crash" and indeed, this was
my only concern with the kernel implementation of overlayfs watch.

But with a userspace HSM service, how can we guarantee that
modifications did not happen while the service is down?

I don't really have a good answer for this.

Thinking out loud, we would somehow need to make "deny" the default
permission for all modifications, maybe through some mount
property (e.g. MOUNT_ATTR_PROT_READ), causing the pre-write
hooks to default to EROFS if there is no "vfs filter" mount mark.

Then it will be possible to expose a "safe" mount to users, where
modifications can never go unnoticed even when the HSM service
crashes.
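
As a rough illustration of the default-deny idea, here is a toy
Python model. It is not kernel code: the Mount class, the prot_read
attribute (standing in for the proposed mount property) and the
filter_mark callback (standing in for a "vfs filter" mount mark) are
all hypothetical names that I made up for this sketch.

```python
import errno


class Mount:
    """Toy stand-in for a mount; all fields are hypothetical."""

    def __init__(self, prot_read=False):
        self.prot_read = prot_read  # models the proposed mount property
        self.filter_mark = None     # models a "vfs filter" mount mark, if any


def pre_write(mount, path):
    """Return 0 to allow the write or a negative errno to deny it."""
    if mount.filter_mark is not None:
        # The HSM service is attached: let it record the intent and decide.
        return mount.filter_mark(path)
    if mount.prot_read:
        # No service attached to a "safe" mount: deny by default.
        return -errno.EROFS
    # Ordinary mount without the property: writes are not filtered.
    return 0


# Usage: the service records intents while attached; when its mark goes
# away (e.g. the service crashed), writes fail instead of going unnoticed.
recorded = []
safe = Mount(prot_read=True)
before_attach = pre_write(safe, "/data/f")
safe.filter_mark = lambda path: recorded.append(path) or 0
while_attached = pre_write(safe, "/data/f")
safe.filter_mark = None
after_crash = pre_write(safe, "/data/f")
```

The point of the design is only the ordering of the checks: the
absence of a mark on such a mount means "deny", so a crashed service
can never be mistaken for "nothing to filter".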

Thanks,
Amir.