Re: thoughts about fanotify and HSM

On Thu 17-11-22 14:38:51, Amir Goldstein wrote:
> > > > > The checkpoint would then do:
> > > > > start gathering changes for both T and T+1
> > > > > clear ignore marks
> > > > > synchronize_srcu()
> > > > > stop gathering changes for T and report them
> > > > >
> > > > > And in this case we would not need POST_WRITE as an event.
> > > > >
> > > >
> > > > Why then give up on the POST_WRITE events idea?
> > > > Don't you think it could work?
> > >
> > > So, as we were discussing, the POST_WRITE event is not useful when we
> > > want to handle crash safety. And if we have some other mechanism (like
> > > SRCU) that is able to guarantee crash safety, then what is the benefit of
> > > POST_WRITE? I'm not against POST_WRITE, I just don't see much value in it
> > > if we have another mechanism to deal with events straddling a checkpoint.
> > >
> >
> > Not sure I follow.
> >
> > I think that crash safety can also be achieved with PRE/POST_WRITE:
> > - PRE_WRITE records an intent to write in persistent snapshot T
> >   and adds it to the in-memory map of in-progress writes of period T
> > - When "checkpoint T" starts, new PRE_WRITEs are recorded in both
> >   the T and T+1 persistent snapshots, but the event is added only to
> >   the in-memory map of in-progress writes of period T+1
> > - "checkpoint T" ends when all in-progress writes of T have completed
> >
> > The trick with the alternating-snapshot "handover" is this
> > (perhaps I never explained it, so I may need to elaborate on the wiki [1]):
> >
> > [1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Modified_files_query
> >
> > The changed-files query results need to include the changes recorded in
> > both the "finalizing" snapshot T and the new snapshot T+1 that was
> > started at the beginning of the query.
> >
> > Snapshot T MUST NOT be discarded until the checkpoint/handover
> > is complete AND the query results that contain the changes recorded
> > in the T and T+1 snapshots have been consumed.
> >
> > When the consumer ACKs that the query results have been safely stored
> > or acted upon (I called this operation "blessing" snapshot T+1), then
> > and only then can snapshot T be discarded.
> >
> > After snapshot T is discarded, a new query will start snapshot T+2.
> > A changed-files query result includes the id of the last blessed snapshot.
> >
> > I think this is more or less equivalent to the SRCU scheme that you
> > suggested, but all the work is done in userspace at the application level.
> >
> > If you see any problem with this scheme or don't understand it,
> > please let me know and I will try to explain it better.
> >
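To make sure I follow, here is a toy userspace model of the
alternating-snapshot handover you describe. All names below are mine and
purely illustrative - nothing here is a real fanotify or kernel interface:

```python
class ChangeTracker:
    """Toy model of the alternating-snapshot handover (illustrative only)."""

    def __init__(self):
        self.snapshots = {0: set()}  # persistent change records per snapshot id
        self.current = 0             # id of the "finalizing" snapshot T
        self.blessed = None          # id of the last snapshot ACKed by consumer

    def record_pre_write(self, path):
        # A PRE_WRITE records the intent to write in every live snapshot:
        # normally just T, but both T and T+1 once a query has started T+1.
        for changes in self.snapshots.values():
            changes.add(path)

    def start_query(self):
        # A changed-files query starts snapshot T+1; its results merge the
        # changes recorded in both T and T+1.
        nxt = self.current + 1
        self.snapshots[nxt] = set()
        return nxt, self.snapshots[self.current] | self.snapshots[nxt]

    def bless(self, snap_id):
        # The consumer ACKed the query results ("bless" T+1): then and only
        # then may older snapshots (T) be discarded.
        self.blessed = snap_id
        for old in [s for s in self.snapshots if s < snap_id]:
            del self.snapshots[old]
        self.current = snap_id


tracker = ChangeTracker()
tracker.record_pre_write("/mnt/a")   # change recorded in snapshot 0 (T)
q, results = tracker.start_query()   # starts snapshot 1 (T+1)
tracker.record_pre_write("/mnt/b")   # straddling write lands in both T and T+1
tracker.bless(q)                     # snapshot 0 may now be discarded
```

Note how the straddling write to /mnt/b survives in snapshot T+1 even after
T is discarded, so it cannot be lost across the handover.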
> 
> Hmm, I guess "crash safety" is not well defined.
> You and I were talking about a "system crash", and indeed, that was
> my only concern with a kernel implementation of the overlayfs watch.
> 
> But with a userspace HSM service, how can we guarantee that
> modifications did not happen while the service was down?
> 
> I don't really have a good answer for this.

Very good point!
 
> Thinking out loud: we would somehow need to deny all modifications by
> default, maybe through some mount property (e.g. MOUNT_ATTR_PROT_READ),
> causing the pre-write hooks to default to EROFS if there is no
> "vfs filter" mount mark.
> 
> Then it would be possible to expose a "safe" mount to users, where
> modifications can never go unnoticed, even when the HSM service
> crashes.

Yeah, something like this. Although bootstrapping this during mount may be a
bit challenging. But maybe not.
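For what it's worth, the deny-by-default semantics could be pictured like
this toy sketch (MOUNT_ATTR_PROT_READ is your proposed name; none of this is
an existing kernel interface, and the class and attribute names are mine):

```python
import errno


class ProtectedMount:
    """Toy model of a mount with the proposed deny-by-default property:
    pre-write hooks fail with EROFS unless a live "vfs filter" mark exists."""

    def __init__(self):
        self.hsm_mark_alive = False  # set while the HSM service holds its mark

    def pre_write_hook(self, path):
        if not self.hsm_mark_alive:
            return -errno.EROFS  # service down: modification denied
        return 0                 # service alive: write proceeds (and is logged)


m = ProtectedMount()
denied = m.pre_write_hook("/mnt/safe/file")   # no mark yet: -EROFS
m.hsm_mark_alive = True                       # HSM service attaches its mark
allowed = m.pre_write_hook("/mnt/safe/file")  # write may proceed: 0
```

The point being that a service crash drops the mark, which flips the mount
back to read-only rather than letting writes go unrecorded.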

Also, I'm thinking about other use cases - for HSM I agree we essentially
need to take the FS down if the userspace counterpart is not working. What
about other persistent change log use cases? Do we mandate that there is
only one "persistent change log" daemon in the system (or per filesystem?)
and that it must be running or we take the filesystem down? And that anybody
who wants reliable notifications needs to consume the service of this daemon?

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR


