Re: thoughts about fanotify and HSM

Jan Kara <jack@xxxxxxx> · Wed, 23 Nov 2022 11:10:21 +0100

On Wed 16-11-22 18:24:06, Amir Goldstein wrote:
> > > Why then give up on the POST_WRITE events idea?
> > > Don't you think it could work?
> >
> > So as we are discussing, the POST_WRITE event is not useful when we want to
> > handle crash safety. And if we have some other mechanism (like SRCU) which
> > is able to guarantee crash safety, then what is the benefit of POST_WRITE?
> > I'm not against POST_WRITE, I just don't see much value in it if we have
> > another mechanism to deal with events straddling checkpoint.
> >
> 
> Not sure I follow.
> 
> I think that crash safety can be achieved also with PRE/POST_WRITE:
> - PRE_WRITE records an intent to write in persistent snapshot T
>   and add to in-memory map of in-progress writes of period T
> - When "checkpoint T" starts, new PRE_WRITES are recorded in both
>   T and T+1 persistent snapshots, but event is added only to
>   in-memory map of in-progress writes of period T+1
> - "checkpoint T" ends when all in-progress writes of T are completed

So maybe I am missing something, but consider the situation I mentioned a
few emails earlier:

PRE_WRITE for F			-> F recorded as modified in T
modify F
POST_WRITE for F

PRE_WRITE for F			-> ignored because F is already marked as
				   modified

				-> checkpoint T requested, modified files
				   reported, process modified files
modify F
--------- crash

Now unless filesystem freeze or SRCU is part of checkpoint, we will never
notify about the last modification to F. So I don't see how PRE +
POST_WRITE alone can achieve crash safety...
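
To make the timeline above concrete, here is a toy Python model of it. It is
only a sketch under my own assumptions: the Tracker class, its methods and
run_scenario() are hypothetical names, not part of fanotify or of any daemon
discussed here, and the checkpoint deliberately does no freeze or SRCU-style
wait.

```python
class Tracker:
    """Hypothetical change tracker that deduplicates PRE_WRITE per period."""

    def __init__(self):
        self.period = 0
        self.modified = {0: set()}  # period -> files recorded as modified

    def pre_write(self, path):
        # A second PRE_WRITE for an already-marked file changes nothing.
        self.modified[self.period].add(path)

    def checkpoint(self):
        # Report the finishing period and open the next one.
        # Assumption: no waiting for writes that are still in flight.
        reported = self.modified[self.period]
        self.period += 1
        self.modified[self.period] = set()
        return reported


def run_scenario():
    t = Tracker()
    t.pre_write("F")           # PRE_WRITE for F -> F recorded in T
    # modify F, POST_WRITE for F
    t.pre_write("F")           # PRE_WRITE for F -> ignored, already marked
    reported = t.checkpoint()  # checkpoint T: F reported and processed
    # modify F  <- data lands after the consumer has processed F
    # crash: only the recorded sets survive
    return reported, t.modified[t.period]


reported_in_t, recorded_in_t_plus_1 = run_scenario()
print(reported_in_t, recorded_in_t_plus_1)
```

F is reported once in T, and nothing is recorded in T+1, so the final
modification is never announced to the consumer.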

And if we use filesystem freeze or SRCU as part of checkpoint, then
processing of POST_WRITE events does not give us anything new. E.g.
synchronize_srcu() during checkpoint, before handing out the list of modified
files, makes sure all modifications to files for which PRE_MODIFY events
were generated (and thus are listed as modified in checkpoint T) are
visible to userspace.
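
As a rough userspace analogy of what the srcu_read_lock() /
synchronize_srcu() pairing would buy us, here is a minimal Python sketch.
The WriteBarrier class and demo() are hypothetical illustrations, not kernel
code and not how SRCU is implemented internally; the only point is that the
checkpoint cannot return while a write that started before it is still
running.

```python
import threading
import time


class WriteBarrier:
    """Hypothetical analogy: writers hold a read-side section, the
    checkpoint waits for every writer that entered before it."""

    def __init__(self):
        self._cond = threading.Condition()
        self._epoch = 0
        self._active = {0: 0}  # epoch -> number of writes in flight

    def write_begin(self):
        with self._cond:
            epoch = self._epoch
            self._active[epoch] += 1
            return epoch

    def write_end(self, epoch):
        with self._cond:
            self._active[epoch] -= 1
            if self._active[epoch] == 0:
                self._cond.notify_all()

    def synchronize(self):
        # Flip the epoch so new writers do not delay us, then wait for
        # all writers of the old epoch to finish.
        with self._cond:
            old = self._epoch
            self._epoch += 1
            self._active[self._epoch] = 0
            while self._active[old] > 0:
                self._cond.wait()
            del self._active[old]


def demo():
    barrier = WriteBarrier()
    log = []
    started = threading.Event()

    def writer():
        epoch = barrier.write_begin()   # "PRE_WRITE" already generated
        started.set()
        time.sleep(0.05)                # the write itself takes a while
        log.append("write done")
        barrier.write_end(epoch)

    thread = threading.Thread(target=writer)
    thread.start()
    started.wait()
    barrier.synchronize()               # checkpoint waits for the writer
    log.append("checkpoint")
    thread.join()
    return log
```

Calling demo() should always yield the write before the checkpoint, because
synchronize() only returns after the in-flight writer has called write_end().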

So am I missing some case where POST_WRITE would be more useful than SRCU?
Because at this point I'd rather implement SRCU than POST_WRITE.

> The trick with alternating snapshots "handover" is this
> (perhaps I never explained it and I need to elaborate on the wiki [1]):
> 
> [1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Modified_files_query
> 
> The changed files query results need to include recorded changes in both
> "finalizing" snapshot T and the new snapshot T+1 that was started in
> the beginning of the query.
> 
> Snapshot T MUST NOT be discarded until checkpoint/handover
> is complete AND the query results that contain changes recorded
> in T and T+1 snapshots have been consumed.
> 
> When the consumer ACKs that the query results have been safely stored
> or acted upon (I called this operation "bless" snapshot T+1) then and
> only then can snapshot T be discarded.
> 
> After snapshot T is discarded a new query will start snapshot T+2.
> A changed files query result includes the id of the last blessed snapshot.
> 
> I think this is more or less equivalent to the SRCU that you suggested,
> but all the work is done in userspace at application level.
> 
> If you see any problem with this scheme or don't understand it
> please let me know and I will try to explain better.
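
For reference while reading the reply below, the quoted handover scheme can
be sketched roughly as follows. This is a toy Python model built only from
the description above; ChangeLog, record(), query() and bless() are
hypothetical names, and recording into every live snapshot is my simplified
reading of "recorded in both T and T+1".

```python
class ChangeLog:
    """Hypothetical daemon-side log with alternating snapshots."""

    def __init__(self):
        self.current = 0           # newest snapshot id
        self.blessed = 0           # last snapshot the consumer ACKed
        self.snapshots = {0: set()}

    def record(self, path):
        # During a handover both T and T+1 are live and both get the record.
        for files in self.snapshots.values():
            files.add(path)

    def query(self):
        # Start T+1 only if no handover is pending, so a consumer that
        # crashed before blessing can simply repeat the query.
        if len(self.snapshots) == 1:
            self.current += 1
            self.snapshots[self.current] = set()
        changed = set().union(*self.snapshots.values())
        return self.current, self.blessed, changed

    def bless(self, snapshot_id):
        # Consumer ACK: only now may older snapshots be discarded.
        for old in [s for s in self.snapshots if s < snapshot_id]:
            del self.snapshots[old]
        self.blessed = snapshot_id


log = ChangeLog()
log.record("a")
first = log.query()     # starts snapshot 1, reports changes of 0 and 1
log.record("b")         # arrives while the handover is pending
retry = log.query()     # consumer retried before blessing: nothing lost
log.bless(1)            # snapshot 0 can be discarded now
after = log.query()     # starts snapshot 2
```

In this model a change that arrives during the handover is reported again in
the next query, which errs on the side of over-reporting rather than losing
a modification.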

So until now I was imagining that query results would be returned like one
big memcpy. I.e. a one-off event where the "persistent log daemon" hands over
the whole contents of checkpoint T to the client. Whatever happens with the
returned data is the business of the client, whatever happens with the
checkpoint T records in the daemon is the daemon's business. The model you
seem to speak about here is somewhat different - more like readdir() kind
of approach where client asks for access to checkpoint T data, daemon
provides the data record by record (probably serving the data from its
files on disk), and when the client is done and "closes" checkpoint T,
daemon's records about checkpoint T can be erased. Am I getting it right?

This however seems somewhat orthogonal to the SRCU idea. SRCU essentially
serves a single purpose: to make sure that modifications to all files for
which we have received a PRE_WRITE event are visible in the respective files.

> > > > The technical problem I see is how to deal with AIO / io_uring because
> > > > SRCU needs to be released in the same context as it is acquired - that
> > > > would need to be consulted with Paul McKenney if we can make it work. And
> > > > another problem I see is that it might not be great to have this
> > > > system-wide as e.g. on networking filesystems or pipes writes can block for
> > > > really long.
> > > >
> > > > Final question is how to expose this to userspace because this
> > > > functionality would seem useful outside of filesystem notification space so
> > > > probably do not need to tie it to that.
> > > >
> > > > Or we could simplify our life somewhat and acquire SRCU when generating
> > > > PRE_WRITE and drop it when generating POST_WRITE. This would keep SRCU
> > > > within fsnotify and would mitigate the problems coming from system-wide
> > > > SRCU. OTOH it will create problems when PRE_WRITE gets generated and
> > > > POST_WRITE would not for some reason. Just brainstorming here, I've not
> > > > really decided what's better...
> 
> Seems there are several non trivial challenges to surmount with this
> "userspace modification SRCU" idea.
> 
> For now, I will stay in my comfort zone and try to make the POC
> with PRE/POST_WRITE work and write the proof of correctness.
> 
> I will have no objection at all if you figure out how to solve those
> issues and guide me to a path for implementing sb_write_srcu.
> It will make the userspace implementation much simpler, getting rid
> of the in-progress writes in-memory tracking.

It seems you have progressed on this front yourself so let's continue there
:).

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR