Re: thoughts about fanotify and HSM

Amir Goldstein <amir73il@xxxxxxxxx> · Wed, 16 Nov 2022 18:24:06 +0200

> > > Can't we introduce some SRCU lock / unlock into
> > > file_start_write() / file_end_write() and then invoke synchronize_srcu()
> > > during checkpoint after removing ignore marks? It will be much cheaper as
> > > we don't have to flush all dirty data to disk and also writes can keep
> > > flowing while we wait for outstanding writes straddling checkpoint to
> > > complete. What do you think?
> >
> > Maybe, but this is not enough.
> > Note that my patches [1] are overlapping fsnotify_mark_srcu with
> > file_start_write(), so we would need to overlay fsnotify_mark_srcu
> > with this new "modify SRCU".
> >
> > [1] https://github.com/amir73il/linux/commits/fanotify_pre_content
>
> Yes, I know that and frankly, that is what I find somewhat ugly :) I'd rather
> have the "modify SRCU" cover the whole region we need - which means
> including the generation of PRE_MODIFY event.
>

Yeh, it would be great if we can pull this off.

> > > The checkpoint would then do:
> > > start gathering changes for both T and T+1
> > > clear ignore marks
> > > synchronize_srcu()
> > > stop gathering changes for T and report them
> > >
> > > And in this case we would not need POST_WRITE as an event.
> > >
> >
> > Why then give up on the POST_WRITE events idea?
> > Don't you think it could work?
>
> So as we are discussing, the POST_WRITE event is not useful when we want to
> handle crash safety. And if we have some other mechanism (like SRCU) which
> is able to guarantee crash safety, then what is the benefit of POST_WRITE?
> I'm not against POST_WRITE, I just don't see much value in it if we have
> another mechanism to deal with events straddling checkpoint.
>

Not sure I follow.

I think that crash safety can be achieved also with PRE/POST_WRITE:
- PRE_WRITE records an intent to write in persistent snapshot T
  and add to in-memory map of in-progress writes of period T
- When "checkpoint T" starts, new PRE_WRITES are recorded in both
  T and T+1 persistent snapshots, but event is added only to
  in-memory map of in-progress writes of period T+1
- "checkpoint T" ends when all in-progress writes of T are completed

The trick with alternating snapshots "handover" is this
(perhaps I never explained it and I need to elaborate on the wiki [1]):

[1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Modified_files_query

The changed files query results need to include recorded changes in both
"finalizing" snapshot T and the new snapshot T+1 that was started in
the beginning of the query.

Snapshot T MUST NOT be discarded until checkpoint/handover
is complete AND the query results that contain changes recorded
in T and T+1 snapshots have been consumed.

When the consumer ACKs that the query results have been safely stored
or acted upon (I called this operation "bless" snapshot T+1) then and
only then can snapshot T be discarded.

After snapshot T is discarded a new query will start snapshot T+2.
A changed files query result includes the id of the last blessed snapshot.

I think this is more or less equivalent to the SRCU that you suggested,
but all the work is done in userspace at application level.

If you see any problem with this scheme or don't understand it
please let me know and I will try to explain better.

> > > The technical problem I see is how to deal with AIO / io_uring because
> > > SRCU needs to be released in the same context as it is acquired - that
> > > would need to be consulted with Paul McKenney if we can make it work. And
> > > another problem I see is that it might not be great to have this
> > > system-wide as e.g. on networking filesystems or pipes writes can block for
> > > really long.
> > >
> > > Final question is how to expose this to userspace because this
> > > functionality would seem useful outside of filesystem notification space so
> > > probably do not need to tie it to that.
> > >
> > > Or we could simplify our life somewhat and acquire SRCU when generating
> > > PRE_WRITE and drop it when generating POST_WRITE. This would keep SRCU
> > > within fsnotify and would mitigate the problems coming from system-wide
> > > SRCU. OTOH it will create problems when PRE_WRITE gets generated and
> > > POST_WRITE would not for some reason. Just branstorming here, I've not
> > > really decided what's better...
> > >

Seems there are several non trivial challenges to surmount with this
"userspace modification SRCU" idea.

For now, I will stay in my comfort zone and try to make the POC
with PRE/POST_WRITE work and write the proof of correctness.

I will have no objection at all if you figure out how to solve those
issues and guide me to a path for implementing sb_write_srcu.
It will make the userspace implementation much simpler, getting rid
of the in-progress writes in-memory tracking.

> >
> > What if checkpoint only acquired (and released) exclusive sb_writers without
> > flushing dirty data.
> > Wouldn't that be equivalent to the synchronize_srcu() you suggested?
>
> In terms of guarantees it would be equivalent. In terms of impact on the
> system it will be considerably worse. Because SRCU allows new SRCU readers
> to start while synchronize_srcu() is running - so in our case new writes
> can freely run while we are waiting for pending writes to complete. So
> impact of the synchronize_srcu() on system activity will be practically
> unnoticeable. If we use sb_writers as you suggest, it will block all write
> activity until all writes finish. Which can be significant amount of time
> if you have e.g. write(1 GB of data) running.
>

Of course, it was a silly idea.

Thanks,
Amir.