Re: thoughts about fanotify and HSM

Amir Goldstein <amir73il@xxxxxxxxx> · Mon, 14 Nov 2022 22:08:16 +0200

On Mon, Nov 14, 2022 at 9:17 PM Jan Kara <jack@xxxxxxx> wrote:
>
> On Mon 07-11-22 16:13:37, Amir Goldstein wrote:
> > On Mon, Nov 7, 2022 at 1:10 PM Jan Kara <jack@xxxxxxx> wrote:
> > > > > Let's think about the race:
> > > > >
> > > > > > To clarify, the race that I am trying to avoid is:
> > > > > > 1. group B got a pre modify event and recorded the change before time T
> > > > > > 2. The actual modification is performed after time T
> > > > > > 3. group A does not get a pre modify event, so does not record the change
> > > > > >     in the checkpoint since T
> > > > >
> > > > > AFAIU you are worried about:
> > > > >
> > > > > Task T                          Change journal          App
> > > > >
> > > > > write(file)
> > > > >   generate pre_modify event
> > > > >                                 record 'file' as modified
> > > > >                                                         Request changes
> > > > >                                                         Records 'file' contents
> > > > >   modify 'file' data
> > > > >
> > > > > ...
> > > > >                                                         Request changes
> > > > >                                                         Nothing changed but
> > > > > App's view of 'file' is obsolete.
> > > > >
> > > > > Can't we solve this by creating POST_WRITE async event and then use it like:
> > > > >
> > > >
> > > > I like the idea of using POST_WRITE instead of holding sb_writers.
> > > >
> > > > > 1) Set state to CHECKPOINT_PENDING
> > > > > 2) In state CHECKPOINT_PENDING we record all received modify events into a
> > > > >    separate 'transition' stream.
> > > > > 3) Remove ignore marks we need to remove.
> > > >
> > > > Our customer use cases may have many millions of dirs.
> > > > I don't think this solution will be scalable, which is why I use the
> > > > alternating groups to invalidate all the existing ignore marks at once.
> > >
> > > I see. Well, we could also extend FAN_MARK_FLUSH so that you can just
> > > remove all ignore marks from a group so that you don't have to remove them
> > > one-by-one and don't have to switch to a new group. In principle group
> > > teardown does the same. It would allow large scale as well as small scale
> > > users use very similar scheme with single group for switching periods.
> > >
> >
> > Maybe so, I need to try it to see if it can scale.
> >
> > But note that in the alternating group scheme we do NOT need to wait for
> > the old group teardown to complete before returning the results of the query:
> > 1. group T+1 subscribes to some events with FAN_MARK_SYNC
> > 2. group T unsubscribes from those events
> >
> > Step 2 can be done in the background.
> > The query which returns all the files modified between checkpoints T..T+1
> > can already return the changes recorded by group T while group T is
> > shutting down and cleaning up all the evictable marks.
>
> I agree. With my scheme with a single group we first need to remove the
> ignore marks before we can report events for T and start collecting events
> for T+1. Total amount of work is going to be the same but latency of
> reporting is going to be higher. I'd just think that it would not be too bad
> but it needs to be measured. Also from the scheme you've described the
> declaration of a checkpoint didn't seem like a latency sensitive operation?
>

I guess not. I will look into it.

> > > > But I agree that alternating groups should not be a requirement for HSM
> > > > and that for watching smaller subtrees, your suggestion makes more sense.
> > > >
> > > > > 4) Switch to new period & clear CHECKPOINT_PENDING, all events are now
> > > > >    recorded to the new period.
> > > > > 5) Merge all events from 'transition' stream to both old and new period
> > > > >    event streams.
> > > > > 6) Events get removed from the 'transition' stream only once we receive
> > > > >    POST_WRITE event corresponding to the PRE_WRITE event recorded there (or
> > > > >    on crash recovery). This way some events from 'transition' stream may
> > > > >    get merged to multiple period event streams if the checkpoints are
> > > > >    frequent and writes take long.
> > > > >
> > > > > This should avoid the above race, should be relatively lightweight, and
> > > > > does not require major API extensions.
> > > > >
> > > >
> > > > If I am not mistaken, CHECKPOINT_PENDING vs. alternating groups
> > > > is an implementation detail for the HSM.
> > > >
> > > > PRE_WRITE/POST_WRITE and FAN_MARK_SYNC APIs are needed
> > > > for both the implementations (single group scheme needs to flush all
> > > > ignore marks with FAN_MARK_SYNC).
> > >
> > > So why would be FAN_MARK_SYNC needed for the single group scheme? From the
> > > kernel POV the scheme I have proposed does not require any new API changes
> > > besides the POST_WRITE event AFAICT. And possibly FAN_MARK_FLUSH tweak for
> > > more efficient removal of ignore marks. We don't even need the filesystem
> > > freezing (modulo the buffered vs direct IO discussion below).
> > >
> >
> > Maybe I'm wrong, but my understanding is that after:
> > 3) Remove ignore marks we need to remove.
> > a PRE_WRITE event may still be in send_to_group()
> > with one of the "removed" ignore marks and be ignored.
> >
> > So it is not safe to:
> > 4) Switch to new period & clear CHECKPOINT_PENDING
>
> Well, but we'd still get the POST_WRITE event, evaluate this as a write
> straddling checkpoint and include the file into the set of changed files
> for checkpoint T+1 or later. So I don't think synchronize_srcu() is needed
> anywhere?
>

You are probably right. I will double check.

> > My understanding is that
> > synchronize_srcu(&fsnotify_mark_srcu);
> > is needed as barrier between 3) and 4)
> >
> > In any case, even if CHECKPOINT_PENDING can work,
> > with or without FAN_MARK_SYNC, to me personally, understanding
> > the proof of correctness of alternating groups model is very easy,
> > while proving correctness for CHECKPOINT_PENDING model is
> > something that I was not yet able to accomplish.
>
> I agree the scheme with CHECKPOINT_PENDING isn't easy to reason about but I
> don't find your scheme with two groups simpler ;) Let me try to write down
> rationale for my scheme, I think I can even somewhat simplify it:
>
> Write operation consists of:
> generate PRE_WRITE on F
> modify data of F
> generate POST_WRITE on F
>
> Checkpoint consists of:
> clear ignore marks
> report files for which we received PRE_WRITE or POST_WRITE until this
> moment
>
> And the invariant we try to provide is: If file F was modified during
> checkpoint T, then we report F as modified during T or some later
> checkpoint. Where it is matter of quality of implementation that "some
> later checkpoint" isn't too much later than T but it depends on the
> frequency of checkpoints, the length of notification queue, etc. so it is
> difficult to give hard guarantees.
>
> And the argument why the scheme maintains the invariant is that if
> POST_WRITE is generated after "clear ignore marks" finishes, it will get
> delivered and thus F will get reported as modified in some checkpoint once
> the event is processed. If POST_WRITE gets generated before "clear ignore
> marks" finishes and F is among ignored inodes, it means F is already in
> modified set so it will get reported as part of checkpoint T. Also
> application will already see modified data when processing list of modified
> files in checkpoint T.
>
> Simple and we don't even need PRE_WRITE here. But maybe you wanted to
> provide stronger invariant? Like "you are not able to see modified data
> without seeing F as modified?" But what exactly would be a guarantee here?
> I feel I'm missing something here but I cannot quite grasp it at this
> moment...
>

This is the very basic guarantee that the persistent change tracking snapshots
need to provide. If a crash happens after modification is done and before
modification is recorded, we won't detect the modification after reboot.

Maybe "checkpoint" was a bad name to use for this handover between
two subsequent change tracking snapshots.

> > > > I am going to try to implement the PRE/POST_WRITE events and for
> > > > POC I may start with a single group because it may be easier or I may
> > > > implement both schemes, we'll see.
> > >
> > > That would be great. Thank you. Perhaps I'm missing something in my mental
> > > model which would make things impractical :)
> > >
> >
> > Me too. I won't know before I try.
> >
> > FYI, at the moment I am considering not allowing independent
> > subscription for PRE/POST_XXX events, but only subscribe to
> > XXX events and a group with class FAN_CLASS_VFS_FILTER
> > will get both PRE/POST_XXX and won't be able to subscribe
> > to XXX events that do not have PRE_XXX events.
> >
> > The rationale is that if a group subscribes to either PRE/POST_XXX
> > XXX() operation is not going to be on the fast path anyway.
> >
> > This will make it easier to support more PRE/POST_XXX events
> > without using up all the remaining 32bits namespace.
> >
> > Using the high 32bits of mask for PRE events and folding them
> > in the object interest mask with the low 32bit is another thing
> > that I was considering in case you would prefer to allow
> > independent subscription for PRE/POST_XXX events.
>
> So one question: Do you see anything else than POST_WRITE as being useful?
> For directory operations it seems pointless as they hold i_rwsem exclusively
> so I don't see useful distinction between PRE and POST event. For OPEN and
> CLOSE I don't see use either. ACCESS might be the only one where PRE and
> POST would both be useful for something.
>

PRE_ACCESS is used to populate missing data and POST_ACCESS
is irrelevant for that.

PRE_MODIFY is used for something completely different, it is used
for the persistent change tracking and this has to be crash safe, so
exclusive i_rwsem has nothing to do with it.

PRE_MODIFY is called before i_rwsem and before sb_start_write()
so that HSM can record the change intent in the same filesystem
that the change is going to happen (this is a journaled record).

I feel I may have not explained this correctly.
Does this make sense?

Thanks,
Amir.