On Tue, Nov 15, 2022 at 12:16 PM Jan Kara <jack@xxxxxxx> wrote: > > On Mon 14-11-22 22:08:16, Amir Goldstein wrote: > > On Mon, Nov 14, 2022 at 9:17 PM Jan Kara <jack@xxxxxxx> wrote: > > > > My understanding is that > > > > synchronize_srcu(&fsnotify_mark_srcu); > > > > is needed as barrier between 3) and 4) > > > > > > > > In any case, even if CHECKPOINT_PENDING can work, > > > > with or without FAN_MARK_SYNC, to me personally, understanding > > > > the proof of correctness of alternating groups model is very easy, > > > > while proving correctness for CHECKPOINT_PENDING model is > > > > something that I was not yet able to accomplish. > > > > > > I agree the scheme with CHECKPOINT_PENDING isn't easy to reason about but I > > > don't find your scheme with two groups simpler ;) Let me try to write down > > > rationale for my scheme, I think I can even somewhat simplify it: > > > > > > Write operation consists of: > > > generate PRE_WRITE on F > > > modify data of F > > > generate POST_WRITE on F > > > > > > Checkpoint consists of: > > > clear ignore marks > > > report files for which we received PRE_WRITE or POST_WRITE until this > > > moment > > > > > > And the invariant we try to provide is: If file F was modified during > > > checkpoint T, then we report F as modified during T or some later > > > checkpoint. Where it is matter of quality of implementation that "some > > > later checkpoint" isn't too much later than T but it depends on the > > > frequency of checkpoints, the length of notification queue, etc. so it is > > > difficult to give hard guarantees. > > > > > > And the argument why the scheme maintains the invariant is that if > > > POST_WRITE is generated after "clear ignore marks" finishes, it will get > > > delivered and thus F will get reported as modified in some checkpoint once > > > the event is processed. If POST_WRITE gets generated before "clear ignore > > > marks" finishes and F is among ignored inodes, it means F is already in > > > modified set so it will get reported as part of checkpoint T. Also > > > application will already see modified data when processing list of modified > > > files in checkpoint T. > > > > > > Simple and we don't even need PRE_WRITE here. But maybe you wanted to > > > provide stronger invariant? Like "you are not able to see modified data > > > without seeing F as modified?" But what exactly would be a guarantee here? > > > I feel I'm missing something here but I cannot quite grasp it at this > > > moment... > > > > > > > This is the very basic guarantee that the persistent change tracking snapshots > > need to provide. If a crash happens after modification is done and before > > modification is recorded, we won't detect the modification after reboot. > > Right, crash safety was the point I was missing ;) Thanks for reminding me. > And now I also see why you use filesystem freezing - it is a way to make > things crash safe as otherwise it is difficult to guard against a race > > generate PRE_WRITE for F > PRE_WRITE ignored because file is already > modified > checkpoint happens -> F reported as modified > contents of F fetched > > modify data > transaction commit > <crash> > POST_WRITE never seen so change to F is > never reported > > I just think filesystem freezing is too big hammer for widespread use of > persistent change tracking. Note that fsfreeze is also needed to flush dirty data after modify data, not only to wait for modify data transaction commit. Otherwise the fetched contents of F will differ from contents of F after reboot even if we did wait for POST_WRITE. However, in this case, file contents can be considered corrupted and rsync, for example, will not detect the change either, because mtime does match the previously fetched value. As long as applications write files safely (with rename) fsfreeze is not needed, but for strict change tracking, fsfreeze is needed, so fsfreeze is a policy decision of HSM. > Can't we introduce some SRCU lock / unlock into > file_start_write() / file_end_write() and then invoke synchronize_srcu() > during checkpoint after removing ignore marks? It will be much cheaper as > we don't have to flush all dirty data to disk and also writes can keep > flowing while we wait for outstanding writes straddling checkpoint to > complete. What do you think? Maybe, but this is not enough. Note that my patches [1] are overlapping fsnotify_mark_srcu with file_start_write(), so we would need to overlay fsnotify_mark_srcu with this new "modify SRCU". [1] https://github.com/amir73il/linux/commits/fanotify_pre_content > > The checkpoint would then do: > start gathering changes for both T and T+1 > clear ignore marks > synchronize_srcu() > stop gathering changes for T and report them > > And in this case we would not need POST_WRITE as an event. > Why then give up on the POST_WRITE events idea? Don't you think it could work? Or is it just because you think the generic API would be useful to others? > The technical problem I see is how to deal with AIO / io_uring because > SRCU needs to be released in the same context as it is acquired - that > would need to be consulted with Paul McKenney if we can make it work. And > another problem I see is that it might not be great to have this > system-wide as e.g. on networking filesystems or pipes writes can block for > really long. > > Final question is how to expose this to userspace because this > functionality would seem useful outside of filesystem notification space so > probably do not need to tie it to that. > > Or we could simplify our life somewhat and acquire SRCU when generating > PRE_WRITE and drop it when generating POST_WRITE. This would keep SRCU > within fsnotify and would mitigate the problems coming from system-wide > SRCU. OTOH it will create problems when PRE_WRITE gets generated and > POST_WRITE would not for some reason. Just branstorming here, I've not > really decided what's better... > What if checkpoint only acquired (and released) exclusive sb_writers without flushing dirty data. Wouldn't that be equivalent to the synchronize_srcu() you suggested? > > Maybe "checkpoint" was a bad name to use for this handover between > > two subsequent change tracking snapshots. > > I'm getting used to the terminology :) But to me it still seems more > natural to look at the situation as a single stream of events where we fetch > bulks of changes at certain moments, rather than looking at it as certain > entities collecting events for different time intervals. > I always used "snapshot take" as terminology. I just now started to use "checkpoint" for this userspace HSM implementation. I have no objection to single stream, nor to "flush all evictable" command. I will try to start with this implementation for POC. Thanks, Amir.