On Fri 28-10-22 15:50:04, Amir Goldstein wrote:
> On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@xxxxxxx> wrote:
> >
> > > Questions:
> > > - What do you think about the direction this POC has taken so far?
> > > - Is there anything specific that you would like to see in the POC
> > >   to be convinced that this API will be useful?
> >
> > I think your POC is taking a good direction and your discussion with Dave
> > had made me more confident that this is all workable :). I liked your idea
> > of the wiki (or whatever form of documentation) that summarizes what we've
> > discussed in this thread. That would be actually pretty nice for future
> > reference.
> >
>
> The current state of POC is that "populate of access" of both files
> and directories is working and "race free evict of file content" is also
> implemented (safely AFAIK).
>
> The technique involving exclusive write lease is discussed at [1].
> In a nutshell, populate and evict synchronize on atomic i_writecount
> and this technique can be implemented with upstream UAPIs.

Not so much i_writecount AFAIU but the generic lease mechanism overall. But
yes, the currently existing APIs should be enough for your purposes.

> I did use persistent xattr marks for the POC, but this is not a must.
> Evictable inode marks would have worked just as well.

OK.

> Now I have started to work on persistent change tracking.
> For this, I have only kernel code, only lightly tested, but I did not
> prove yet that the technique is working.
>
> The idea that I started to sketch at [2] is to alternate between two groups.
>
> When a change is recorded, an evictable ignore mark will be added on the
> object. To start recording changes from a new point in time
> (checkpoint), a new group will be created (with no ignore marks) and the
> old group will be closed.
So what I dislike about the scheme with handover between two groups is that
it is somewhat complex and, furthermore, requiring fs freezing for a
checkpoint is going to be rather expensive (and may be problematic if
persistent change tracking is used by potentially many unprivileged
applications).

As a side note, I think it will be quite useful to be able to request a
checkpoint only for a subtree (e.g. some app may be interested only in a
particular subtree), and the scheme with two groups will make any
optimizations to benefit from such a fact more difficult - either we create
a new group without ignore marks and then have to re-record changes nobody
actually needs, or we have to duplicate ignore marks, which is potentially
expensive as well.

Let's think about the race:

> To clarify, the race that I am trying to avoid is:
> 1. group B got a pre modify event and recorded the change before time T
> 2. The actual modification is performed after time T
> 3. group A does not get a pre modify event, so does not record the change
>     in the checkpoint since T

AFAIU you are worried about:

Task T                       Change journal             App

write(file)
  generate pre_modify event
                             record 'file' as modified
                                                        Request changes
                                                        Records 'file' contents
  modify 'file' data

...
                                                        Request changes
                                                        Nothing changed but App's
                                                        view of 'file' is obsolete.

Can't we solve this by creating a POST_WRITE async event and then use it like:

1) Set state to CHECKPOINT_PENDING
2) In state CHECKPOINT_PENDING we record all received modify events into a
   separate 'transition' stream.
3) Remove ignore marks we need to remove.
4) Switch to new period & clear CHECKPOINT_PENDING, all events are now
   recorded to the new period.
5) Merge all events from 'transition' stream to both old and new period
   event streams.
6) Events get removed from the 'transition' stream only once we receive
   POST_WRITE event corresponding to the PRE_WRITE event recorded there (or
   on crash recovery).
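The six steps above can be modelled in userspace as a small state machine.
What follows is only a sketch under stated assumptions, not kernel code and
not an existing fanotify API: the class ChangeJournal and its method names
are hypothetical, POST_WRITE is the event proposed above rather than an
upstream one, ignore marks are reduced to a plain set, step 3 drops all
marks instead of a chosen subset, and events are matched by path only.

```python
class ChangeJournal:
    """Toy userspace model of the checkpoint steps; not kernel code."""

    def __init__(self):
        self.pending = False       # True while in CHECKPOINT_PENDING
        self.period = 0            # index of the current period
        self.streams = {0: set()}  # period -> paths recorded as modified
        self.transition = {}       # path -> PRE_WRITEs awaiting POST_WRITE
        self.ignored = set()       # stand-in for evictable ignore marks

    def pre_write(self, path):
        if path in self.ignored:
            return  # an ignore mark suppresses delivery of the event
        if self.pending:
            # Step 2: park the event in the 'transition' stream.
            self.transition[path] = self.transition.get(path, 0) + 1
            return
        # Normal recording: note the change, then ignore repeats.
        self.streams[self.period].add(path)
        self.ignored.add(path)

    def post_write(self, path):
        # Step 6: drop a transition entry once its write has completed.
        # Matching is by path only, which is a simplification.
        count = self.transition.get(path, 0)
        if count > 1:
            self.transition[path] = count - 1
        else:
            self.transition.pop(path, None)

    def begin_checkpoint(self):
        self.pending = True  # Step 1

    def finish_checkpoint(self):
        self.ignored.clear()  # Step 3 (assumption: remove every mark)
        old = self.period
        self.period = old + 1  # Step 4: switch to the new period
        self.streams[self.period] = set()
        self.pending = False
        for path in self.transition:  # Step 5: merge into old and new
            self.streams[old].add(path)
            self.streams[self.period].add(path)
        return old


j = ChangeJournal()
j.begin_checkpoint()
j.pre_write("b")       # arrives while CHECKPOINT_PENDING
j.finish_checkpoint()  # 'b' is merged into periods 0 and 1
```

In this model, an entry stays in the transition stream until post_write()
is called for it, so every finish_checkpoint() before that point merges it
again into the period being closed and the period being opened.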
This way some events from the 'transition' stream may get merged to multiple
period event streams if the checkpoints are frequent and writes take long.

This should avoid the above race, should be relatively lightweight, and
does not require major API extensions.

BTW, while thinking about this I was wondering: How are the applications
using persistent change journal going to deal with buffered vs direct IO? I
currently don't see a scheme that would not lose modifications for some
combinations...

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR