Re: thoughts about fanotify and HSM

On Fri 28-10-22 15:50:04, Amir Goldstein wrote:
> On Thu, Sep 22, 2022 at 1:48 PM Jan Kara <jack@xxxxxxx> wrote:
> >
> > > Questions:
> > > - What do you think about the direction this POC has taken so far?
> > > - Is there anything specific that you would like to see in the POC
> > >   to be convinced that this API will be useful?
> >
> > I think your POC is taking a good direction and your discussion with Dave
> > had made me more confident that this is all workable :). I liked your idea
> > of the wiki (or whatever form of documentation) that summarizes what we've
> > discussed in this thread. That would be actually pretty nice for future
> > reference.
> >
> 
> The current state of POC is that "populate on access" of both files
> and directories is working and "race free evict of file content" is also
> implemented (safely AFAIK).
> 
> The technique involving exclusive write lease is discussed at [1].
> In a nutshell, populate and evict synchronize on atomic i_writecount
> and this technique can be implemented with upstream UAPIs.

Not so much i_writecount AFAIU but the generic lease mechanism overall. But
yes, the currently existing APIs should be enough for your purposes.
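
For reference, a minimal sketch of how an exclusive lease can be taken from
userspace with the existing fcntl(2) F_SETLEASE interface. This is not the
POC code: with_exclusive_lease() and the 'action' callback are hypothetical
names, and the actual evict step is left to the caller. Note that the
default lease-break notification is SIGIO, which a real tool would have to
handle.

```python
import fcntl
import os


def with_exclusive_lease(path, action):
    """Run action(fd) while holding a write lease on path.

    Returns True if the lease was obtained and action ran, False otherwise.
    Hypothetical helper for illustration only.
    """
    setlease = getattr(fcntl, "F_SETLEASE", None)  # only defined on Linux
    if setlease is None:
        return False
    fd = os.open(path, os.O_RDWR)
    try:
        try:
            # Fails (e.g. with EAGAIN) if anybody else has the file open,
            # or with another error if the filesystem has no lease support.
            fcntl.fcntl(fd, setlease, fcntl.F_WRLCK)
        except OSError:
            return False
        try:
            action(fd)  # e.g. the evict step; supplied by the caller
        finally:
            fcntl.fcntl(fd, setlease, fcntl.F_UNLCK)  # drop the lease
        return True
    finally:
        os.close(fd)
```

The point of the lease is that a conflicting open() by another task makes
the kernel break the lease first, so the holder gets a chance to finish or
abort before the other task sees the file.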

> I did use persistent xattr marks for the POC, but this is not a must.
> Evictable inode marks would have worked just as well.

OK.

> Now I have started to work on persistent change tracking.
> For this, I have only kernel code, only lightly tested, but I did not
> prove yet that the technique is working.
> 
> The idea that I started to sketch at [2] is to alternate between two groups.
> 
> When a change is recorded, an evictable ignore mark will be added on the
> object.  To start recording changes from a new point in time
> (checkpoint), a new group will be created (with no ignore marks) and the
> old group will be closed.

What I dislike about the scheme with a handover between two groups is that
it is somewhat complex. Furthermore, requiring fs freezing for a checkpoint
is going to be rather expensive (and may be problematic if persistent
change tracking is used by potentially many unprivileged applications).

As a side note, I think it will be quite useful to be able to request a
checkpoint only for a subtree (e.g. some app may be interested only in a
particular subtree), and the scheme with two groups will make any
optimizations that benefit from this more difficult: either we create a
new group without ignore marks and then have to re-record changes nobody
actually needs, or we have to duplicate ignore marks, which is potentially
expensive as well.

Let's think about the race:

> To clarify, the race that I am trying to avoid is:
> 1. group B got a pre modify event and recorded the change before time T
> 2. The actual modification is performed after time T
> 3. group A does not get a pre modify event, so does not record the change
>     in the checkpoint since T

AFAIU you are worried about:

Task T				Change journal		App

write(file)
  generate pre_modify event
				record 'file' as modified
							Request changes
							Records 'file' contents
  modify 'file' data

...
							Request changes
							Nothing changed but
							App's view of 'file'
							is obsolete.
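
The same sequence as a toy model, just to make the ordering concrete. All
names here (ChangeJournal, simulate_race, the 'files' dict standing in for
the filesystem) are made up for illustration and have nothing to do with
the fanotify API.

```python
class ChangeJournal:
    """Toy journal: remembers which objects got a pre_modify event."""

    def __init__(self):
        self.modified = set()

    def pre_modify(self, name):
        self.modified.add(name)

    def take_changes(self):
        # Hand out everything recorded so far and start a new period.
        changed, self.modified = self.modified, set()
        return changed


def simulate_race():
    files = {"file": "old data"}   # stands in for the filesystem
    journal = ChangeJournal()
    app_view = {}                  # what the App has copied so far

    def app_request_changes():
        changed = journal.take_changes()
        for name in changed:
            app_view[name] = files[name]
        return changed

    journal.pre_modify("file")        # Task T: write() generates pre_modify
    first = app_request_changes()     # App: sees 'file', copies old data
    files["file"] = "new data"        # Task T: data is modified only now
    second = app_request_changes()    # App: journal reports nothing
    stale = app_view["file"] != files["file"]
    return first, second, stale
```

In this model the second request returns an empty set while the App still
holds the old contents, which is exactly the problem above.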

Can't we solve this by creating an async POST_WRITE event and then using it
like this:

1) Set state to CHECKPOINT_PENDING
2) In state CHECKPOINT_PENDING we record all received modify events into a
   separate 'transition' stream.
3) Remove ignore marks we need to remove.
4) Switch to new period & clear CHECKPOINT_PENDING, all events are now
   recorded to the new period.
5) Merge all events from 'transition' stream to both old and new period
   event streams.
6) Events get removed from the 'transition' stream only once we receive
   POST_WRITE event corresponding to the PRE_WRITE event recorded there (or
   on crash recovery). This way some events from 'transition' stream may
   get merged to multiple period event streams if the checkpoints are
   frequent and writes take long.

This should avoid the above race, should be relatively lightweight, and
does not require major API extensions.
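
A rough sketch of steps 1-6 as a userspace state machine, to show how the
'transition' stream behaves across checkpoints. This is only a model under
stated assumptions: ChangeTracker and its methods are hypothetical names,
each PRE_WRITE is paired with its POST_WRITE through a token, ignore marks
are modelled as a plain set, and no ignore mark is set for events recorded
while the checkpoint is pending. Crash recovery is not modelled.

```python
import itertools


class ChangeTracker:
    def __init__(self):
        self.period = 0
        self.streams = {0: set()}    # period -> objects recorded as changed
        self.transition = {}         # token -> object, PRE seen but no POST
        self.ignore_marks = set()    # objects whose events are suppressed
        self.checkpoint_pending = False
        self._tokens = itertools.count(1)

    def pre_write(self, obj):
        """PRE_WRITE event; returns a token, or None if it was ignored."""
        if obj in self.ignore_marks:
            return None              # event never reaches the tracker
        token = next(self._tokens)
        if self.checkpoint_pending:
            self.transition[token] = obj          # step 2
        else:
            self.streams[self.period].add(obj)
            self.ignore_marks.add(obj)            # recorded once per period
        return token

    def post_write(self, token):
        """POST_WRITE event matching an earlier PRE_WRITE (step 6)."""
        self.transition.pop(token, None)

    def begin_checkpoint(self):
        self.checkpoint_pending = True            # step 1
        self.ignore_marks.clear()                 # step 3

    def finish_checkpoint(self):
        old = self.period
        self.period += 1                          # step 4
        self.streams[self.period] = set()
        self.checkpoint_pending = False
        in_flight = set(self.transition.values())
        self.streams[old] |= in_flight            # step 5
        self.streams[self.period] |= in_flight
```

With this model, a write that starts while a checkpoint is pending and is
still in flight at the next checkpoint ends up in every period it overlaps,
and it stops being carried forward once its POST_WRITE arrives.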

BTW, while thinking about this I was wondering: How are the applications
using a persistent change journal going to deal with buffered vs direct IO?
I currently don't see a scheme that would not lose modifications for some
combinations...

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR


