Re: [LSF/MM TOPIC] Filesystem Change Journal API

Amir Goldstein <amir73il@xxxxxxxxx> · Thu, 25 Jan 2018 20:33:22 +0200

On Thu, Jan 25, 2018 at 6:52 PM, Jan Kara <jack@xxxxxxx> wrote:
> On Thu 25-01-18 18:26:21, Amir Goldstein wrote:
>> On Thu, Jan 25, 2018 at 5:45 PM, Jan Kara <jack@xxxxxxx> wrote:
>> >> A more generic approach, for filesystems with no USN-like information,
>> >> would be to provide an external change journal facility, much like what
>> >> JBD2 does, but not in the block level. This facility could hook as a
>> >> consumer of filesystem notifications as an fsnotify backend and provide
>> >> record and enumerate capabilities for filesystem operations.
>> >>
>> >> With the external change journal approach, care would have to be taken to
>> >> account for the fact that filesystem changes become persistent later than
>> >> the time they are reported to fsnotify, so at least a transaction commit
>> >> event (with USN) would need to be reported to fsnotify.
>> >
>> > Frankly, this is very hard and I'm not sure you can make it both race free
>> > and fs agnostic. I actually think it would be enough if we provided
>> > guaranteed persistence & consistency across clean reboots. In case of
>> > crashes we would just need to flag a forced rescan-the-world event for
>> > users of the API - again this is pretty much what FSEvents does.
>> >
>>
>> The requirement from my employer that drives the need for persistent change
>> log in filesystem/kernel is that rescan-the-world takes way too much time.
>> So rescan-the-world cannot be the answer to persistent change log requirement.
>> There are just too many files in this day and age...
>
> I can understand that but then you are basically bound to solutions that
> tie directly into filesystem's consistency tracking machinery (be it
> journalling, COW-like methods, or anything else). I.e., you have to
> implement the change journal independently for each filesystem. And also
> live with the fact that some filesystems will never support this because
> they cannot achieve such consistency guarantees.
>

I can live with that -
fanotify can be used on *all* file systems to track changes.
A fanotify change journal requires some low-level fs support.
Implementing an external generic journal may be hard and I can't
say if we will ever get there, but I'm not convinced it is not possible.

>> >> The user API to retrieve change journal information should be standard,
>> >> whether the change journal is a built in filesystem feature or using the
>> >> external change journal. The fanotify API is a good candidate for change
>> >> journal API, because it already defines a standard way of reporting
>> >> filesystem changes. Naturally, the API would have to be extended to cater
>> >> the needs of a change journal API and would require user to explicitly
>> >> opt-in for the new API (e.g. FAN_CLASS_CHANGE_JOURNAL).
>> >
>> > So I actually believe the persistence would be the easiest to handle
>> > completely in userspace as a daemon + library to access it. The daemon
>> > could use fanotify + database file for storage for filesystems which don't
>> > have built in persistent change log and hook into filesystem specific
>> > facility where it knows how to...
>> >
>>
>> Sure, whatever could be done by userspace is better.  The user of kernel
>> change journal API *is* that change db application, (e.g.  which decides
>> which files need to be synced to the cloud). It just can't afford to
>> rescan-the-world on non clean shutdown.
>>
>> I believe that in the absence of an external change journal implementation,
>> the minimal requirement from filesystem is to provide an inode iterator and
>> some sort of USN-like property that can be used to filter 'changes since USN'.
>
> Well, for the sizes of filesystems you speak about here, is really a
> bulkstat of the whole filesystem viable? I know it is way faster than
> scanning through directory hierarchy but still...
>

As you said, it's a start. It's much faster. I would like to do better.
I do have an ace up my sleeve. It's the overlayfs snapshot.
It's not quite like fanotify nor NTFS Change Journal, but it provides
a map of inodes modified since the snapshot was taken, indexed by file handles.
Still not sure how it all adds up to a unified API that can be served
with a change journal library.

>> This fits well to XFS's bulkstat API and the inode LSN metadata.
>> XFS is my target filesystem anyway, so I could go ahead and use those FS
>> specific APIs, but would like to start with looking at all other requirements
>> and what information other filesystems can provide and try to design an API
>> that could work with several filesystems and at least make a future generic
>> implementation possible.
>
> Do you really need LSN in the above scheme? Would not mtime + i_version be
> enough for your purposes? That should be much easier to get among
> filesystems...
>

I was under the impression that i_version is not persistent.
mtime is not reliable, because the user can change it (maybe ctime).
The appeal of LSN is that it really provides an ordering guarantee.
But yeah, that's just one more option.
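
To make the mtime vs. LSN point concrete, here is a rough userspace
sketch of the 'changes since USN' filter. Everything in it is
hypothetical: InodeRecord stands in for whatever a bulkstat-like inode
iterator would return, and 'lsn' for any USN-like monotonic value. It is
not an existing kernel or XFS interface.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class InodeRecord:
    """Hypothetical per-inode record from a bulkstat-like iterator."""
    ino: int
    mtime: float  # user-settable timestamp (e.g. via utimes)
    lsn: int      # assumed monotonic, filesystem-assigned sequence number


def changed_since_lsn(records: Iterable[InodeRecord],
                      last_seen_lsn: int) -> List[int]:
    """Return inode numbers whose last change is newer than last_seen_lsn."""
    return [r.ino for r in records if r.lsn > last_seen_lsn]


def changed_since_mtime(records: Iterable[InodeRecord],
                        last_scan_time: float) -> List[int]:
    """Same filter keyed on mtime, shown only for comparison."""
    return [r.ino for r in records if r.mtime > last_scan_time]
```

If a file is modified and its mtime is then set back into the past, the
mtime filter misses it, while the LSN filter still reports it, because
the sequence number only moves forward.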

I'm interested to find out if the XFS/btrfs internal intent logs
could be used to provide enough information on which inodes have
been changed since the oldest LSN in the log.
If it is possible, then that could be a best effort implementation
of change journal - change tracking application can examine the
intent log before mounting the filesystem. If it was lucky enough
to keep up with online changes up to the oldest LSN, then
bulkstat-the-world could be avoided.
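
The decision the application would make can be sketched like this,
assuming the intent log could be read as (lsn, inode) pairs. That is
exactly the open question above, so the input format and the function
name are made up for illustration.

```python
from typing import Iterable, Optional, Set, Tuple


def inodes_to_resync(log_entries: Iterable[Tuple[int, int]],
                     last_synced_lsn: int) -> Optional[Set[int]]:
    """Best-effort change detection from a (hypothetical) intent log dump.

    log_entries: (lsn, ino) pairs still present in the on-disk log.
    last_synced_lsn: highest LSN the application has already processed.

    Returns the set of inodes to re-examine, or None when the log does
    not reach back far enough and a full bulkstat scan is required.
    """
    entries = sorted(log_entries)
    if not entries:
        # Nothing in the log to reason about: fall back to a full scan.
        return None
    oldest_lsn = entries[0][0]
    if last_synced_lsn < oldest_lsn:
        # Changes between last_synced_lsn and oldest_lsn may already
        # have been dropped from the log, so the log is incomplete.
        return None
    return {ino for lsn, ino in entries if lsn > last_synced_lsn}
```

A None result means bulkstat-the-world; an empty set means the
application was already up to date with everything in the log.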

Thanks,
Amir.