Re: [LSF/MM TOPIC] Filesystem Change Journal API

On Jan 25, 2018, at 8:45 AM, Jan Kara <jack@xxxxxxx> wrote:
> 
> Hi Amir!
> 
> On Mon 22-01-18 11:18:49, Amir Goldstein wrote:
>> Change Journal [1] (a.k.a USN) is a popular feature of NTFS v3, used by
>> backup and indexing applications to monitor changes to a file system
>> in a reliable, durable and scalable manner.
>> 
>> Linux is lagging behind Windows w.r.t. those capabilities by two decades,
>> and it is not because of a lack of demand for the feature. I dare to make a
>> wild guess that there are many more file servers nowadays running on
>> Linux than there are file servers running on Windows, and the scale of
>> changes to track has only increased over the years.
> 
> Not only Windows but also MacOS which has FSEvents API [1].
> 
>> On LSF/MM 2017, I presented "fanotify super block watch" [2], which
>> addresses the scalability issues of inotify when tracking changes over
>> millions of directories. This work is running in production now, but is
>> not yet ready for upstream submission.
> 
> Actually I'd be interested in addressing fanotify shortcomings first before
> adding even more complexity with persistence... Adding directory events in
> the form 'something has changed' should be straightforward and good enough
> (this is the granularity of information FSEvents API from MacOS provides as
> well). We also need some way to overcome namespace issues, so that
> unshare(2) is not enough to hide your changes from mountpoint watches.
> 
>> This year, I would like to discuss solutions to address the reliability
>> and durability aspects of Linux filesystem change tracking.
>> 
>> Some Linux filesystems are already journaling everything (e.g. ubifs),
>> so providing the Change Journal feature to applications is probably just
>> a matter of providing an API to retrieve the latest USN and enumerate the
>> changes within a given USN range.
>> 
>> Some Linux filesystems store USN-like information in metadata, but it is
>> not exposed to userspace in a standard way that could be used by change
>> tracking applications. For example, XFS stores an LSN (transaction id) in
>> its inodes, so it should be possible to enumerate the inodes that changed
>> since the last queried LSN value.
>> 
>> A more generic approach, for filesystems with no USN-like information,
>> would be to provide an external change journal facility, much like what
>> JBD2 does, but not in the block level. This facility could hook as a
>> consumer of filesystem notifications as an fsnotify backend and provide
>> record and enumerate capabilities for filesystem operations.
>> 
>> With the external change journal approach, care would have to be taken to
>> account for the fact that filesystem changes become persistent later than
>> the time they are reported to fsnotify, so at least a transaction commit
>> event (with USN) would need to be reported to fsnotify.
> 
> Frankly, this is very hard and I'm not sure you can make it both race-free
> and fs-agnostic. I actually think it would be enough if we provided
> guaranteed persistence & consistency across clean reboots. In case of
> crashes we would just need to flag a forced rescan-the-world event for
> users of the API - again, this is pretty much what FSEvents does.

IMHO, if you have to rescan the whole filesystem on any unclean reboot
then there is hardly any point in having a persistent changelog at all.
You may as well just do the scan and use in-memory notification events,
and save the extra IO overhead of logging the change records to disk.

Doing a full namespace scan of a large filesystem can take hours, and
hurts application performance the whole time.  Doing bulk scanning of
the inode table (e.g. like https://github.com/ORNL-TechInt/lester) is
faster than a namespace scan, but should only need to be done in
exceptional circumstances.
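
As a toy illustration of the bulk-scan idea (all names here are invented, not
an existing kernel API), enumerating changed inodes could amount to walking
the inode table and comparing each inode's stored LSN against the last
queried value:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for an inode-table entry; a real
 * implementation would read the LSN field from the on-disk inode. */
struct inode_rec {
    uint64_t ino;
    uint64_t lsn;   /* transaction id of the last change to this inode */
};

/* Collect into 'out' the inodes modified after 'last_lsn'; returns the
 * count.  This is the bulk-scan shape: walk the inode table
 * sequentially instead of walking the namespace. */
static size_t scan_changed(const struct inode_rec *tbl, size_t n,
                           uint64_t last_lsn, uint64_t *out, size_t max)
{
    size_t cnt = 0;

    for (size_t i = 0; i < n && cnt < max; i++)
        if (tbl[i].lsn > last_lsn)
            out[cnt++] = tbl[i].ino;
    return cnt;
}
```

The point is only that the scan is sequential over inodes rather than over
the namespace, which is why it is so much faster on large filesystems.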

>> The user API to retrieve change journal information should be standard,
>> whether the change journal is a built in filesystem feature or using the
>> external change journal. The fanotify API is a good candidate for change
>> journal API, because it already defines a standard way of reporting
>> filesystem changes. Naturally, the API would have to be extended to cater
>> to the needs of a change journal and would require users to explicitly
>> opt in to the new API (e.g. FAN_CLASS_CHANGE_JOURNAL).
> 
> So I actually believe the persistence would be the easiest to handle
> completely in userspace as a daemon + library to access it. The daemon
> could use fanotify + database file for storage for filesystems which don't
> have built in persistent change log and hook into filesystem specific
> facility where it knows how to...

Similarly, this is not very useful IMHO, as the userspace log can easily
become inconsistent in case of a crash, a disconnected disk, etc.

The main benefit of journaling the changelog records as part of the original
filesystem operation is that they are transactionally updated and cannot
become out of sync if there is a crash (at worst, a filesystem operation is
not committed after a failure, and the corresponding changelog records that
describe it are also not committed).

With a userspace daemon + database it is possible that updates are made to
the filesystem, but logs not written to the database, resulting in files not
being backed up (or whatever).  If the database is written on a separate
device it is also possible that logs are written but the changes were lost.


One option would be to have something similar to quota, where there is a
library called from within the filesystem that writes the log records to
log files, but it is done within the context of the filesystem transaction.
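
To make the quota-like hook concrete, here is a toy model (every name in it
is invented for illustration) of staging the changelog record inside the same
transaction as the operation it describes, so that an abort or crash before
commit discards both together:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model (all names invented) of appending a changelog record from
 * inside the filesystem transaction, quota-style, so the record and the
 * operation it describes become durable together or not at all. */

#define MAX_REC 16

struct changelog {
    uint64_t usn[MAX_REC];  /* records that made it to stable storage */
    size_t   count;
};

struct txn {
    uint64_t pending_usn;   /* changelog record staged in this txn */
    int      fs_op_done;    /* the metadata change staged with it */
};

/* Called by the fs while the transaction is open, like a quota update. */
static void txn_log_change(struct txn *t, uint64_t usn)
{
    t->pending_usn = usn;
    t->fs_op_done = 1;
}

/* Commit makes the operation and its changelog record durable together. */
static void txn_commit(struct txn *t, struct changelog *log)
{
    if (t->fs_op_done && log->count < MAX_REC)
        log->usn[log->count++] = t->pending_usn;
    t->fs_op_done = 0;
}

/* A crash or abort before commit discards the whole transaction:
 * neither the fs change nor its record reaches the log. */
static void txn_abort(struct txn *t)
{
    memset(t, 0, sizeof(*t));
}
```

Because the record only becomes durable in txn_commit(), there is no window
in which the filesystem and the changelog can diverge, unlike the daemon +
database split.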

If you want to build a userspace database for random query operations (rather
than just sequential processing) on top of the persistent transactional logs
written by the filesystem, that is easily done.  It would also handle the case
where the database needs to be stopped and restarted for upgrade (or crashes,
or whatever), without losing filesystem events or having to do a full rescan.


As a side note, if logging to files internally, there needs to be some kind
of tree structure (we have a "catalog" that references multiple log files)
both to be able to hold a large number of log entries, and to allow releasing
space from old log files once all of the records have been processed.  Doing
punch operations for each record is inefficient, and if the records are less
than a block in size it will not actually release space.

The catalog/llog file format we use has a header with a bitmap at the start,
and records are appended sequentially to the file (or to multiple llog files
in parallel, if this becomes a performance issue).  When llog records are
processed/cancelled, only the header bitmap needs to be updated.  When all
of the records in the llog file are finished, it is deleted, and its record
in the catalog is cancelled.  That allows efficient writing, reading, and
cancellation of llog records (even if they are not processed strictly
sequentially).
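
A minimal sketch of that bitmap scheme (field names invented; the real
on-disk llog format has more detail than this) might look like:

```c
#include <stdint.h>

/* Toy model of an llog file header bitmap (field names invented; the
 * real on-disk format differs).  Records append sequentially, and
 * cancelling a record only flips a header bit, so no per-record punch
 * is needed to release its space. */

#define LLOG_MAX 64

struct llog_file {
    uint64_t bitmap;    /* bit i set => record i is still live */
    unsigned nrecs;     /* records appended so far */
};

/* Append a record, returning its index, or -1 if the file is full
 * (at which point a new llog file would be added to the catalog). */
static int llog_append(struct llog_file *f)
{
    if (f->nrecs >= LLOG_MAX)
        return -1;
    f->bitmap |= 1ULL << f->nrecs;
    return (int)f->nrecs++;
}

/* Processing/cancelling a record is a header-only update. */
static void llog_cancel(struct llog_file *f, int idx)
{
    f->bitmap &= ~(1ULL << idx);
}

/* Once every record is cancelled, the whole file can be unlinked and
 * its entry cancelled in the catalog, releasing space in bulk. */
static int llog_empty(const struct llog_file *f)
{
    return f->nrecs > 0 && f->bitmap == 0;
}
```

Deleting a fully cancelled file releases all of its space at once, which is
what avoids the per-record punch problem described above.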

Cheers, Andreas




