Re: [LSF/MM TOPIC] Filesystem Change Journal API

On Jan 22, 2018, at 2:20 AM, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> 
> Change Journal [1] (a.k.a USN) is a popular feature of NTFS v3, used by
> backup and indexing applications to monitor changes to a file system
> in a reliable, durable and scalable manner.
> 
> This year, I would like to discuss solutions to address the reliability
> and durability aspects of Linux filesystem change tracking.
> 
> Some Linux filesystems are already journaling everything (e.g. ubifs),
> so providing the Change Journal feature to applications is probably just
> a matter of providing an API to retrieve the latest USN and enumerate
> changes within a USN range.
> 
> Some Linux filesystems store USN-like information in metadata, but it is
> not exposed to userspace in a standard way that could be used by change
> tracking applications. For example, XFS stores LSN (transaction id) in
> inodes, so it should be possible to enumerate inodes that were changed
> since the last queried LSN value.
> 
> A more generic approach, for filesystems with no USN-like information,
> would be to provide an external change journal facility, much like what
> JBD2 does, but not in the block level. This facility could hook as a
> consumer of filesystem notifications as an fsnotify backend and provide
> record and enumerate capabilities for filesystem operations.
> 
> With the external change journal approach, care would have to be taken to
> account for the fact that filesystem changes become persistent later than
> the time they are reported to fsnotify, so at least a transaction commit
> event (with USN) would need to be reported to fsnotify.
> 
> The user API to retrieve change journal information should be standard,
> whether the change journal is a built in filesystem feature or using the
> external change journal. The fanotify API is a good candidate for change
> journal API, because it already defines a standard way of reporting
> filesystem changes. Naturally, the API would have to be extended to cater
> to the needs of a change journal and would require users to explicitly
> opt in to the new API (e.g. FAN_CLASS_CHANGE_JOURNAL).
> 
> It is possible (?) that network filesystems could also make use of a
> kernel change journal API to refresh client caches after server reboot in
> a more efficient and scalable manner.

I won't be able to make it to LSF/MM this year, but it is worthwhile to
mention that Lustre also implements a persistent change log for filesystem
operations.  I don't think there is any code that could be re-used directly
in the VFS, but the implementation ideas are worthwhile to discuss, and
having a standardized interface for such operations could also be useful.

I'm not familiar with the NTFS USN implementation, but my brief read shows
that it has several of the same concepts.  We use this for a variety of
purposes, mostly backup/restore/resync/HSM-type operations, though auditing
support was added recently.  The logs are persistent, and the log records
are atomic (journaled) with the actual filesystem updates.  Log cancellation
is lazy, but external sync tools keep their own sequence ID for tracking
which updates have been applied (e.g. for remote filesystem sync) so they
will only cancel records after they have been committed to remote storage.

The Lustre log records have a 64-bit identifier, a timestamp, and an enum
type that reports what operation was done on the file (e.g. rename, open,
close, unlink, link, setxattr, etc.), and the same bitmask can be used to
toggle whether these log records are stored at all (e.g. logging all open
and close operations is more expensive than you want by default, but they
can be logged, including failed opens, for audit purposes).  Each log record
has a separate flags field that indicates which optional fields are in the
record (e.g. UID/GID for open, client node ID, parent FID for rename, etc.),
so that the records can be extended as new fields are needed.

No data or pathnames are stored in the log, only the FID (File ID ~= inode),
which can be used to open files by handle.  This allows resync/backup tools
to access the current version of the data, as we don't really care about any
intermediate versions of the file data that was written.  If the file no
longer exists, then we can't back it up anyway.

The ChangeLog can have multiple consumers registered, so the on-disk logfile
will only cancel records up to the oldest consumer.  If some consumer is
too far behind (either by date or number of records) and free space is short,
there is an option to automatically deregister that consumer and free
records up to the next-oldest consumer; the deregistered consumer would
presumably have to re-scan the filesystem to get its state back in sync.
This avoids running out of space if someone registers a consumer and it
doesn't run for a few weeks (or ever).

For the userspace access, you want an interface that can pass tens to
hundreds of thousands of records per second so that ongoing log processing
doesn't become the bottleneck for filesystem operations.  We currently use
a char device, but presumably the preference in newer kernels would be a
netlink socket.

Cheers, Andreas




