Re: [LSF/MM TOPIC] Filesystem Change Journal API

Amir Goldstein <amir73il@xxxxxxxxx> · Tue, 23 Jan 2018 12:37:50 +0200

On Tue, Jan 23, 2018 at 12:10 AM, Andreas Dilger <adilger@xxxxxxxxx> wrote:
> On Jan 22, 2018, at 2:20 AM, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>>
>> Change Journal [1] (a.k.a USN) is a popular feature of NTFS v3, used by
>> backup and indexing applications to monitor changes to a file system
>> in a reliable, durable and scalable manner.
>>
>> This year, I would like to discuss solutions to address the reliability
>> and durability aspects of Linux filesystem change tracking.
>>
>> Some Linux filesystems are already journaling everything (e.g. ubifs),
>> so providing the Change Journal feature to applications is probably just
>> a matter of providing an API to retrieve latest USN and enumerate changes
>> within USN range.
>>
>> Some Linux filesystems store USN-like information in metadata, but it is
>> not exposed to userspace in a standard way that could be used by change
>> tracking applications. For example, XFS stores LSN (transaction id) in
>> inodes, so it should be possible to enumerate inodes that were changed
>> since a last known queried LSN value.
>>
>> A more generic approach, for filesystems with no USN-like information,
>> would be to provide an external change journal facility, much like what
>> JBD2 does, but not at the block level. This facility could hook as a
>> consumer of filesystem notifications as an fsnotify backend and provide
>> record and enumerate capabilities for filesystem operations.
>>
>> With the external change journal approach, care would have to be taken to
>> account for the fact that filesystem changes become persistent later than
>> the time they are reported to fsnotify, so at least a transaction commit
>> event (with USN) would need to be reported to fsnotify.
>>
>> The user API to retrieve change journal information should be standard,
>> whether the change journal is a built in filesystem feature or using the
>> external change journal. The fanotify API is a good candidate for change
>> journal API, because it already defines a standard way of reporting
>> filesystem changes. Naturally, the API would have to be extended to cater
>> to the needs of a change journal API and would require the user to explicitly
>> opt in to the new API (e.g. FAN_CLASS_CHANGE_JOURNAL).
>>
>> It is possible (?) that networking filesystems could also make use of a
>> kernel change journal API to refresh client caches after server reboot in
>> a more efficient and scalable manner.
>
> I won't be able to make it to LSF/MM this year, but it is worthwhile to
> mention that Lustre also implements a persistent change log for filesystem
> operations.  I don't think there is any code that could be re-used directly
> in the VFS, but the implementation ideas are worthwhile to discuss, and
> having a standardized interface for such operations could also be useful.
>
> I'm not familiar with the NTFS USN implementation, but my brief read shows
> that it has several of the same concepts.  We use this for a variety of
> reasons, but mostly for backup/restore/resync/HSM type operations, but
> auditing was added recently.  The logs are persistent, and the log records
> are atomic (journaled) with the actual filesystem updates.  Log cancellation
> is lazy, but external sync tools keep their own sequence ID for tracking
> which updates have been applied (e.g. for remote filesystem sync) so they
> will only cancel records after they have been committed to remote storage.
>
> The Lustre log records have a 64-bit identifier, a timestamp, and an enum
> type that reports what operation was done on the file (e.g. rename, open,
> close, unlink, link, setxattr, etc), and the same bitmask can be used to
> toggle whether these log records are stored or not (e.g. logging all open
> and close operations is more expensive than you want by default, but can be
> logged (including failed open) for audit purposes).  Each log record has a
> separate flags field that indicates which fields are in the record (e.g. UID/GID
> for open, client node ID, parent FID for rename, etc.), so that the records
> can be extended as new fields are needed.

OK. That is a point where the current fanotify event metadata is lacking:
it has a fixed struct layout, like stat does, so we need to extend the
structure, similar to statx, to provide extra optional info on the event.
My super block fanotify watch patches add extra info to the event (FID)
when the user opts in to get it, but they don't allow specific events
to report whether the FID information is available (because this mode
of operation requires it for all events).
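
To make the statx-like idea concrete, here is a minimal sketch of a record
whose header carries a bitmask saying which optional fields follow. Every
name in it (ev_hdr, ev_fields, the EV_INFO_* bits, ev_build, ev_parse) is
hypothetical and made up for illustration; it is not the fanotify ABI, not
what my patches implement, and not the Lustre record format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "field present" bits, in the spirit of the statx mask. */
#define EV_INFO_FID        (1u << 0)
#define EV_INFO_PARENT_FID (1u << 1)
#define EV_INFO_UID_GID    (1u << 2)

/* Hypothetical fixed header; optional fields follow in bit order. */
struct ev_hdr {
    uint32_t len;   /* total record length, header included */
    uint32_t info;  /* EV_INFO_* bits for the fields that follow */
    uint64_t mask;  /* which operation happened */
};

struct ev_fields {
    uint64_t mask;
    uint32_t info;
    uint64_t fid, parent_fid;
    uint32_t uid, gid;
};

/* Header plus every optional field known to this sketch. */
#define EV_MAX_LEN (sizeof(struct ev_hdr) + 8 + 8 + 4 + 4)

static size_t ev_put(unsigned char *rec, size_t off, const void *v, size_t n)
{
    memcpy(rec + off, v, n);
    return off + n;
}

static int ev_get(const unsigned char *rec, size_t *off, size_t end,
                  void *v, size_t n)
{
    if (end - *off < n)
        return -1;              /* field would run past the record */
    memcpy(v, rec + *off, n);
    *off += n;
    return 0;
}

/* Serialize only the fields selected by in->info; returns record length. */
size_t ev_build(unsigned char *rec, size_t cap, const struct ev_fields *in)
{
    struct ev_hdr h = { 0, in->info, in->mask };
    size_t off = sizeof(h);

    assert(cap >= EV_MAX_LEN);
    if (in->info & EV_INFO_FID)
        off = ev_put(rec, off, &in->fid, sizeof(in->fid));
    if (in->info & EV_INFO_PARENT_FID)
        off = ev_put(rec, off, &in->parent_fid, sizeof(in->parent_fid));
    if (in->info & EV_INFO_UID_GID) {
        off = ev_put(rec, off, &in->uid, sizeof(in->uid));
        off = ev_put(rec, off, &in->gid, sizeof(in->gid));
    }
    h.len = (uint32_t)off;
    memcpy(rec, &h, sizeof(h));
    return off;
}

/*
 * Parse one record. Bits this reader does not know are ignored, and any
 * bytes after the known fields are skipped because h.len covers them.
 * Returns 0 on success, -1 if the record is truncated or malformed.
 */
int ev_parse(const unsigned char *rec, size_t avail, struct ev_fields *out)
{
    struct ev_hdr h;
    size_t off = sizeof(h);

    if (avail < sizeof(h))
        return -1;
    memcpy(&h, rec, sizeof(h));
    if (h.len < sizeof(h) || h.len > avail)
        return -1;
    memset(out, 0, sizeof(*out));
    out->mask = h.mask;
    out->info = h.info;
    if ((h.info & EV_INFO_FID) &&
        ev_get(rec, &off, h.len, &out->fid, sizeof(out->fid)))
        return -1;
    if ((h.info & EV_INFO_PARENT_FID) &&
        ev_get(rec, &off, h.len, &out->parent_fid, sizeof(out->parent_fid)))
        return -1;
    if ((h.info & EV_INFO_UID_GID) &&
        (ev_get(rec, &off, h.len, &out->uid, sizeof(out->uid)) ||
         ev_get(rec, &off, h.len, &out->gid, sizeof(out->gid))))
        return -1;
    return 0;
}
```

The point of the layout is that each event says for itself which fields it
carries, so new optional fields can be appended later without breaking old
readers, as long as new bits are only ever added after the existing ones.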

>
> No data or pathnames are stored in the log, only the FID (File ID ~= inode),
> which can be used to open files by handle.  This allows resync/backup tools
> to access the current version of the data, as we don't really care about any
> intermediate versions of the file data that was written.  If the file no
> longer exists, then we can't back it up anyway.

So we reached the same solution. That's a good validation of a design :)
My patches actually do include optional NAME_INFO, but only for rename/
delete/create events along with parent FID. I suppose that is not what you
mean by storing pathnames in the logs.

>
> The ChangeLog can have multiple consumers registered, so the on-disk logfile
> will only cancel records up to the oldest consumer.  If some consumer is
> too far behind (either by date or number of records) and free space is short,
> there is an option to automatically deregister the changelog consumer and
> free up to the next consumer, and it would presumably have to re-scan the
> filesystem to get its state in sync.  This avoids running out of space if
> someone registers a user and it doesn't run for a few weeks (or ever).
>
> For the userspace access, you want an interface that can pass 10's-100's
> of thousands of records per second so that ongoing log processing doesn't
> become the bottleneck for filesystem operations.  We currently use a char
> device, but presumably newer kernel preference would be a netlink socket.
>
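
The multi-consumer cancellation rule described above can be sketched as
follows. This is only an illustration under stated assumptions: the
struct and function names are hypothetical, record IDs are assumed to be
consecutive integers starting at 1, and a laggard is judged by record
count only, whereas the description above also mentions age and free
space as criteria. It is not Lustre code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CONSUMERS 8

/* Hypothetical registered consumer of the change log. */
struct consumer {
    int      active;
    uint64_t acked;   /* highest record id this consumer has applied */
};

/* Hypothetical change log state. */
struct changelog {
    uint64_t first;   /* oldest record id still stored */
    uint64_t next;    /* id the next new record will get */
    struct consumer c[MAX_CONSUMERS];
};

/*
 * Highest record id that may be cancelled: the smallest acked id among
 * active consumers, or everything stored so far if nobody is registered.
 */
uint64_t cl_purge_point(const struct changelog *cl)
{
    uint64_t min = cl->next - 1;

    for (size_t i = 0; i < MAX_CONSUMERS; i++)
        if (cl->c[i].active && cl->c[i].acked < min)
            min = cl->c[i].acked;
    return min;
}

/* Cancel records up to the purge point; never moves 'first' backwards. */
void cl_trim(struct changelog *cl)
{
    uint64_t purge = cl_purge_point(cl);

    assert(cl->first <= cl->next);
    if (purge + 1 > cl->first)
        cl->first = purge + 1;
}

/*
 * Deregister consumers that are more than max_backlog records behind.
 * Returns how many were dropped; each of them would have to re-scan the
 * filesystem to get back in sync.
 */
int cl_evict_laggards(struct changelog *cl, uint64_t max_backlog)
{
    int dropped = 0;

    for (size_t i = 0; i < MAX_CONSUMERS; i++) {
        if (!cl->c[i].active)
            continue;
        if (cl->next - 1 - cl->c[i].acked > max_backlog) {
            cl->c[i].active = 0;
            dropped++;
        }
    }
    return dropped;
}
```

Calling cl_evict_laggards() before cl_trim() is what lets the log free
space up to the next consumer when one consumer has stopped running.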

The fanotify_init syscall returns a file descriptor, and that file
descriptor can be polled/read for multiple events. I think that is a solid
API for getting events. There is also a special FAN_Q_OVERFLOW event,
reported today when the limited in-memory queue overflows, that could be
used to report out of disk space/quota with a persistent event queue.
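
As a rough sketch of the consumer side, this is how a buffer returned by
read() on a fanotify file descriptor can be walked with the existing
FAN_EVENT_OK/FAN_EVENT_NEXT macros from <sys/fanotify.h>, noting
FAN_Q_OVERFLOW along the way. The scan_events() helper and struct
scan_result are hypothetical names. The fanotify_init()/fanotify_mark()
setup is left out, and so is closing each event's fd, which a real reader
has to do.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/fanotify.h>

/* Hypothetical summary of one read() worth of events. */
struct scan_result {
    unsigned events;      /* ordinary events seen */
    int      overflowed;  /* non-zero if FAN_Q_OVERFLOW was reported */
};

/*
 * Walk a buffer filled by read() on a fanotify fd. After an overflow the
 * reader has lost events and would need to re-scan to get back in sync.
 */
struct scan_result scan_events(void *buf, size_t nbytes)
{
    struct scan_result r = { 0, 0 };
    struct fanotify_event_metadata *ev = buf;
    long len = (long)nbytes;

    assert(buf != NULL);
    while (FAN_EVENT_OK(ev, len)) {
        /* Stop if the kernel's metadata layout is not the one we built for. */
        if (ev->vers != FANOTIFY_METADATA_VERSION)
            break;
        if (ev->mask & FAN_Q_OVERFLOW)
            r.overflowed = 1;
        else
            r.events++;
        ev = FAN_EVENT_NEXT(ev, len);
    }
    return r;
}
```

A change journal consumer would treat r.overflowed the same way a
deregistered ChangeLog consumer is treated above: fall back to a full scan.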

It'd be very interesting to make Lustre a case study for a generic ChangeLog
API. Can you point me in the direction of an application that uses the Lustre
custom ChangeLog API?

Thanks for sharing this information.
Amir.