On Jan 23, 2018, at 3:37 AM, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>
> On Tue, Jan 23, 2018 at 12:10 AM, Andreas Dilger <adilger@xxxxxxxxx> wrote:
>> On Jan 22, 2018, at 2:20 AM, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>>>
>>> Change Journal [1] (a.k.a. USN) is a popular feature of NTFS v3, used by
>>> backup and indexing applications to monitor changes to a file system
>>> in a reliable, durable and scalable manner.
>>>
>>> This year, I would like to discuss solutions to address the reliability
>>> and durability aspects of Linux filesystem change tracking.
>>>
>>> Some Linux filesystems already journal everything (e.g. ubifs), so
>>> providing the Change Journal feature to applications is probably just a
>>> matter of providing an API to retrieve the latest USN and enumerate
>>> changes within a USN range.
>>>
>>> Some Linux filesystems store USN-like information in metadata, but it is
>>> not exposed to userspace in a standard way that could be used by change
>>> tracking applications. For example, XFS stores an LSN (transaction id)
>>> in inodes, so it should be possible to enumerate inodes that have
>>> changed since a last known queried LSN value.
>>>
>>> A more generic approach, for filesystems with no USN-like information,
>>> would be to provide an external change journal facility, much like what
>>> JBD2 does, but not at the block level. This facility could hook in as a
>>> consumer of filesystem notifications, as an fsnotify backend, and
>>> provide record and enumerate capabilities for filesystem operations.
>>>
>>> With the external change journal approach, care would have to be taken
>>> to account for the fact that filesystem changes become persistent later
>>> than the time they are reported to fsnotify, so at least a transaction
>>> commit event (with USN) would need to be reported to fsnotify.
>>>
>>> The user API to retrieve change journal information should be standard,
>>> whether the change journal is a built-in filesystem feature or uses the
>>> external change journal. The fanotify API is a good candidate for a
>>> change journal API, because it already defines a standard way of
>>> reporting filesystem changes. Naturally, the API would have to be
>>> extended to cater to the needs of a change journal API and would
>>> require the user to explicitly opt in to the new API (e.g.
>>> FAN_CLASS_CHANGE_JOURNAL).
>>>
>>> It is possible (?) that networking filesystems could also make use of a
>>> kernel change journal API to refresh client caches after a server
>>> reboot in a more efficient and scalable manner.
>>
>> I won't be able to make it to LSF/MM this year, but it is worthwhile to
>> mention that Lustre also implements a persistent change log for filesystem
>> operations. I don't think there is any code that could be re-used directly
>> in the VFS, but the implementation ideas are worthwhile to discuss, and
>> having a standardized interface for such operations could also be useful.
>>
>> I'm not familiar with the NTFS USN implementation, but my brief read shows
>> that it has several of the same concepts. We use this for a variety of
>> reasons, mostly for backup/restore/resync/HSM type operations, but
>> auditing was added recently. The logs are persistent, and the log records
>> are atomic (journaled) with the actual filesystem updates. Log
>> cancellation is lazy, but external sync tools keep their own sequence ID
>> for tracking which updates have been applied (e.g. for remote filesystem
>> sync), so they will only cancel records after they have been committed to
>> remote storage.
>>
>> The Lustre log records have a 64-bit identifier, a timestamp, and an enum
>> type that reports what operation was done on the file (e.g.
>> rename, open, close, unlink, link, setxattr, etc.), and the same bitmask
>> can be used to toggle whether these log records are stored or not (e.g.
>> logging all open and close operations is more expensive than you want by
>> default, but they can be logged (including failed opens) for audit
>> purposes). Each log record has a separate flags field that indicates
>> which fields are in the record (e.g. UID/GID for open, client node ID,
>> parent FID for rename, etc.), so that the records can be extended as new
>> fields are needed.
>
> OK. That is a point where the current fanotify event metadata is lacking:
> it has fixed struct values, like stat does, so we need to extend this
> structure, similar to statx, to provide extra optional info in the event.
> My super block fanotify watch patches add extra info (FID) to the event
> when the user opts in to get this extra info, but they don't allow
> specific events to report whether the FID information is available or not
> (because it is required for all events in this mode of operation).
>
>
>>
>> No data or pathnames are stored in the log, only the FID (File ID ~=
>> inode), which can be used to open files by handle. This allows
>> resync/backup tools to access the current version of the data, as we
>> don't really care about any intermediate versions of the file data that
>> was written. If the file no longer exists, then we can't back it up
>> anyway.
>
> So we reached the same solution. That's a good validation of a design :)
> My patches actually do include an optional NAME_INFO, but only for
> rename/delete/create events, along with the parent FID. I suppose that is
> not what you mean by storing pathnames in the logs.

Right, we also store the from/to filename in the create, link, rename, and
unlink records, but only relative to the parent FID, not the full pathname.
One implementation detail: since the rename record can be fairly large, it
is actually written as two separate records - a rename source and a rename
target - so that the default records only need to hold a single filename
component.

>> The ChangeLog can have multiple consumers registered, so the on-disk
>> logfile will only cancel records up to the oldest consumer. If some
>> consumer is too far behind (either by date or number of records) and
>> free space is short, there is an option to automatically deregister that
>> changelog consumer and free records up to the next consumer; it would
>> presumably have to re-scan the filesystem to get its state in sync. This
>> avoids running out of space if someone registers a consumer and it
>> doesn't run for a few weeks (or ever).
>>
>> For the userspace access, you want an interface that can pass tens to
>> hundreds of thousands of records per second, so that ongoing log
>> processing doesn't become the bottleneck for filesystem operations. We
>> currently use a char device, but presumably the newer kernel preference
>> would be a netlink socket.
>>
>
> The fanotify_init syscall returns a file descriptor, and that file
> descriptor can be polled/read for multiple events. I think that is a
> solid API for getting events. There is also a special FAN_Q_OVERFLOW
> event for the limited in-memory queue, which could be used for
> out-of-disk-space/quota conditions with a persistent event queue.
>
> It would be very interesting to make Lustre a case study for a generic
> ChangeLog API. Can you point me in the direction of an application that
> uses the Lustre custom ChangeLog API?

The underlying kernel-facing API is abstracted from applications via a
library, so that we would be able to change the implementation in the
future.
The library calls are in
https://git.hpdd.intel.com/?p=fs/lustre-release.git;a=blob;f=lustre/utils/liblustreapi_chlg.c
llapi_changelog_start(), llapi_changelog_fini(), llapi_changelog_recv(),
llapi_changelog_clear(), and a sample user tool that dumps the changelog
records is "lustre_rsync", which uses the changelog to resync a Lustre
filesystem to another system without doing full namespace scans each time:
https://git.hpdd.intel.com/?p=fs/lustre-release.git;a=blob;f=lustre/utils/lustre_rsync.c

Cheers, Andreas