On Jan 23, 2018, at 3:37 AM, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>
> On Tue, Jan 23, 2018 at 12:10 AM, Andreas Dilger <adilger@xxxxxxxxx> wrote:
>> On Jan 22, 2018, at 2:20 AM, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>>>
>>> Change Journal [1] (a.k.a. USN) is a popular feature of NTFS v3, used by
>>> backup and indexing applications to monitor changes to a file system
>>> in a reliable, durable and scalable manner.
>>>
>>> This year, I would like to discuss solutions to address the reliability
>>> and durability aspects of Linux filesystem change tracking.
>>>
>>> Some Linux filesystems already journal everything (e.g. ubifs), so
>>> providing the Change Journal feature to applications is probably just a
>>> matter of providing an API to retrieve the latest USN and enumerate
>>> changes within a USN range.
>>>
>>> Some Linux filesystems store USN-like information in metadata, but it is
>>> not exposed to userspace in a standard way that could be used by change
>>> tracking applications. For example, XFS stores an LSN (transaction id)
>>> in inodes, so it should be possible to enumerate inodes that have
>>> changed since a last known queried LSN value.
>>>
>>> A more generic approach, for filesystems with no USN-like information,
>>> would be to provide an external change journal facility, much like what
>>> JBD2 does, but not at the block level. This facility could hook in as a
>>> consumer of filesystem notifications, as an fsnotify backend, and
>>> provide record and enumerate capabilities for filesystem operations.
>>>
>>> With the external change journal approach, care would have to be taken
>>> to account for the fact that filesystem changes become persistent later
>>> than the time they are reported to fsnotify, so at least a transaction
>>> commit event (with USN) would need to be reported to fsnotify.
>>>
>>> The user API to retrieve change journal information should be standard,
>>> whether the change journal is a built-in filesystem feature or uses the
>>> external change journal. The fanotify API is a good candidate for a
>>> change journal API, because it already defines a standard way of
>>> reporting filesystem changes. Naturally, the API would have to be
>>> extended to cater to the needs of a change journal API and would
>>> require the user to explicitly opt in to the new API (e.g.
>>> FAN_CLASS_CHANGE_JOURNAL).
>>>
>>> It is possible (?) that networking filesystems could also make use of a
>>> kernel change journal API to refresh client caches after a server
>>> reboot in a more efficient and scalable manner.
>>
>> I won't be able to make it to LSF/MM this year, but it is worthwhile to
>> mention that Lustre also implements a persistent change log for filesystem
>> operations. I don't think there is any code that could be re-used directly
>> in the VFS, but the implementation ideas are worthwhile to discuss, and
>> having a standardized interface for such operations could also be useful.
>>
>> I'm not familiar with the NTFS USN implementation, but my brief read shows
>> that it has several of the same concepts. We use this for a variety of
>> reasons, mostly for backup/restore/resync/HSM type operations, but
>> auditing was added recently. The logs are persistent, and the log records
>> are atomic (journaled) with the actual filesystem updates. Log
>> cancellation is lazy, but external sync tools keep their own sequence ID
>> for tracking which updates have been applied (e.g. for remote filesystem
>> sync), so they will only cancel records after they have been committed to
>> remote storage.
>>
>> The Lustre log records have a 64-bit identifier, a timestamp, and an enum
>> type that reports what operation was done on the file (e.g.
>> rename, open, close, unlink, link, setxattr, etc.), and the same bitmask
>> can be used to toggle whether these log records are stored or not (e.g.
>> logging all open and close operations is more expensive than you want by
>> default, but they can be logged (including failed opens) for audit
>> purposes). Each log record has a separate flags field that indicates
>> which fields are in the record (e.g. UID/GID for open, client node ID,
>> parent FID for rename, etc.), so that the records can be extended as new
>> fields are needed.
>
> OK. That is a point where the current fanotify event metadata is lacking:
> it has fixed struct values, like stat does, so we need to extend this
> structure, similar to statx, to provide extra optional info in the event.
> My super block fanotify watch patches add extra info (FID) to the event
> when the user opts in to get this extra info, but they don't allow
> specific events to report whether the FID information is available or not
> (because it is required for all events in this mode of operation).
>
>
>>
>> No data or pathnames are stored in the log, only the FID (File ID ~=
>> inode), which can be used to open files by handle. This allows
>> resync/backup tools to access the current version of the data, as we
>> don't really care about any intermediate versions of the file data that
>> was written. If the file no longer exists, then we can't back it up
>> anyway.
>
> So we reached the same solution. That's a good validation of a design :)
> My patches actually do include an optional NAME_INFO, but only for
> rename/delete/create events, along with the parent FID. I suppose that is
> not what you mean by storing pathnames in the logs.

Right, we also store the from/to filename in the create, link, rename, and
unlink records, but only relative to the parent FID, not the full pathname.
One implementation detail: since the rename record can be fairly large, it
is actually written as two separate records - a rename source and a rename
target - so that the default records only need to hold a single filename
component.

>> The ChangeLog can have multiple consumers registered, so the on-disk
>> logfile will only cancel records up to the oldest consumer. If some
>> consumer is too far behind (either by date or number of records) and
>> free space is short, there is an option to automatically deregister that
>> changelog consumer and free records up to the next consumer; it would
>> presumably have to re-scan the filesystem to get its state in sync. This
>> avoids running out of space if someone registers a consumer and it
>> doesn't run for a few weeks (or ever).
>>
>> For the userspace access, you want an interface that can pass tens to
>> hundreds of thousands of records per second, so that ongoing log
>> processing doesn't become the bottleneck for filesystem operations. We
>> currently use a char device, but presumably the newer kernel preference
>> would be a netlink socket.
>>
>
> The fanotify_init syscall returns a file descriptor, and that file
> descriptor can be polled/read for multiple events. I think that is a
> solid API for getting events. There is also a special FAN_Q_OVERFLOW
> event for the limited in-memory queue, which could be used for
> out-of-disk-space/quota conditions with a persistent event queue.
>
> It would be very interesting to make Lustre a case study for a generic
> ChangeLog API. Can you point me in the direction of an application that
> uses the Lustre custom ChangeLog API?

The underlying kernel-facing API is abstracted from applications via a
library, so that we would be able to change the implementation in the
future.
The library calls are in
https://git.hpdd.intel.com/?p=fs/lustre-release.git;a=blob;f=lustre/utils/liblustreapi_chlg.c
llapi_changelog_start(), llapi_changelog_fini(), llapi_changelog_recv(),
llapi_changelog_clear(), and a sample user tool that dumps the changelog
records is "lustre_rsync", which uses the changelog to resync a Lustre
filesystem to another system without doing full namespace scans each time:
https://git.hpdd.intel.com/?p=fs/lustre-release.git;a=blob;f=lustre/utils/lustre_rsync.c

Cheers, Andreas