Amir Goldstein <amir73il@xxxxxxxxx> wrote:

> > But after I've written and sync'd the data, I set the xattr to mark the
> > file not open. At the moment I'm doing this too lazily, only doing it
> > when a netfs file gets evicted or when the cache gets withdrawn, but I
> > really need to add a queue of objects to be sealed as they're closed. The
> > balance is working out how often to do the sealing as something like a
> > shell script can do a lot of consecutive open/write/close ops.
>
> You could add an internal vfs API wait_for_multiple_inodes_to_be_synced().
> For example, xfs keeps the "LSN" on each inode, so once the transaction
> with some LSN has been committed, all the relevant inodes, if not dirty, can
> be declared as synced, without having to call fsync() on any file and without
> having to force transaction commit or any IO at all.
>
> Since fscache takes care of submitting the IO, and it shouldn't care about any
> specific time that the data/metadata hits the disk(?), you can make use of the
> existing periodic writeback and rolling transaction commit and only ever need
> to wait for that to happen before marking cache files "closed".
>
> There was a discussion about fsyncing a range of files on LSFMM [1].
> In the last comment on the article dchinner argues why we already have that
> API (and now also with io_uring(), but AFAIK, we do not have a useful
> wait_for_sync() API. And it doesn't need to be exposed to userspace at all.
>
> [1] https://lwn.net/Articles/789024/

This sounds like an interesting idea. Actually, what I probably want is a
notification to say that a particular object has been completely sync'd to
disk, metadata and all.

I'm not sure that io_uring is particularly usable from within the kernel,
though.

> If I were you, I would try to avoid re-implementing a journaled filesystem or
> a database for fscache and try to make use of crash consistency guarantees
> that filesystems already provide.
> Namely, use the data dependency already provided by temp files.
> It doesn't need to be one temp file per cached file.
>
> Always easier said than done ;-)

Yes. There are a number of considerations I have to deal with, and they're
somewhat at odds with each other:

 (1) I need to record what data I have stored from a file.

 (2) I need to record where I stored the data.

 (3) I need to make sure that I don't see old data.

 (4) I need to make sure that I don't see data in the wrong file.

 (5) I need to make sure I lose as little as possible on a crash.

 (6) I want to be able to record what changes were made in the event we're
     disconnected from the server.

For my fscache-iter branch, (1) is done with a map in an xattr, but I only
cache up to 1G in a file at the moment; (2), (4) and, to some extent, (5) are
handled by the backing fs; (3) is handled by tagging the file and storing
coherency data in an xattr (though tmpfiles are used on full invalidation).
(6) is not yet supported.

For upstream, (1), (2), (4) and, to some extent, (5) are handled through the
backing fs. (3) is handled by storing coherency data in an xattr and
truncating the file on invalidation; (6) is not yet supported.

However, there are some performance problems arising in my fscache-iter
branch:

 (1) It's doing a lot of synchronous metadata operations (tmpfile, truncate,
     setxattr).

 (2) It's retaining a lot of open file structs on cache files. Cachefiles
     opens the file when it's first asked to access it and retains that till
     the cookie is relinquished or the cache withdrawn (the file* doesn't
     contribute to ENFILE/EMFILE but it still eats memory).

     I can mitigate this by closing much sooner, perhaps opening the file for
     each operation - but at the cost of having to spend time doing more
     opens and closes. What's in upstream gets away without having to do
     open/close for reads because it calls readpage.

     Alternatively, I can have a background file closer - which requires an
     LRU queue.
     This could be combined with a file "sealer".

     Deferred writeback on the netfs starting writes to the cache makes this
     more interesting as I have to retain the interest on the cache object
     beyond the netfs file being closed.

 (3) Trimming excess data from the end of the cache file. The problem with
     using DIO to write to the cache is that the write has to be rounded up
     to a multiple of the backing fs DIO blocksize, but if the file is
     truncated larger, that excess data now becomes part of the file.

     Possibly it's sufficient to just clear the excess page space before
     writing, but that doesn't necessarily stop a writable mmap from
     scribbling on it.

 (4) Committing outstanding cache metadata at cache withdrawal or netfs
     unmount. I've previously mentioned this: it ends up with a whole slew
     of synchronous metadata changes being committed to the cache in one go
     (truncates, fallocates, fsync, xattrs, unlink+link of tmpfile) - and
     this can take quite a long time. The cache needs to be more proactive
     in getting stuff committed as it goes along.

 (5) Attaching to an object requires a pathwalk to it (normally only two
     steps) and then reading various xattrs on it - all synchronous, but can
     be punted to a background threadpool.

Amongst the reasons I was considering moving to an index and a single
datafile are replacing the path-lookup step for each object and the xattr
reads with a lookup in a single file, and reducing the number of open files
in the cache at any one time to around four.

David
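
For illustration only, the rounding described in performance problem (3) can be sketched as simple arithmetic. This is a minimal userspace sketch, not cachefiles code: the function names are hypothetical, and the 4096-byte block size in the usage note is an assumption (the real DIO block size depends on the backing fs).

```python
def dio_round_up(length: int, blocksize: int) -> int:
    """Round a write length up to the next multiple of the DIO block size.

    `blocksize` is an assumed parameter; the real value comes from the
    backing filesystem.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    if blocksize <= 0:
        raise ValueError("blocksize must be positive")
    # Integer ceiling division, then scale back up to bytes.
    return -(-length // blocksize) * blocksize


def excess_after_write(data_end: int, blocksize: int) -> int:
    """Bytes of padding that land past the real end of the data.

    This is the region that becomes visible file content if the cache
    file is later extended, so it has to be cleared or trimmed.
    """
    return dio_round_up(data_end, blocksize) - data_end
```

With an assumed block size of 4096, data ending at byte 10000 is written out to byte 12288, leaving 2288 bytes of excess; data that already ends on a block boundary leaves none.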