Dave Chinner <david@xxxxxxxxxxxxx> wrote:

> > > There was a discussion about fsyncing a range of files on LSFMM [1].
> > > In the last comment on the article dchinner argues why we already have that
> > > API (and now also with io_uring()), but AFAIK, we do not have a useful
> > > wait_for_sync() API. And it doesn't need to be exposed to userspace at all.
> > >
> > > [1] https://lwn.net/Articles/789024/
> >
> > This sounds like an interesting idea. Actually, what I probably want is a
> > notification to say that a particular object has been completely sync'd to
> > disk, metadata and all.
>
> This isn't hard to do yourself in the kernel. All it takes is a
> workqueue to run vfs_fsync() calls asynchronously and for the work
> to queue a local notification/wakeup when the fsync completes...
>
> That's all aio_fsync() does - the notification it queues on
> completion is the AIO completion event for userspace - so I think
> you could do this in about 50 lines of code if you really needed
> it...

I was thinking more in terms of passively finding out when metadata has been
flushed to disk rather than actively forcing it. Obviously I can manually
flush from a worker thread, but that ties up a thread per file I want to flush
(unless I want to do a higher-level sync).

Btw, looking at aio_fsync(), is there any reason it copies the current creds
rather than just taking a ref on them? (Granted, this may not be a question
for you)

> > However, there are some performance problems arising in my fscache-iter
> > branch:
> >
> >  (1) It's doing a lot of synchronous metadata operations (tmpfile, truncate,
> >      setxattr).
>
> Async pipelines using unbound workqueues are your friend.

Maybe. I could just throw everything into a workqueue and let the workqueue
deal with it.
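For illustration only, here is a rough userspace analogue of the
workqueue-plus-notification pattern discussed above, written in Python rather
than kernel C. It is a sketch under assumptions: fsync_async() and notify()
are hypothetical names, os.fsync() stands in for vfs_fsync(), and a thread
pool stands in for the workqueue. Note that it still ties up one worker per
outstanding fsync, which is the cost mentioned above.

```python
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor


def fsync_async(pool, fd, on_done):
    """Queue an fsync of fd on the pool; call on_done(fd, error) afterwards.

    Hypothetical helper: the work item blocks a pool thread in os.fsync()
    and then raises a local notification, much as a kernel work item might
    call vfs_fsync() and then wake a waiter.
    """
    def work():
        try:
            os.fsync(fd)           # blocks only this worker thread
        except OSError as err:
            on_done(fd, err)       # report the failure to the waiter
        else:
            on_done(fd, None)      # success: no error to report
    return pool.submit(work)


def demo():
    done = threading.Event()
    results = []

    def notify(fd, error):
        # The "local notification": record the outcome and wake the waiter.
        results.append((fd, error))
        done.set()

    with tempfile.TemporaryFile() as f, ThreadPoolExecutor(max_workers=2) as pool:
        f.write(b"cached data")
        f.flush()                  # push Python's buffer down to the OS
        fsync_async(pool, f.fileno(), notify)
        # The caller is free to do other work here before waiting.
        completed = done.wait(timeout=10)
    return completed, results
```

Calling demo() returns whether the notification arrived and the list of
(fd, error) outcomes recorded by the callback.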
There still have to be synchronisation points, though - I can't schedule a
cache-write from a server-read to the cache following a 3rd-party induced
invalidation until after the invalidation has happened - and that holds up
userspace from writing to the cache. But maybe it will work.

Btw, how expensive is it to throw an operation off to a workqueue versus doing
it in-thread? Particularly if it's a synchronous op that the thread is going
to have to wait for (e.g. write_begin()).

> >  (2) It's retaining a lot of open file structs on cache files. Cachefiles
> >      opens the file when it's first asked to access it and retains that till
> >      the cookie is relinquished or the cache withdrawn (the file* doesn't
> >      contribute to ENFILE/EMFILE but it still eats memory).
>
> Sounds similar to the problem that the NFSd open file cache solves.
> (fs/nfsd/filecache.c)

Looks similarish to what I was thinking of with having a queue of
currently-not-in-use cookies to go through and commit and close.

> > but if the file is truncated
> > larger, that excess data now becomes part of the file.
>
> Keep the actual file size in your tracking xattr.

I do that, but it doesn't help entirely. If someone truncates the file larger
and then writes non-contiguously, the problem occurs.

I've tried truncating the file down and then truncating it up, but that
requires two synchronous ops - though the latter is relatively cheap. I've
also tried fallocate() to clear the block. What I've found is that the next
DIO write then has to sync because these may read data into the pagecache of
the backing file.

Apart from clearing the tail of a page on writing, it might be better for me
to read the data into a spare page, clear the tail and write it back.

> > Possibly it's sufficient to just clear the excess page space before
> > writing, but that doesn't necessarily stop a writable mmap from
> > scribbling on it.
>
> We can't stop mmap from scribbling in it.
> All filesystems have this
> problem, so to prevent data leaks we have to zero the post-eof tail
> region on every write of the EOF block, anyway.

I meant an mmap scribbling on it after it's been cleared - but I guess taking
away the PTE-writeable flag and making page_mkwrite() wait should solve that.

> >  (4) Committing outstanding cache metadata at cache withdrawal or netfs
> >      unmount. I've previously mentioned this: it ends up with a whole
> >      slew of synchronous metadata changes being committed to the cache in
> >      one go (truncates, fallocates, fsync, xattrs, unlink+link of tmpfile)
> >      - and this can take quite a long time. The cache needs to be more
> >      proactive in getting stuff committed as it goes along.
>
> Workqueues give you an easy mechanism for async dispatch and
> concurrency for synchronous operations. This is a largely solved
> problem...

Yes and no. Yes, I can fan out the number of threads doing the committing,
but there's still a limit on the I/O bandwidth - and a lot of the operations
still have to hit the disk in the right order. It still stuffs up the user
experience if the cache eats up the entirety of the disk I/O for a few seconds
just because an automount expired.

Probably the progressive committing approach is a better one so that there's
less to do at the end.

> >  (5) Attaching to an object requires a pathwalk to it (normally only two
> >      steps) and then reading various xattrs on it - all synchronous, but can
> >      be punted to a background threadpool.
>
> a.k.a. punting to a workqueue :)

I do that, but it doesn't help so much. Whilst it can mitigate the effect by
running parallel to userspace, userspace tends to move pretty quickly from
open() to read() - at which point we have to wait anyway.

The problem is that all the steps are synchronous and, for the most part, have
to be sequential because there's a dependency chain: 2 x dir-lookup, get LSM
xattrs, get cache xattrs - then read the data if it's present.
I might be able to speculate at the end and read two cache xattrs in parallel,
but each one requires a separate thread to do it.

On top of that, if the user is running a parallel application such as building
a kernel, a CPU running an offloaded I/O thread isn't running a user thread.
What I've found is that increasing the size of the threadpool doesn't actually
affect the time taken.

What I've done in my fscache-iter branch is to have a small thread pool and
offload work to it if there's a thread free - otherwise process the work in
the calling userspace thread and avoid the context switching.

One reason I was wondering about moving to an approach whereby I have an index
that locates all the blocks (which are then kept in a single file) is that I
can probably keep the entire index in RAM and so the lookup costs are vastly
reduced. The downside, as Amir pointed out, is that metadata coherency is much
harder if I don't just want to blow the cache away if the cache isn't properly
committed when the machine is rebooted.

Note that OpenAFS has been using a single-index approach, with each 256K block
of data in its own file. They then zap any file that's newer than the index
file when the cache is started, assuming that that file might be corrupted.

David

--
Linux-cachefs mailing list
Linux-cachefs@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/linux-cachefs