On Fri 15-02-13 18:16:07, Dave Chinner wrote:
> On Thu, Feb 14, 2013 at 02:14:52PM +0100, Jan Kara wrote:
> > Hi,
> >
> > this is a follow up on a discussion started here:
> > http://www.spinics.net/lists/xfs/msg14999.html
> >
> > To just quickly sum up the issue:
> > When project quota gets exceeded, XFS ends up flushing inodes using
> > sync_inodes_sb(). I've tested (in 3.8-rc4) that if one writes 200 MB to a
> > directory with a 100 MB project quota like:
> >   fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
> >   for (i = 0; i < 50000; i++)
> >     pwrite(fd, buf, 4096, i*4096);
> > it takes about 3 s to finish, which is OK. But when there are lots of
> > inodes cached (I've tried with 10000 inodes cached on the fs), the same
> > test program runs ~140 s.
>
> So, you're testing the overhead of ~25,000 ENOSPC flushes. I could
> brush this off and say "stupid application" but I won't....
  Yes, stupid... NFS. This is what happens when an NFS client writes to a
directory that is over its project quota. So as much as I agree the
workload is braindamaged, fixing the client isn't an option and we didn't
find a reasonable fix on the NFS server side either. That's why we ended up
with XFS changes.

> > This is because sync_inodes_sb() iterates over
> > all inodes in the superblock and waits for IO, and this iteration eats
> > CPU cycles.
>
> Yup, exactly what I said here:
>
> http://www.spinics.net/lists/xfs/msg15198.html
>
> Iterating inodes takes a lot of CPU.
>
> I think the difference between the old method and the current one is
> that we only do one inode cache iteration per write(), not one per
> get_blocks() call. Hence we've removed the per-page overhead of
> flushing, and now we just have the inode cache iteration overhead.
>
> The fix to that problem is mentioned here:
>
> http://www.spinics.net/lists/xfs/msg15186.html
>
> Which is to:
>
>   a) throttle speculative allocation as EDQUOT approaches; and
>   b) efficiently track speculative preallocation for all the inodes
>      in the given project, and write back and trim those inodes on
>      ENOSPC.
>
> Both of those are still a work in progress. I was hoping that we'd
> have a) in 3.9, but that doesn't seem likely now the merge window is
> just about upon us....
  Yeah, I know someone is working on a better solution. I was mostly
wondering why writeback_inodes_sb() isn't enough - i.e. why you really need
to wait for IO completion. And you explained that below. Thanks!

> It's trying to prevent the filesystem from falling into worst case IO
> patterns at ENOSPC when there are hundreds of threads banging on the
> filesystem and essentially locking up the system. i.e. we have to
> throttle the IO submission rate from ENOSPC flushing artificially -
> we really only need one thread doing the submission work, so we need
> to throttle all concurrent callers while we are doing that work.
> sync_inodes_sb() does that. And then we need to prevent individual
> callers from trying to allocate space too frequently, which we do by
> waiting for IO that was submitted in the flush. sync_inodes_sb()
> does that, too.
  I see, so the waiting for IO isn't really a correctness thing, but just a
way to slow down processes bringing the fs out of space so that they don't
lock up the system. Thanks for the explanation!

								Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
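
For reference, a self-contained version of the reproducer quoted at the top
of the thread could look like the sketch below. Only the open()/pwrite()
loop and the 50000 x 4 KiB write geometry come from the report; the
argument handling, buffer contents, error checks and the quota-setup hint
in the header comment are filled in here as plausible assumptions, not
taken from the original test program.

/*
 * Minimal sketch of the reproducer described in the thread: write
 * ~200 MB in 4 KiB pwrite()s into a file that lives in a directory
 * with a 100 MB XFS project quota applied (e.g. set up with
 * xfs_quota's "project" and "limit -p bhard=100m" commands on a
 * filesystem mounted with -o prjquota).
 *
 * Usage: ./pwrite_quota <file-inside-project-quota-dir>
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char buf[4096];
	int fd, i;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	memset(buf, 'x', sizeof(buf));

	fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * 50000 * 4096 bytes = ~200 MB, i.e. twice the 100 MB project
	 * quota, so roughly the second half of the writes run into
	 * EDQUOT/ENOSPC flushing.  Write errors are ignored, as in the
	 * original snippet.
	 */
	for (i = 0; i < 50000; i++)
		pwrite(fd, buf, 4096, (off_t)i * 4096);

	close(fd);
	return 0;
}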
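
To make the cost argument concrete, here is a rough userspace model of the
difference between a submit-only flush (writeback_inodes_sb()-style) and a
submit-and-wait flush (sync_inodes_sb()-style) as described above. This is
not kernel code: the structures and function names are invented for
illustration, and the only claim carried over from the thread is that the
wait side walks every cached inode on the superblock, so each ENOSPC flush
scales with the size of the inode cache.

/*
 * Rough userspace model (NOT kernel code) of the cost discussed above.
 * Assumption, per the thread: the submit side only touches inodes that
 * are actually dirty, while the wait side walks every cached inode on
 * the superblock.  All names here are invented for illustration.
 */
#include <stdio.h>

struct model_inode {
	int dirty;		/* has dirty pages to submit */
	int io_in_flight;	/* has IO we would have to wait for */
};

/* writeback_inodes_sb()-like: kick off IO for dirty inodes, don't wait. */
static long submit_dirty(struct model_inode *inodes, long n)
{
	long work = 0;
	long i;

	for (i = 0; i < n; i++) {
		if (inodes[i].dirty) {
			inodes[i].dirty = 0;
			inodes[i].io_in_flight = 1;
			work++;
		}
	}
	return work;
}

/*
 * sync_inodes_sb()-like: submit, then walk *every* cached inode to see
 * whether it has IO to wait for.  This walk is the per-flush CPU cost
 * that grows with the size of the inode cache.
 */
static long submit_and_wait(struct model_inode *inodes, long n)
{
	long work = submit_dirty(inodes, n);
	long i;

	for (i = 0; i < n; i++) {
		work++;				/* the per-inode check */
		inodes[i].io_in_flight = 0;	/* "wait" for its IO */
	}
	return work;
}

int main(void)
{
	enum { CACHED_INODES = 10000, ENOSPC_FLUSHES = 25000 };
	static struct model_inode inodes[CACHED_INODES];
	long total = 0;
	int f;

	for (f = 0; f < ENOSPC_FLUSHES; f++) {
		inodes[0].dirty = 1;	/* the one file being written */
		total += submit_and_wait(inodes, CACHED_INODES);
	}

	/*
	 * ~25,000 flushes each visiting ~10,000 cached inodes on the
	 * wait side: the flush count gets multiplied by the inode cache
	 * size, which is why the same write workload slows down so much
	 * once the cache is populated.
	 */
	printf("inode visits: %ld\n", total);
	return 0;
}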