On Tue, Apr 01, 2014 at 07:20:23PM -0400, Brian Foster wrote:
> On Wed, Apr 02, 2014 at 08:19:26AM +1100, Dave Chinner wrote:
> > On Tue, Apr 01, 2014 at 09:55:18AM -0400, Brian Foster wrote:
> > Yes, it means that there is a global flush when a project quota runs
> > out of space, but it means that we only do it once and we don't burn
> > excessive CPU walking radix trees scanning inodes needlessly every
> > time we get a storm of processes hammering project quota ENOSPC.
> >
>
> It's not clear to me that excess scanning is an issue, but it seems

Have 100 threads hit ENOSPC on the same project quota at the same
time on a filesystem with a couple of thousand AGs and a few million
cached inodes, and then you'll see the problem....

> orthogonal to how we organize the enospc logic (at least once the flush
> is out of the equation). IOW, we'll invoke the scanner with the same
> frequency either way. Or maybe you are just referring to scanning
> specifically for the purpose of doing flushes as a waste..?

Well, lots of concurrent scanning for the same purpose is highly
inefficient - look at the amount of additional serialisation in the
inode reclaim walker that is aimed at reducing concurrency to one
reclaim thread per AG at a time...

I expect that if the serialisation on xfs_flush_inodes() isn't
sufficient to throttle eofblocks scanning concurrency in cases like
the above then we'll know about it pretty quickly.

> > > simplify things. The helper can run the quota scan and/or the global
> > > scan based on the data regarding the situation (i.e., the inode and
> > > associated quota characteristics). This allows us to preserve the notion
> > > of attempting a lightweight recovery first and a heavyweight recovery
> > > second, reserving the inode flush for when space is truly tight.
> > > Thoughts?
> >
> > It's still damn clunky, IMO. It's still trying to work around the
> > fact that project quotas use ENOSPC rather than EDQUOT, and that
> > makes the logic rather twisted. And it still has that hidden data
> > flush in it so it can't really be called lightweight...
> >
>
> Ok. Well if the high level logic is the issue, we could still turn that
> inside out a bit to use your EDQUOT/ENOSPC logic, yet still preserve the
> standalone eofblocks scan. It might be a few more lines, but perhaps
> more clear. E.g.:
>
>         if (ret == -EDQUOT && !enospc) {
>                 enospc = 1;
>                 xfs_inode_free_quota_eofblocks(ip);
>                 goto retry;
>         } else if (ret == -ENOSPC && !enospc) {
>                 if (!scanned) {
>                         struct xfs_eofblocks eofb = {0};
>                         ...
>                         scanned = 1;
>                         xfs_icache_free_eofblocks(ip->i_mount, &eofb);
>                         goto retry;
>                 }
>
>                 enospc = 1;
>                 xfs_flush_inodes(ip->i_mount);
>                 goto retry;
>         }
>
> Thoughts on that?

Even more convoluted. :/

Look at it this way - I've never been a fan of this "retry write on
enospc once" code. It's a necessary evil due to the reality of
locking orders and having to back out of the write far enough to
safely trigger dirty inode writeback so we *might* be able to write
this data. Making this code more finicky and tricky is the last
thing I think we should be doing.

I don't have any better ideas at this point - I'm still tired and
jetlagged and need to sort out everything for a merge window pull of
the current tree, so time for me to think up a creative solution is
limited.
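
For reference, the existing "retry once" logic in
xfs_file_buffered_aio_write() that all of the above gets bolted onto
looks roughly like this - a sketch from memory, not the exact code
in the tree, with the write call itself elided:

        write_retry:
                ret = <do the buffered write>;

                /*
                 * If the write failed with ENOSPC, write back all the
                 * dirty inodes to convert delalloc space into free
                 * space, then retry the write exactly once.
                 */
                if (ret == -ENOSPC && !enospc) {
                        enospc = 1;
                        xfs_flush_inodes(ip->i_mount);
                        goto write_retry;
                }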
I'm hoping that you might be able to think of a different way
entirely of doing this "write retry on ENOSPC" that solves all these
problems simply and easily ;)

> We'll handle project quota ENOSPC just as we handle
> global ENOSPC with respect to the scan and the flush, but we'll still
> have the opportunity to free up a decent amount of space without
> initiating I/O. The project quota scan down in free_quota_eofblocks()
> might be rendered pointless as well.

Well, therein lies an avenue of investigation: for EOF block
trimming, why would we need to do a flush first, even for project
quotas? And if we aren't going to flush data, why does it need to be
done at this level? Can it be done deep in the quota code where we
know exactly what quota ran out of space?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
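
As a very rough illustration of that last question - not a proposal,
just the shape of the idea - a trim scoped to the quota that ran out
of space might look something like the sketch below. The helper name
and the notion of calling it from the quota reservation failure path
are invented for the sake of the sketch, and the eofblocks filter
fields are from memory; only xfs_icache_free_eofblocks() and struct
xfs_eofblocks come from the discussion above. Whether it is safe to
call at that depth w.r.t. locking is exactly the open question.

        /*
         * Hypothetical: trim post-EOF blocks only from inodes owned by
         * the project that just failed its quota reservation. No data
         * flush, no global scan, and no guessing at the write level
         * about which quota ran out of space.
         */
        STATIC void
        xfs_dquot_free_prid_eofblocks(
                struct xfs_mount        *mp,
                prid_t                  prid)
        {
                struct xfs_eofblocks    eofb = {0};

                /* scope the inode cache walk to the failing project */
                eofb.eof_flags = XFS_EOF_FLAGS_PRID;
                eofb.eof_prid = prid;

                xfs_icache_free_eofblocks(mp, &eofb);
        }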