On Wed, Mar 17, 2021 at 08:49:04AM -0700, Darrick J. Wong wrote:
> On Wed, Mar 17, 2021 at 03:21:25PM +0000, Christoph Hellwig wrote:
> > On Tue, Mar 16, 2021 at 08:47:29AM -0700, Darrick J. Wong wrote:
> > > On Tue, Mar 16, 2021 at 07:27:10AM +0000, Christoph Hellwig wrote:
> > > > Still digesting this.  What trips me up a bit is the huge amount
> > > > of duplication vs. the inode reclaim mechanism.  Did you look into
> > > > sharing more code there, and if so, what speaks against that?
> > >
> > > TBH I didn't look /too/ hard, because once upon a time[1] Dave was
> > > aiming to replace the inode reclaim tagging and iteration with an
> > > LRU list walk, so I decided not to entangle the two.
> > >
> > > [1] https://lore.kernel.org/linux-xfs/20191009032124.10541-23-david@xxxxxxxxxxxxx/
> >
> > Well, it isn't just the radix tree tagging, but mostly the
> > infrastructure in iget, which seems to duplicate a lot of very
> > delicate code.
> >
> > For the actual inactivation run: why don't we queue up the inodes
> > for deactivation directly?  That is, use the work_struct in the
> > inode to queue the inode to the workqueue directly and let the
> > workqueue manage the details.  That also means we can piggyback on
> > flush_work and flush_workqueue to force one or more entries out.
> >
> > Again, I'm not saying I know this is better, but it is something
> > that comes to mind when reading the code.
>
> Hmm.  You mean reuse i_ioend_work (which maybe we should just rename
> to i_work) and queue the inodes directly into the workqueue?  I
> suppose that would mean we don't even need the radix tree tag + inode
> walk...
>
> I hadn't thought about reusing i_ioend_work, since this patchset
> predates the writeback ioend chaining.  The biggest downside I can
> see to doing it that way is that right after an rm -rf, the unbound
> gc workqueue will start hundreds of kworkers to deal with the sudden
> burst of queued work, but all those workers will end up fighting each
> other for (a) log grant space, and after that (b) the AGI buffer
> locks, and meanwhile everything else on the frontend stalls on the
> log.

Yeah, this is not a good idea. The deferred inactivation needs to
limit concurrency to at most a single work item per AG, because
otherwise it will just consume all the reservation space serialising
on the AGI locks. Even so, it can still starve the front end when the
two compete for AGI and AGF locks. Hence the background deferral is
going to have to be very careful about how it obtains and blocks on
locks... (I haven't got that far into the patchset yet.)

> The other side benefit I can think of to keeping the inactivation
> work as a per-AG item is that (at least among AGs) we can walk the
> inodes in disk order, which probably results in less seeking (note:
> I haven't studied this) and might allow us to free inode cluster
> buffers sooner in the rm -rf case.

That is very useful, because it allows the CIL to cancel the space
used to modify the inodes and the cluster buffer during the unlink,
allowing it to aggregate many more unlinks into the same checkpoint
and avoid metadata writeback partway through unlink operations. I.e.
it is very efficient in terms of journal space consumption and hence
journal IO bandwidth. (This is how we get many hundreds of thousands
of items into a single 32MB journal checkpoint...)

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
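
For illustration of the "single work item per AG" throttling discussed
above, here is a minimal sketch.  All structure, field, and function
names in it (xfs_perag_gc, pag_gc_*, inodegc_wq, xfs_inodegc_*) are
placeholders invented for the sketch, not the patchset's symbols; only
the generic kernel workqueue, list, and spinlock APIs are real.

	/*
	 * Illustrative sketch only -- not the patchset code.  One
	 * work_struct embedded in each per-AG structure bounds background
	 * inactivation to at most one worker per AG: queue_work() does
	 * nothing while that work item is already pending, and the worker
	 * drains everything queued for its AG in one pass.
	 */
	#include <linux/workqueue.h>
	#include <linux/spinlock.h>
	#include <linux/list.h>
	#include <linux/errno.h>

	static struct workqueue_struct	*inodegc_wq;	/* placeholder name */

	struct xfs_perag_gc {			/* hypothetical per-AG state */
		spinlock_t		pag_gc_lock;
		struct list_head	pag_gc_list;	/* inodes awaiting inactivation */
		struct work_struct	pag_gc_work;
	};

	static void
	xfs_inodegc_worker(struct work_struct *work)
	{
		struct xfs_perag_gc	*pag = container_of(work,
						struct xfs_perag_gc, pag_gc_work);
		LIST_HEAD(gclist);

		/* Grab everything queued so far; later arrivals requeue the work. */
		spin_lock(&pag->pag_gc_lock);
		list_splice_init(&pag->pag_gc_list, &gclist);
		spin_unlock(&pag->pag_gc_lock);

		/*
		 * Per-inode inactivation elided.  The real series can walk the
		 * AG's inodes in disk order here, so cluster buffers are freed
		 * sooner and the CIL can cancel the unlinks' log space.
		 */
	}

	static void
	xfs_inodegc_init_perag(struct xfs_perag_gc *pag)
	{
		spin_lock_init(&pag->pag_gc_lock);
		INIT_LIST_HEAD(&pag->pag_gc_list);
		INIT_WORK(&pag->pag_gc_work, xfs_inodegc_worker);
	}

	/* Defer one inode; at most one worker per AG ever runs. */
	static void
	xfs_inodegc_queue(struct xfs_perag_gc *pag, struct list_head *ip_gc_entry)
	{
		spin_lock(&pag->pag_gc_lock);
		list_add_tail(ip_gc_entry, &pag->pag_gc_list);
		spin_unlock(&pag->pag_gc_lock);

		/* Returns false if already pending -- the natural throttle. */
		queue_work(inodegc_wq, &pag->pag_gc_work);
	}

	/* Mount-time setup; flush_workqueue(inodegc_wq) forces all AGs out. */
	static int
	xfs_inodegc_init_wq(void)
	{
		inodegc_wq = alloc_workqueue("xfs-inodegc",
					     WQ_UNBOUND | WQ_FREEZABLE, 0);
		return inodegc_wq ? 0 : -ENOMEM;
	}

Because a pending work item cannot be queued twice, each AG is served by
at most one worker at a time, and flush_workqueue() still provides the
same "force everything out" hook that flush_work/flush_workqueue would
give the per-inode scheme discussed above.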