Re: [PATCH] xfs: speed up directory bestfree block scanning

Brian Foster <bfoster@xxxxxxxxxx> · Thu, 4 Jan 2018 09:04:53 -0500

On Thu, Jan 04, 2018 at 08:00:15AM +1100, Dave Chinner wrote:
> On Wed, Jan 03, 2018 at 08:41:37AM -0500, Brian Foster wrote:
> > On Wed, Jan 03, 2018 at 10:59:10PM +1100, Dave Chinner wrote:
> > > In writing this, I think I can see a quick and simple change that
> > > will fix this case and improve most other directory grow workloads
> > > without affecting normal random directory insert/remove performance.
> > > That is, do a reverse order search starting at the last block rather
> > > than increasing order search starting at the first block.....
> > > 
> > > Ok, now we are talking - performance and scalability improvements!
> > > 
> > > 		create time(sec) / rate (files/s)
> > >  File count     vanilla		    loop-fix		+reverse
> > >    10k	      0.54 / 18.5k	   0.53 / 18.9k	       0.52 / 19.3k
> > >    20k	      1.10 / 18.1k	   1.05 / 19.0k	       1.00 / 20.0k
> > >   100k	      4.21 / 23.8k	   3.91 / 25.6k	       3.58 / 27.9k
> > >   200k	      9.66 / 20.7k	   7.37 / 27.1k	       7.08 / 28.3k
> > >     1M	     86.61 / 11.5k	  48.26 / 20.7k	      38.33 / 26.1k
> > >     2M	    206.13 /  9.7k	 129.71 / 15.4k	      82.20 / 24.3k
> > >    10M	   2843.57 /  3.5k	1817.39 /  5.5k      591.78 / 16.9k
> > > 
> > > There's still some non-linearity as we approach the 10M number, but
> > > it's still 5x faster to 10M inodes than the existing code....
> > > 
> > 
> > Nice improvement.. I still need to look at the code, but a quick first
> > thought is that I wonder if there's somewhere we could stash a 'most
> > recent freeblock' once we have to grow the directory, even if just as an
> > in-core hint. Then we could jump straight to the latest block regardless
> > of the workload.
> 
> I thought about that, and then wondered where to stash it, then
> wondered whether it would miss smaller, better fitting blocks, and
> then finally realised we didn't need to have cross-operation state
> to solve the common case of growing directories.
> 

Agreed with regard to the growing directories scenario. The thought was
more about, first, avoiding any potential negative effects for other
workloads and, second, perhaps constructing a more generally applicable
optimization (cases beyond purely sequential directory grow).

> > Hmm, thinking a little more about it, that may not be worth the
> > complication since part of the concept of "search failure" in this case
> > is tied to the size of the entry we want to add. Then again, I suppose
> > such is the case when searching forward/backward as well (i.e., one
> > large insert fails, grows inode, subsequent small insert may very well
> > have succeeded with the first freeblock, though now we'd always start at
> > the recently allocated block at the end).
> 
> Right. Random hole filling in the directory shouldn't be greatly
> affected by forward or reverse search order - the eventual search
> distances are all the same. It does, OTOH, matter greatly for
> sequential inserts...
> 

Yeah, I think that's a reasonable argument for simply swapping the
search order (i.e., addresses the first concern above).
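
To make the difference concrete, here is a minimal sketch of the two scan
orders. It is not the actual XFS lookup code: the flat bests array, the
NO_BLOCK sentinel, the function names and the probe counter are all
assumptions for illustration, and a value of 0 simply stands for a full
data block.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_BLOCK ((size_t)-1)	/* hypothetical "no fit found" sentinel */

/*
 * Hypothetical flat view of the freespace index: bests[i] is the largest
 * free region (in bytes) in data block i, 0 meaning the block is full.
 * Scan in increasing order, counting how many entries were examined.
 */
static size_t scan_forward(const uint16_t *bests, size_t n,
			   uint16_t need, size_t *probes)
{
	*probes = 0;
	for (size_t i = 0; i < n; i++) {
		(*probes)++;
		if (bests[i] >= need)
			return i;
	}
	return NO_BLOCK;
}

/* Same search, but starting at the last (most recently added) block. */
static size_t scan_reverse(const uint16_t *bests, size_t n,
			   uint16_t need, size_t *probes)
{
	*probes = 0;
	for (size_t i = n; i-- > 0; ) {
		(*probes)++;
		if (bests[i] >= need)
			return i;
	}
	return NO_BLOCK;
}
```

For a sequentially grown directory, where only the last block has room,
the forward scan examines every entry while the reverse scan stops after
one. When the free space is scattered at random, neither order has an
inherent advantage.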

> After sleeping on it, I suspect that there's a simple on-disk mod to
> the dir3 header that will improve the search function for all
> workloads. The dir3 free block header:
> 
> struct xfs_dir3_free_hdr {
>         struct xfs_dir3_blk_hdr hdr;
>         __be32                  firstdb;        /* db of first entry */
>         __be32                  nvalid;         /* count of valid entries */
>         __be32                  nused;          /* count of used entries */
>         __be32                  pad;            /* 64 bit alignment */
> };
> 
> has 32 bits of padding in it, and the most entries a free block can
> have is just under 2^15. Hence we can turn that into a "bestfree"
> entry that tracks the largest freespace indexed by the block.
> 
> Then the free block scan can start by checking the required length
> against the largest freespace in the bestfree entry and skip the
> block search altogether if there isn't an indexed block with enough
> free space inside the freespace index block we are searching.
> 

So with this we'd still need to walk the range of free blocks, right?
I.e., we'd just be able to shortcut individual free block processing
(bests[] iteration) for those that obviously don't satisfy the request.
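
In other words, something like the sketch below, if I'm reading the
proposal right. This is only an illustration: struct free_blk, the
maxbest field (standing in for the repurposed pad) and find_block() are
hypothetical names and not the on-disk format or existing code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for one freespace index block. */
struct free_blk {
	uint16_t	maxbest;	/* hypothetical: largest value in bests[] */
	size_t		nvalid;		/* number of valid bests[] entries */
	const uint16_t	*bests;		/* per-data-block best free lengths */
};

/*
 * Walk every free block, but only iterate bests[] in those whose cached
 * maximum can satisfy the request. Returns the global data block index
 * of the first fit, or -1 if none. *examined counts bests[] entries read.
 */
static long find_block(const struct free_blk *fb, size_t nfb,
		       uint16_t need, size_t *examined)
{
	long base = 0;

	*examined = 0;
	for (size_t b = 0; b < nfb; b++) {
		if (fb[b].maxbest >= need) {
			for (size_t i = 0; i < fb[b].nvalid; i++) {
				(*examined)++;
				if (fb[b].bests[i] >= need)
					return base + (long)i;
			}
		}
		base += (long)fb[b].nvalid;
	}
	return -1;
}
```

The outer loop still visits each free block header; only the inner
bests[] walk is skipped.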

Assuming I follow that correctly, that sounds reasonable from the
perspective that it certainly eliminates work. I suppose how worthwhile
it is probably depends on how much of an effect it has on the higher
level directory workload. IOW, is performance measurably improved by
skipping individual free block processing or is that cost mostly
amortized by having to read/walk all free blocks in the first place?

It may very well be a worthwhile optimization. I think the larger point
is that more isolated testing is probably required to confirm. The
improvement from this patch alone doesn't necessarily translate to an
answer one way or another because this patch creates conditions that
allow us to skip most of the scan altogether (by jumping right to the
most recently added block).

> That's a lot more work than just reversing the search order, but I
> think it's a mod we should (eventually) make because it is an
> improvement for all insert workloads, not just growing.
> 
> The other (far more complex) option is to turn the freespace index
> into a btree, like we do with the hash indexes. Not sure we need to
> spend that much effort on this right now, though.
> 

*nod*

Brian

> > Anyways, just thinking out loud (and recovering from several weeks
> > vacation). :P
> 
> Welcome back :)
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html