On Thu, Jan 04, 2018 at 09:04:53AM -0500, Brian Foster wrote:
> On Thu, Jan 04, 2018 at 08:00:15AM +1100, Dave Chinner wrote:
> > On Wed, Jan 03, 2018 at 08:41:37AM -0500, Brian Foster wrote:
> > > On Wed, Jan 03, 2018 at 10:59:10PM +1100, Dave Chinner wrote:
> > > > In writing this, I think I can see a quick and simple change
> > > > that will fix this case and improve most other directory grow
> > > > workloads without affecting normal random directory
> > > > insert/remove performance. That is, do a reverse order search
> > > > starting at the last block rather than an increasing order
> > > > search starting at the first block.....
> > > >
> > > > Ok, now we're talking - performance and scalability improvements!
> > > >
> > > >                     create time(sec) / rate (files/s)
> > > > File count      vanilla         loop-fix        +reverse
> > > >    10k       0.54 / 18.5k    0.53 / 18.9k    0.52 / 19.3k
> > > >    20k       1.10 / 18.1k    1.05 / 19.0k    1.00 / 20.0k
> > > >   100k       4.21 / 23.8k    3.91 / 25.6k    3.58 / 27.9k
> > > >   200k       9.66 / 20.7k    7.37 / 27.1k    7.08 / 28.3k
> > > >     1M      86.61 / 11.5k   48.26 / 20.7k   38.33 / 26.1k
> > > >     2M     206.13 /  9.7k  129.71 / 15.4k   82.20 / 24.3k
> > > >    10M    2843.57 /  3.5k 1817.39 /  5.5k  591.78 / 16.9k
> > > >
> > > > There's still some non-linearity as we approach the 10M number,
> > > > but it's still 5x faster to 10M inodes than the existing code....
> > > >
> > >
> > > Nice improvement.. I still need to look at the code, but a quick
> > > first thought is that I wonder if there's somewhere we could stash
> > > a 'most recent freeblock' once we have to grow the directory, even
> > > if just as an in-core hint. Then we could jump straight to the
> > > latest block regardless of the workload.
.....
> > After sleeping on it, I suspect that there's a simple on-disk mod to
> > the dir3 header that will improve the search function for all
> > workloads. The dir3 free block header:
> >
> > struct xfs_dir3_free_hdr {
> >         struct xfs_dir3_blk_hdr hdr;
> >         __be32                  firstdb;        /* db of first entry */
> >         __be32                  nvalid;         /* count of valid entries */
> >         __be32                  nused;          /* count of used entries */
> >         __be32                  pad;            /* 64 bit alignment */
> > };
> >
> > has 32 bits of padding in it, and the most entries a free block can
> > have is just under 2^15. Hence we can turn that into a "bestfree"
> > entry that tracks the largest freespace indexed by the block.
> >
> > Then the free block scan can start by checking the required length
> > against the largest freespace in the bestfree entry and skip the
> > block search altogether if there isn't an indexed block with enough
> > free space inside the freespace index block we are searching.
>
> So with this we'd still need to walk the range of free blocks, right?
> I.e., we'd just be able to shortcut individual free block processing
> (bests[] iteration) for those that obviously don't satisfy the
> request.

Right. In the case of the 5M entry directory I was looking at, there
were a total of 33 free index blocks. All the overhead was in scanning
the 2016 entries in each block, not in iterating over the blocks. So
if we can skip the entry scanning in a given block, then we have a
general search improvement, i.e. it makes the free block index
structure more like a skip list than a linear array.

Also, if we keep the index into the free index block of the largest
free space we have in the header, we can jump straight to it without
needing to scan. Then when we modify the free index during the insert,
we can do the "best free" scan at that point, similar to how we keep
other bestfree indexes up to date.
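
To make that concrete, here's a rough sketch of the sort of thing I'm
thinking of. The split of the pad field and the helper name are
hypothetical, not a tested patch - a real change would also need
feature bit, verifier and log format work:

/*
 * Hypothetical sketch only: reuse the 32 bits of padding to track the
 * largest freespace indexed by this free block and the bests[] slot
 * it lives in. Both values fit in 16 bits - bests[] entries are 16
 * bit lengths, and a free block holds just under 2^15 entries.
 */
struct xfs_dir3_free_hdr {
        struct xfs_dir3_blk_hdr hdr;
        __be32                  firstdb;        /* db of first entry */
        __be32                  nvalid;         /* count of valid entries */
        __be32                  nused;          /* count of used entries */
        __be16                  bestfree;       /* largest indexed freespace */
        __be16                  bestidx;        /* bests[] slot holding it */
};

/*
 * Hypothetical helper: return the bests[] slot to try first, or -1 if
 * no data block indexed by this free block can hold an entry of the
 * given length. A -1 return lets the caller skip the per-entry scan
 * of this free block entirely and move on to the next one.
 */
static int
xfs_dir3_free_hdr_pick(
        struct xfs_dir3_free_hdr        *hdr,
        uint16_t                        length)
{
        if (be16_to_cpu(hdr->bestfree) < length)
                return -1;
        return be16_to_cpu(hdr->bestidx);
}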
> Assuming I follow that correctly, that sounds reasonable from the
> perspective that it certainly eliminates work. I suppose how
> worthwhile it is probably depends on how much of an effect it has on
> the higher level directory workload. IOW, is performance measurably
> improved by skipping individual free block processing or is that
> cost mostly amortized by having to read/walk all free blocks in the
> first place?

According to the profile, the CPU cost of the read/walk of the free
index blocks was within the noise floor. It didn't stand out as a
major contributor, but I'd need to re-measure the overhead on a random
insert/delete workload. However, my gut feel is that if we combined
reverse order searching (for sequential insert) with block skipping
(for random insert) we'll get substantial improvements across all
insert operations on large directories...

> It may very well be a worthwhile optimization. I think the larger
> point is that more isolated testing is probably required to confirm.
> The improvement from this patch alone doesn't necessarily translate
> to an answer one way or another because this patch creates conditions
> that allow us to skip most of the scan altogether (by jumping right
> to the most recently added block).

Yeah, lots of validation work, but given the overhead for a
non-trivial search distance I was measuring (>40% of CPU time @10M
inodes) I think it's worth pursuing.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx