Re: [PATCH 01/10] xfs: create simplified inode walk function

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 12 Jun 2019 08:33:41 +1000

On Mon, Jun 10, 2019 at 04:11:34PM -0700, Darrick J. Wong wrote:
> On Mon, Jun 10, 2019 at 01:55:10PM -0400, Brian Foster wrote:
> > > I could extend the comment to explain why we don't use PAGE_SIZE...
> > > 
> > 
> > Sounds good, though what I think would be better is to define a
> > IWALK_DEFAULT_RECS or some such somewhere and put the calculation
> > details with that.
> > 
> > Though now that you point out the readahead thing, aren't we at risk of
> > a similar problem for users who happen to pass a really large userspace
> > buffer? Should we cap the kernel allocation/readahead window in all
> > cases and not just the default case?
> 
> Hmm, that's right, we don't want to let userspace arbitrarily determine
> the size of the buffer, and I think the current implementation caps it
> the readahaead at ... oh, PAGE_SIZE / sizeof(xfs_inogrp_t).
> 
> Oh, right, and in the V1 patchset Dave said that we should constrain
> readahead even further.

Right, I should explain a bit further why, too - it's about
performance.  I've found that a user buffer size of ~1024 inodes is
generally enough to max out performance of bulkstat. i.e. somewhere
around 1000 inodes per syscall is enough to mostly amortise all of
the cost of syscall, setup, readahead, etc vs the CPU overhead of
copying all the inodes into the user buffer.

Once the user buffer goes over a few thousand inodes, performance
then starts to tail back off - we don't get any gains from trying to
bulkstat tens of thousands of inodes at a time, especially under
memory pressure because that can push us into readahead and buffer
cache thrashing.

> > > /*
> > >  * Note: We hardcode 4096 here (instead of, say, PAGE_SIZE) because we want to
> > >  * constrain the amount of inode readahead to 16k inodes regardless of CPU:
> > >  *
> > >  * 4096 bytes / 16 bytes per inobt record = 256 inobt records
> > >  * 256 inobt records * 64 inodes per record = 16384 inodes
> > >  * 16384 inodes * 512 bytes per inode(?) = 8MB of inode readahead
> > >  */

Hence I suspect that even this is overkill - it makes no sense to
have a huge readahead window when there has been no measurable
performance benefit to doing large inode count bulkstat syscalls.

And, FWIW, readahead probably should also be capped at what the user
buffer can hold - no point in reading 16k inodes when the output
buffer can only fit 1000 inodes...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx