On Tue, Jun 11, 2019 at 04:05:14PM -0700, Darrick J. Wong wrote:
> On Wed, Jun 12, 2019 at 08:33:41AM +1000, Dave Chinner wrote:
> > On Mon, Jun 10, 2019 at 04:11:34PM -0700, Darrick J. Wong wrote:
> > > On Mon, Jun 10, 2019 at 01:55:10PM -0400, Brian Foster wrote:
> > > > > I could extend the comment to explain why we don't use PAGE_SIZE...
> > > >
> > > > Sounds good, though what I think would be better is to define a
> > > > IWALK_DEFAULT_RECS or some such somewhere and put the calculation
> > > > details with that.
> > > >
> > > > Though now that you point out the readahead thing, aren't we at risk of
> > > > a similar problem for users who happen to pass a really large userspace
> > > > buffer? Should we cap the kernel allocation/readahead window in all
> > > > cases and not just the default case?
> > > Hmm, that's right, we don't want to let userspace arbitrarily determine
> > > the size of the buffer, and I think the current implementation caps
> > > the readahead at ... oh, PAGE_SIZE / sizeof(xfs_inogrp_t).
> > >
> > > Oh, right, and in the V1 patchset Dave said that we should constrain
> > > readahead even further.
> >
> > Right, I should explain a bit further why, too - it's about
> > performance. I've found that a user buffer size of ~1024 inodes is
> > generally enough to max out performance of bulkstat. i.e. somewhere
> > around 1000 inodes per syscall is enough to mostly amortise all of
> > the cost of syscall, setup, readahead, etc vs the CPU overhead of
> > copying all the inodes into the user buffer.
> >
> > Once the user buffer goes over a few thousand inodes, performance
> > then starts to tail back off - we don't get any gains from trying to
> > bulkstat tens of thousands of inodes at a time, especially under
> > memory pressure because that can push us into readahead and buffer
> > cache thrashing.
>
> <nod> I don't mind setting the max inobt record cache buffer size to a
> smaller value (1024 bytes == 4096 inodes readahead?) so we can get a
> little farther into future hardware scalability (or decreases in syscall
> performance :P).
>
> I guess the question here is how to relate the number of inodes the user
> asked for to how many inobt records we have to read to find that many
> allocated inodes? Or in other words, what's the average ir_freecount
> across all the inobt records?

Partial results, computed by using xfs_db to dump all the inobt records
in each filesystem and look at the average freecounts and the number of
inobt records with zero freecount:

Development workstation / and /home:
4678 total, zerofree 72.49%, avg freecount 4.24
107940 total, zerofree 94.84%, avg freecount 0.78

This laptop /, /home, and /boot:
4613 total, zerofree 80.73%, avg freecount 3.11
49660 total, zerofree 99.54%, avg freecount 0.04
10 total, zerofree 20.00%, avg freecount 27.40

Backup server / and /home:
3897 total, zerofree 22.99%, avg freecount 27.08
55995 total, zerofree 99.87%, avg freecount 0.01
(Note that the root fs is nearly out of space now, thanks journald...)

xfstests host / and $TEST_DIR:
1834 total, zerofree 76.28%, avg freecount 3.31
20097 total, zerofree 83.41%, avg freecount 3.62

The script I used to generate these reports is attached. From this
admittedly cursory output I "conclude" that bulkstat could get away with
prefetching (icount / 48) inobt records, up to a maximum of about 1000
inodes' worth of readahead. A more conservative estimate would be
(icount / 32) inobt records.
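As a rough sketch of that heuristic (the function name and constants
below are made up for illustration; this is not the actual xfs_iwalk
code, and it assumes one inobt record covers 64 inodes with ~48 of them
allocated on average, per the numbers above):

/*
 * Illustrative sketch only -- not the real xfs_iwalk code.  Caps the
 * readahead window at roughly 1k inodes' worth of inobt records.
 */
#include <stdio.h>

#define INODES_PER_INOBT_REC    64u     /* inodes covered by one inobt record */
#define AVG_ALLOC_PER_REC       48u     /* measured: ~48 allocated inodes/record */
#define MAX_PREFETCH_INODES     1024u   /* ~1k inodes is enough to saturate bulkstat */

/* How many inobt records to read ahead for a request of icount inodes. */
static unsigned int prefetch_inobt_recs(unsigned int icount)
{
        unsigned int recs = (icount + AVG_ALLOC_PER_REC - 1) / AVG_ALLOC_PER_REC;
        unsigned int cap = MAX_PREFETCH_INODES / INODES_PER_INOBT_REC;

        if (recs > cap)
                recs = cap;
        return recs ? recs : 1;
}

int main(void)
{
        unsigned int icounts[] = { 1, 100, 1024, 100000 };

        for (unsigned int i = 0; i < 4; i++)
                printf("icount %6u -> prefetch %2u inobt records\n",
                       icounts[i], prefetch_inobt_recs(icounts[i]));
        return 0;
}

The conservative (icount / 32) variant would just swap the 48 for a 32;
once the request gets large, the ~1k inode cap is what actually matters.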
--D

>
> Note that this is technically a decrease since the old code would
> reserve 16K for this purpose...
>
> > > > > /*
> > > > >  * Note: We hardcode 4096 here (instead of, say, PAGE_SIZE) because we want to
> > > > >  * constrain the amount of inode readahead to 16k inodes regardless of CPU:
> > > > >  *
> > > > >  * 4096 bytes / 16 bytes per inobt record = 256 inobt records
> > > > >  * 256 inobt records * 64 inodes per record = 16384 inodes
> > > > >  * 16384 inodes * 512 bytes per inode(?) = 8MB of inode readahead
> > > > >  */
> > Hence I suspect that even this is overkill - it makes no sense to
> > have a huge readahead window when there has been no measurable
> > performance benefit to doing large inode count bulkstat syscalls.
> >
> > And, FWIW, readahead probably should also be capped at what the user
> > buffer can hold - no point in reading 16k inodes when the output
> > buffer can only fit 1000 inodes...
> It already is -- the icount parameter from userspace is (eventually) fed
> to xfs_iwalk_set_prefetch.
>
> --D
>
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@xxxxxxxxxxxxx
Attachment: xfsinoload.sh
Description: Bourne shell script
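For reference, here is the arithmetic in the quoted "We hardcode 4096
here" comment, written out as a small standalone program. The 16-byte
record, 64 inodes per record, and 512-byte inode figures are the ones
quoted in the thread, not constants pulled from the kernel headers:

/*
 * Spell out the readahead sizing quoted above; the sizes are the ones
 * discussed in the thread, not values read from the kernel headers.
 */
#include <stdio.h>

int main(void)
{
        unsigned int cache_bytes    = 4096;     /* hardcoded inobt record cache size */
        unsigned int rec_bytes      = 16;       /* bytes per cached inobt record */
        unsigned int inodes_per_rec = 64;       /* inodes covered by one record */
        unsigned int inode_bytes    = 512;      /* assumed on-disk inode size */

        unsigned int recs     = cache_bytes / rec_bytes;        /* 256 records */
        unsigned int inodes   = recs * inodes_per_rec;          /* 16384 inodes */
        unsigned int ra_bytes = inodes * inode_bytes;           /* 8 MiB of readahead */

        printf("%u inobt records -> %u inodes -> %u bytes of inode readahead\n",
               recs, inodes, ra_bytes);
        return 0;
}

As the thread notes, the icount value from userspace is also fed to
xfs_iwalk_set_prefetch, so in practice the readahead window is bounded
by what the caller asked for as well as by this default.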