On Thu 22-02-24 12:50:32, Jan Kara wrote:
> On Thu 22-02-24 09:32:52, Oliver Sang wrote:
> > On Wed, Feb 21, 2024 at 12:14:25PM +0100, Jan Kara wrote:
> > > On Tue 20-02-24 16:25:37, kernel test robot wrote:
> > > > kernel test robot noticed a -21.4% regression of vm-scalability.throughput on:
> > > >
> > > > commit: ab4443fe3ca6298663a55c4a70efc6c3ce913ca6 ("readahead: avoid multiple marked readahead pages")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > >
> > > > testcase: vm-scalability
> > > > test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory
> > > > parameters:
> > > >
> > > > 	runtime: 300s
> > > > 	test: lru-file-readtwice
> > > > 	cpufreq_governor: performance
> > >
> > > JFYI I had a look into this. What the test seems to do is that it creates
> > > image files on tmpfs, loop-mounts XFS there, and does reads over a file on
> > > XFS. But I was not able to find out what lru-file-readtwice exactly does,
> > > nor was I able to reproduce it, because I got stuck on some missing Ruby
> > > dependencies on my test system yesterday.
> >
> > what's your OS?
>
> I have SLES15-SP4 installed in my VM. What was missing was the 'git' rubygem,
> which apparently is not packaged at all, and when I manually installed it I
> was still hitting other problems, so I instead checked the vm-scalability
> source and wrote my own reproducer based on that.
>
> I'm now able to reproduce the regression in my VM, so I'm investigating...

So I was experimenting with this. What the test does is create as many files
as there are CPUs, sized so that their total size is 8x the amount of
available RAM. For each file, two tasks are started which sequentially read
the file from start to end. A trivial repro from my VM with 8 CPUs and 64GB
of RAM looks like:

truncate -s 60000000000 /dev/shm/xfsimg
mkfs.xfs /dev/shm/xfsimg
mount -t xfs -o loop /dev/shm/xfsimg /mnt
for (( i = 0; i < 8; i++ )); do truncate -s 60000000000 /mnt/sparse-file-$i; done
echo "Ready..."
sleep 3
echo "Running..."
for (( i = 0; i < 8; i++ )); do
	dd bs=4k if=/mnt/sparse-file-$i of=/dev/null &
	dd bs=4k if=/mnt/sparse-file-$i of=/dev/null &
done 2>&1 | grep "copied"
wait
umount /mnt

The difference between slow and fast runs seems to be in the amount of pages
reclaimed with direct reclaim: after commit ab4443fe3c we reclaim about 10%
of pages with direct reclaim, whereas before commit ab4443fe3c only about 1%
of pages are reclaimed with direct reclaim (the split can be read from the
pgsteal_* counters in /proc/vmstat; a sketch follows below). In both cases
we reclaim the same number of pages, corresponding to the total size of the
files, so it is not the case that we would be reading any page twice.

I suspect the reclaim difference is because after commit ab4443fe3c we
trigger readahead somewhat earlier, so our effective working set is somewhat
larger. This apparently gives kswapd a harder time and we end up doing
direct reclaim more often.

Since this is a case of heavy overload on the system, I don't think the
throughput here matters that much, and AFAICT the readahead code does
nothing wrong here. So I don't think we need to do anything about this.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
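
A minimal sketch for measuring the direct vs. kswapd reclaim split discussed
above, assuming a recent kernel that exposes the pgsteal_direct and
pgsteal_kswapd counters in /proc/vmstat:

# Snapshot the reclaim counters before the run.
before_direct=$(awk '/^pgsteal_direct /{print $2}' /proc/vmstat)
before_kswapd=$(awk '/^pgsteal_kswapd /{print $2}' /proc/vmstat)
# ... run the reproducer above ...
# Snapshot them again after the run and compute the deltas.
after_direct=$(awk '/^pgsteal_direct /{print $2}' /proc/vmstat)
after_kswapd=$(awk '/^pgsteal_kswapd /{print $2}' /proc/vmstat)
direct=$((after_direct - before_direct))
kswapd=$((after_kswapd - before_kswapd))
echo "direct: $direct pages, kswapd: $kswapd pages"
echo "direct share: $((100 * direct / (direct + kswapd)))%"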