On 4/14/23 21:52, Bernd Schubert wrote:
On 4/14/23 17:05, Bernd Schubert wrote:
On 4/14/23 06:07, Matthew Wilcox wrote:
On Thu, Apr 13, 2023 at 11:33:09PM +0200, Bernd Schubert wrote:
Sorry, I forgot to add Andrew and linux-mm to CC.
On 4/13/23 23:27, Bernd Schubert wrote:
Hello,
I found some weird mmap read behavior while benchmarking the fuse-uring
patches.
I have not verified it yet, but it does not look fuse specific.
Basically, I started to investigate because the fio results were much
lower than expected (better with the new code, though).
fio cmd line:
fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse
--directory=/scratch/dest/ --rw=read multi-file.fio
bernd@squeeze1 test2>cat multi-file.fio
[global]
group_reporting
bs=1M
runtime=300
[test]
This sequential fio run sets POSIX_MADV_SEQUENTIAL and then does memcpy
beginning at offset 0, in 1MB steps (verified with additional logging
in fio's engines/mmap.c).
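In other words, roughly this access pattern (a simplified sketch of
what it boils down to, not fio's actual engines/mmap.c code):

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const size_t bs = 1024 * 1024;	/* bs=1M from the job file */
	struct stat st;
	void *map;
	char *buf;
	off_t off;
	int fd;

	if (argc < 2)
		return 1;

	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;

	buf = malloc(bs);
	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (!buf || map == MAP_FAILED)
		return 1;

	/* fio's mmap engine advises sequential access on the mapping */
	posix_madvise(map, st.st_size, POSIX_MADV_SEQUENTIAL);

	/* copy out in 1MB steps, starting at offset 0 */
	for (off = 0; off + (off_t)bs <= st.st_size; off += bs)
		memcpy(buf, (char *)map + off, bs);

	munmap(map, st.st_size);
	free(buf);
	close(fd);
	return 0;
}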
Additional logging in fuse_readahead() gives:
[ 1396.215084] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64
index=0
[ 1396.237466] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64
index=255
[ 1396.263175] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
index=254
[ 1396.282055] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
index=253
... <count is always 1 page>
[ 1496.353745] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
index=64
[ 1496.381105] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64
index=511
[ 1496.397487] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
index=510
[ 1496.416385] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
index=509
... <count is always 1 page>
Logging in do_sync_mmap_readahead()
[ 1493.130764] do_sync_mmap_readahead:3015 ino=132 index=0 count=0
ras_start=0 ras_size=0 ras_async=0 ras_ra_pages=64 ras_mmap_miss=0
ras_prev_pos=-1
[ 1493.147173] do_sync_mmap_readahead:3015 ino=132 index=255 count=0
ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
ras_prev_pos=-1
[ 1493.165952] do_sync_mmap_readahead:3015 ino=132 index=254 count=0
ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
ras_prev_pos=-1
[ 1493.185566] do_sync_mmap_readahead:3015 ino=132 index=253 count=0
ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
ras_prev_pos=-1
...
[ 1496.341890] do_sync_mmap_readahead:3015 ino=132 index=64 count=0
ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
ras_prev_pos=-1
[ 1496.361385] do_sync_mmap_readahead:3015 ino=132 index=511 count=0
ras_start=96 ras_size=64 ras_async=64 ras_ra_pages=64 ras_mmap_miss=0
ras_prev_pos=-1
So we can see from fuse that it starts to read at page index 0 and
wants 64 pages (which is actually double the bdi read_ahead_kb), then
skips indexes 64...254 and jumps straight to index 255. For the
mmap'ed memcpy these pages are missing, so it then goes back in 1-page
steps to fetch them.
One workaround here is to set the bdi's read_ahead_kb to a larger
value; another (untested) might be to increase the read-ahead window.
Either of these only works around the index order above.
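(For reference, the first workaround amounts to something like the
below; "0:123" is just a placeholder for the actual MAJOR:MINOR of the
fuse mount's bdi, i.e. the same as echoing a value into
/sys/class/bdi/<bdi>/read_ahead_kb.)

#include <stdio.h>

int main(void)
{
	/* placeholder path; the real fuse bdi has its own MAJOR:MINOR */
	FILE *f = fopen("/sys/class/bdi/0:123/read_ahead_kb", "w");

	if (!f)
		return 1;

	/* e.g. bump from the 128kB default to 1024kB */
	fprintf(f, "1024\n");
	fclose(f);
	return 0;
}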
I understand that read-ahead gets limited by the bdi value (although
it is exceeded above), but why does it go back in 1-page steps? My
expectation would have been:
index=0 count=32 (128kB read-ahead)
index=32 count=32
index=64 count=32
What I see with XFS is:
fio-27518 [005] ..... 276.565025:
mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x23a8c ofs=0 order=2
...
mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15e40 ofs=786432
order=6
... it then gets "stuck" at order-6, which is expected for a 256kB
readahead window (an order-6 folio is 2^6 = 64 pages of 4kB = 256kB).
This is produced by:
echo 1
>/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
fio --size=1G --numjobs=1 --ioengine=mmap
--output-format=normal,terse --directory=/mnt/scratch/ --rw=read
multi-file.fio
echo 0
>/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
less /sys/kernel/tracing/trace
Thanks for looking at it, Matthew!
I see the same on xfs as on fuse - the same output as I initially
posted (except the fuse logging, of course).
With tracing enabled and the file size reduced to 2M:
...
fio-3459 [011] ..... 65055.436534:
mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12cfb5
ofs=1052672 order=0
And then it is stuck at order=0
When I reduce bs to 4K I get similar results to yours:
fio --size=10M --numjobs=1 --ioengine=mmap --output-format=normal,terse
--directory=/scratch/source/ --rw=read multi-file.fio --bs=4K
--group_reporting
(so bs set to 4K)
and a basically empty job file
bernd@squeeze1 test2>cat multi-file.fio
[global]
[test]
Up to bs=512K it works fine; 1M (and, for that matter, already 768K)
introduces the order=0 issue.
Hmm, I replaced memcpy with a dumb version that copies byte by byte -
problem gone. Is it possible that the optimized memcpy causes some
kind of random memory access and confuses mm / readahead? And somehow
your memcpy or system is not doing that?
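(Roughly this kind of byte-by-byte loop - a stand-in sketch, not the
exact code I dropped into fio:)

#include <stddef.h>

/* plain byte-by-byte copy, no vectorized or overlapping loads */
static void *dumb_memcpy(void *dst, const void *src, size_t len)
{
	char *d = dst;
	const char *s = src;

	while (len--)
		*d++ = *s++;

	return dst;
}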
Thanks,
Bernd