Re: sequential 1MB mmap read ends in 1 page sync read-ahead

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 4/14/23 06:07, Matthew Wilcox wrote:
On Thu, Apr 13, 2023 at 11:33:09PM +0200, Bernd Schubert wrote:
Sorry, forgot to add Andrew and linux-mm into CC.

On 4/13/23 23:27, Bernd Schubert wrote:
Hello,

I found a weird mmap read behavior while benchmarking the fuse-uring
patches.
I did not verify yet, but it does not look fuse specific.
Basically, I started to check because fio results were much lower
than expected (better with the new code, though)

fio cmd line:
fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse
--directory=/scratch/dest/ --rw=read multi-file.fio


bernd@squeeze1 test2>cat multi-file.fio
[global]
group_reporting
bs=1M
runtime=300

[test]

This sequential fio sets POSIX_MADV_SEQUENTIAL and then does memcpy
beginning at offset 0 in 1MB steps (verified with additional
logging in fios engines/mmap.c).

And additional log in fuse_readahead() gives

[ 1396.215084] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64
index=0
[ 1396.237466] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64
index=255
[ 1396.263175] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
index=254
[ 1396.282055] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
index=253
... <count is always 1 page>
[ 1496.353745] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
index=64
[ 1496.381105] fuse: 000000003fdec504 inode=00000000be0f29d3 count=64
index=511
[ 1496.397487] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
index=510
[ 1496.416385] fuse: 000000003fdec504 inode=00000000be0f29d3 count=1
index=509
... <count is always 1 page>

Logging in do_sync_mmap_readahead()

[ 1493.130764] do_sync_mmap_readahead:3015 ino=132 index=0 count=0
ras_start=0 ras_size=0 ras_async=0 ras_ra_pages=64 ras_mmap_miss=0
ras_prev_pos=-1
[ 1493.147173] do_sync_mmap_readahead:3015 ino=132 index=255 count=0
ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
ras_prev_pos=-1
[ 1493.165952] do_sync_mmap_readahead:3015 ino=132 index=254 count=0
ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
ras_prev_pos=-1
[ 1493.185566] do_sync_mmap_readahead:3015 ino=132 index=253 count=0
ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
ras_prev_pos=-1
...
[ 1496.341890] do_sync_mmap_readahead:3015 ino=132 index=64 count=0
ras_start=0 ras_size=64 ras_async=32 ras_ra_pages=64 ras_mmap_miss=0
ras_prev_pos=-1
[ 1496.361385] do_sync_mmap_readahead:3015 ino=132 index=511 count=0
ras_start=96 ras_size=64 ras_async=64 ras_ra_pages=64 ras_mmap_miss=0
ras_prev_pos=-1


So we can see from fuse that it starts to read at page index 0, wants
64 pages (which is actually the double of bdi read_ahead_kb), then
skips index 64...254) and immediately goes to index 255. For the mmaped
memcpy pages are missing and then it goes back in 1 page steps to get
these.

A workaround here is to set read_ahead_kb in the bdi to a larger
value, another workaround might be (untested) to increase the read-ahead
window. Either of these two seem to be workarounds for the index order
above.

I understand that read-ahead gets limited by the bdi value (although
exceeded above), but why does it go back in 1 page steps? My expectation
would have been

index=0  count=32 (128kb read-head)
index=32 count=32
index=64 count=32

What I see with XFS is:

              fio-27518   [005] .....   276.565025: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x23a8c ofs=0 order=2
              fio-27518   [005] .....   276.565035: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x19868 ofs=16384 order=2
              fio-27518   [005] .....   276.565036: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x237fc ofs=32768 order=2
              fio-27518   [005] .....   276.565038: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x27518 ofs=49152 order=2
              fio-27518   [005] .....   276.565039: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x14c7c ofs=65536 order=2
              fio-27518   [005] .....   276.565040: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x14338 ofs=81920 order=2
              fio-27518   [005] .....   276.565041: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x201fc ofs=98304 order=2
              fio-27518   [005] .....   276.565042: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x1fb98 ofs=114688 order=2
              fio-27518   [005] .....   276.565044: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x14510 ofs=131072 order=2
              fio-27518   [005] .....   276.565045: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x1e88c ofs=147456 order=2
              fio-27518   [005] .....   276.565046: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x26f00 ofs=163840 order=2

...

  dev 8:32 ino 44 pfn=0x14f30 ofs=262144 order=4
              fio-27518   [005] .....   276.567718: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x145a0 ofs=327680 order=4
              fio-27518   [005] .....   276.567720: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15730 ofs=393216 order=4
              fio-27518   [005] .....   276.567722: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15e30 ofs=458752 order=4
              fio-27518   [005] .....   276.567942: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x18b00 ofs=524288 order=6
              fio-27518   [005] .....   276.569982: mm_filemap_add_to_page_cache: dev 8:32 ino 44 pfn=0x15e40 ofs=786432 order=6

... it then gets "stuck" at order-6, which is expected for a 256kB
readahead window.

This is produced by:

echo 1 >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
fio --size=1G --numjobs=1 --ioengine=mmap --output-format=normal,terse --directory=/mnt/scratch/ --rw=read multi-file.fio
echo 0 >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
less /sys/kernel/tracing/trace


Thanks for looking at it Matthew!

I see the same as on fuse on xfs - same output as I initially
posted (except the fuse logging, of course).

With tracing and reduced file size to 2M

squeeze1:~# grep fio /sys/kernel/tracing/trace
             fio-3459    [018] ..... 65055.425435: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c6b0 ofs=0 order=2
             fio-3459    [018] ..... 65055.425456: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c6ac ofs=16384 order=2
             fio-3459    [018] ..... 65055.425464: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c6a8 ofs=32768 order=2
             fio-3459    [018] ..... 65055.425472: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c6a4 ofs=49152 order=2
             fio-3459    [018] ..... 65055.425480: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c6a0 ofs=65536 order=2
             fio-3459    [018] ..... 65055.425489: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c69c ofs=81920 order=2
             fio-3459    [018] ..... 65055.425497: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c698 ofs=98304 order=2
             fio-3459    [018] ..... 65055.425505: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c694 ofs=114688 order=2
             fio-3459    [018] ..... 65055.425513: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c690 ofs=131072 order=2
             fio-3459    [018] ..... 65055.425521: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c68c ofs=147456 order=2
             fio-3459    [018] ..... 65055.425529: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c688 ofs=163840 order=2
             fio-3459    [018] ..... 65055.425538: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c684 ofs=180224 order=2
             fio-3459    [018] ..... 65055.425545: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c680 ofs=196608 order=2
             fio-3459    [018] ..... 65055.425553: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c83c ofs=212992 order=2
             fio-3459    [018] ..... 65055.425561: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c838 ofs=229376 order=2
             fio-3459    [018] ..... 65055.425569: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c834 ofs=245760 order=2
             fio-3459    [011] ..... 65055.436500: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12cfb8 ofs=1044480 order=0
             fio-3459    [011] ..... 65055.436530: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12cfb9 ofs=1048576 order=0
             fio-3459    [011] ..... 65055.436534: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12cfb5 ofs=1052672 order=0

And then it is stuck at order=0

             fio-3459    [002] ..... 65060.353420: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12afc2 ofs=1306624 order=0

Not any different with increase file size to 10M.

squeeze1:~# cat /proc/self/mountinfo |grep "252:16"
88 28 252:16 / /scratch rw,relatime shared:44 - xfs /dev/vdb rw,attr2,inode64,logbufs=8,logbsize=32k,noquota


squeeze1:~# cat /sys/class/bdi/252\:16/read_ahead_kb
128



With increased RA to 1024kb

             fio-3568    [013] ..... 66481.269207: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12b8a0 ofs=0 order=2
             fio-3568    [013] ..... 66481.269224: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12d5dc ofs=16384 order=2
             fio-3568    [013] ..... 66481.269233: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12d8dc ofs=32768 order=2
             fio-3568    [013] ..... 66481.269242: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12aae4 ofs=49152 order=2
             fio-3568    [013] ..... 66481.269252: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12abb0 ofs=65536 order=2
...
             fio-3568    [013] ..... 66481.276736: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12da00 ofs=2097152 order=4
             fio-3568    [013] ..... 66481.276758: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12d4f0 ofs=2162688 order=4
             fio-3568    [013] ..... 66481.276785: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12d910 ofs=2228224 order=4
             fio-3568    [013] ..... 66481.276812: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12c3c0 ofs=2293760 order=4
             fio-3568    [013] ..... 66481.278920: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12d340 ofs=4194304 order=6
             fio-3568    [013] ..... 66481.279033: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12db80 ofs=4456448 order=6
             fio-3568    [013] ..... 66481.279183: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12dbc0 ofs=4718592 order=6
             fio-3568    [013] ..... 66481.279302: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e000 ofs=4980736 order=6
             fio-3568    [013] ..... 66481.279422: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e040 ofs=5242880 order=6
             fio-3568    [013] ..... 66481.279521: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e080 ofs=5505024 order=6
             fio-3568    [013] ..... 66481.279938: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e0c0 ofs=5767168 order=6
             fio-3568    [013] ..... 66481.280038: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e100 ofs=6029312 order=6
             fio-3568    [013] ..... 66481.282445: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e200 ofs=6291456 order=8
             fio-3568    [013] ..... 66481.283986: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12e300 ofs=7340032 order=8
             fio-3568    [013] ..... 66481.288688: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12dc00 ofs=8388608 order=9
             fio-3568    [013] ..... 66481.291731: mm_filemap_add_to_page_cache: dev 252:16 ino 84 pfn=0x12de00 ofs=10485760 order=9


Thanks,
Bernd



[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux