On 2022/06/30 15:29, Dominique MARTINET wrote:
Hi Willy, linux-btrfs@vger,
Matthew Wilcox (Oracle) wrote on Sun, Jun 19, 2022 at 04:11:41PM +0100:
We had an off-by-one error which meant that we never marked the first page
in a read as accessed. This was visible as a slowdown when re-reading
a file as pages were being evicted from cache too soon. In reviewing
this code, we noticed a second bug where a multi-page folio would be
marked as accessed multiple times when doing reads that were less than
the size of the folio.
While debugging an unrelated issue (short reads on btrfs with io_uring
and O_DIRECT [1]), I noticed that copying my horrible big file fell
from ~2GB/s (there's compression and lots of zeroes) to ~100MB/s the
second time I copied it with cp.
Hi,
With this patch, activate_page() will be called the second time the
page is marked accessed, which adds some overhead; however, 2GB/s ->
100MB/s is insane. I'm not sure how this is possible, but it seems to
have something to do with this change. (Note that it was problematic
that the page would never be marked accessed before this patch.)
BTW, during my test, the speed of buffered reads on ext4 only fell a
little.
Thanks,
Kuai
I've taken a moment to bisect this and it came down to this patch.
[1] https://lore.kernel.org/all/YrrFGO4A1jS0GI0G@xxxxxxxxxxxxxxxxx/T/#u
Dropping caches (echo 3 > /proc/sys/vm/drop_caches) restores the speed,
so there appears to be some bad effect from having the file in cache
for fiemap?
To be fair that file is pretty horrible:
---
# compsize bigfile
Processed 1 file, 194955 regular extents (199583 refs), 0 inline.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 15% 3.7G 23G 23G
none 100% 477M 477M 514M
zstd 14% 3.2G 23G 23G
---
Here's what perf has to say about it on top of this patch when running
`cp bigfile /dev/null` the first time:
98.97% 0.00% cp [kernel.kallsyms] [k]
entry_SYSCALL_64_after_hwframe
entry_SYSCALL_64_after_hwframe
do_syscall_64
- 93.40% ksys_read
- 93.36% vfs_read
- 93.25% new_sync_read
- 93.20% filemap_read
- 83.38% filemap_get_pages
- 82.76% page_cache_ra_unbounded
+ 59.72% folio_alloc
+ 13.43% read_pages
+ 8.75% filemap_add_folio
0.64% xa_load
0.52% filemap_get_read_batch
+ 8.75% copy_page_to_iter
- 4.73% __x64_sys_ioctl
- 4.72% do_vfs_ioctl
- btrfs_fiemap
- 4.70% extent_fiemap
+ 3.95% btrfs_check_shared
+ 0.70% get_extent_skip_holes
and second time:
99.90% 0.00% cp [kernel.kallsyms] [k]
entry_SYSCALL_64_after_hwframe
entry_SYSCALL_64_after_hwframe
do_syscall_64
- 94.62% __x64_sys_ioctl
do_vfs_ioctl
btrfs_fiemap
- extent_fiemap
- 50.01% get_extent_skip_holes
- 50.00% btrfs_get_extent_fiemap
- 49.97% count_range_bits
rb_next
+ 28.72% lock_extent_bits
+ 15.55% __clear_extent_bit
- 5.21% ksys_read
+ 5.21% vfs_read
(if this isn't readable, 95% of the time is spent on fiemap the second
time around)
I've also been observing RCU stalls on my laptop with the same workload
(cp to /dev/null), but unfortunately I could not reproduce them in qemu,
so I could not take traces to confirm they are caused by the same
commit; given the workload, though, I'd say it likely is.
I can rebuild a kernel for my laptop and confirm if you think it should
be something else.
I didn't look at the patch itself (yet) so I have no suggestion at this
point - it's plausible the patch fixed something and just exposed slow
code that had been there all along, so it might be better to look at
the btrfs side first, I don't know.
If you don't manage to reproduce I'll be happy to test anything thrown
at me at the very least.
Thanks,