Hi,

On 2025-01-27 15:09:23 +0100, David Hildenbrand wrote:
> Hmmm ... do we really want to make refcounting more complicated, and more
> importantly, hugetlb-refcounting more special ?! :)

I don't know the answer to that - I mainly wanted to report the issue
because it was pretty nasty to debug and initially surprising (to me).


> If the workload doing a lot of single-page try_grab_folio_fast(), could it
> do so on a larger area (multiple pages at once -> single refcount update)?

In the original case where I hit this (a VM with 10 PCIe 3 NVMes JBODed
together), the IO size averaged something like ~240kB (mostly 256kB, with
some smaller ones thrown in). Increasing the IO size beyond that starts to
hurt latency and thus requires even deeper IO queues...

Unfortunately I don't have access to hardware performance counters for the
VMs with those disks :(.


> Maybe there is a link to the report you could share, thanks.

Here is a profile of the "original" case where I hit this, without the
patch that Willy linked to. Note this is a profile *not* using hardware
perf counters and thus likely to be rather skewed:

https://gist.github.com/anarazel/304aa6b81d05feb3f4990b467d02dabc

(this was on Debian Sid's 6.12.6)

Without the patch I achieved ~18GB/s with 1GB pages and ~35GB/s with 2MB
pages. After applying the patch to add an unlocked already-dirty check to
bio_set_pages_dirty(), performance improves to ~20GB/s when using 1GB
pages.

A differential profile comparing 2MB and 1GB pages with the patch applied
(again, without hardware perf counters):

https://gist.github.com/anarazel/f993c238ea7d2c34f44440336d90ad8f

Willy then asked me for a perf annotate of where in gup_fast_fallback() the
time is spent. I didn't have access to the VM at that point and tried to
reproduce the problem on local hardware. As I don't have quite enough IO
throughput available locally, I couldn't reproduce it quite as easily. But
after lowering the average IO size (which is not unrealistic, far from
every workload is just a bulk sequential scan), it showed up when using
just two PCIe 4 NVMe SSDs.

Here are profiles of the 2MB and 1GB cases, with the bio_set_pages_dirty()
patch applied:

https://gist.github.com/anarazel/f0d0a884c55ee18851dc9f15f03f7583

2MB pages get ~12.5GB/s, 1GB pages ~7GB/s, with a *lot* of variance. This
time it's actual hardware perf counters...

Relevant details about the c2c report, excerpted from IRC:

andres | willy: Looking at the c2c report in a bit more detail, it looks
         like the dirtying is due to folio->_pincount and folio->_refcount
         in about equal measure, plus folio->flags, all being modified in
         gup_fast_fallback(). The modifications then, unsurprisingly, cause
         a lot of cache misses for reads (like in bio_set_pages_dirty() and
         bio_check_pages_dirty()).
 willy | andres: that makes perfect sense, thanks
 willy | really, the only way to fix that is to split it up
 willy | and either we can split it per-cpu or per-physical-address-range
andres | willy: Yea, that's probably the only fundamental fix. I guess
         there might be some around-the-edges improvements by colocating
         the write-heavy data on a separate cache line from flags and
         whatever is at 0x8, which are read more often than written. But I
         really don't know enough about how all this is used.
 willy | 0x8 is compound_head which is definitely read more often than
         written

Greetings,

Andres Freund
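
PS: In case it helps anyone following along, the unlocked already-dirty
check mentioned above is conceptually something like the following - a
rough sketch of the idea from memory, not the exact patch Willy posted:

void bio_set_pages_dirty(struct bio *bio)
{
	struct folio_iter fi;

	bio_for_each_folio_all(fi, bio) {
		/*
		 * Check the dirty flag without taking the folio lock, so
		 * that folios which are already dirty (the common case with
		 * huge pages, where many bios land in the same folio) don't
		 * get their flags cache line written to again.
		 */
		if (folio_test_dirty(fi.folio))
			continue;
		folio_lock(fi.folio);
		folio_mark_dirty(fi.folio);
		folio_unlock(fi.folio);
	}
}

With 1GB pages nearly every bio of a sequential read hits the same folio,
so after the first bio the check turns a lock-and-write into a plain read,
which is presumably where the ~18GB/s -> ~20GB/s improvement above comes
from.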