Hi,

On 2025-01-27 15:09:23 +0100, David Hildenbrand wrote:
> Hmmm ... do we really want to make refcounting more complicated, and more
> importantly, hugetlb-refcounting more special ?! :)

I don't know the answer to that - I mainly wanted to report the issue
because it was pretty nasty to debug and initially surprising (to me).


> If the workload doing a lot of single-page try_grab_folio_fast(), could it
> do so on a larger area (multiple pages at once -> single refcount update)?

In the original case where I hit this (a VM with 10 PCIe 3 NVMes JBODed
together), the IO size averaged something like ~240kB (mostly 256kB, with
some smaller ones thrown in). Increasing the IO size beyond that starts to
hurt latency and thus requires even deeper IO queues...

Unfortunately I don't have access to hardware performance counters for the
VMs with those disks :(.


> Maybe there is a link to the report you could share, thanks.

Here is a profile of the "original" case where I hit this, without the
patch that Willy linked to. Note this is a profile *not* using hardware
perf counters and thus likely to be rather skewed:

https://gist.github.com/anarazel/304aa6b81d05feb3f4990b467d02dabc

(this was on Debian Sid's 6.12.6)

Without the patch I achieved ~18GB/s with 1GB pages and ~35GB/s with 2MB
pages. After applying the patch to add an unlocked already-dirty check to
bio_set_pages_dirty(), performance improves to ~20GB/s when using 1GB
pages.

A differential profile comparing 2MB and 1GB pages with the patch applied
(again, without hardware perf counters):

https://gist.github.com/anarazel/f993c238ea7d2c34f44440336d90ad8f

Willy then asked me for a perf annotate of where in gup_fast_fallback() the
time is spent. I didn't have access to the VM at that point and tried to
reproduce the problem on local hardware. As I don't have quite enough IO
throughput available locally, I couldn't reproduce it quite as easily. But
after lowering the average IO size (which is not unrealistic, far from
every workload is just a bulk sequential scan), it showed up when using
just two PCIe 4 NVMe SSDs.

Here are profiles of the 2MB and 1GB cases, with the bio_set_pages_dirty()
patch applied:

https://gist.github.com/anarazel/f0d0a884c55ee18851dc9f15f03f7583

2MB pages get ~12.5GB/s, 1GB pages ~7GB/s, with a *lot* of variance. This
time it's actual hardware perf counters...

Relevant details about the c2c report, excerpted from IRC:

andres | willy: Looking at the c2c report in a bit more detail, it looks
         like the dirtying is due to folio->_pincount and folio->_refcount
         in about equal measure, plus folio->flags, all being modified in
         gup_fast_fallback(). The modifications then, unsurprisingly, cause
         a lot of cache misses for reads (like in bio_set_pages_dirty() and
         bio_check_pages_dirty()).
 willy | andres: that makes perfect sense, thanks
 willy | really, the only way to fix that is to split it up
 willy | and either we can split it per-cpu or per-physical-address-range
andres | willy: Yea, that's probably the only fundamental fix. I guess
         there might be some around-the-edges improvements by colocating
         the write-heavy data on a separate cache line from flags and
         whatever is at 0x8, which are read more often than written. But I
         really don't know enough about how all this is used.
 willy | 0x8 is compound_head which is definitely read more often than
         written

Greetings,

Andres Freund
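
PS: In case it helps anyone following along, the unlocked already-dirty
check mentioned above is conceptually something like the following - a
rough sketch of the idea from memory, not the exact patch Willy posted:

void bio_set_pages_dirty(struct bio *bio)
{
	struct folio_iter fi;

	bio_for_each_folio_all(fi, bio) {
		/*
		 * Check the dirty flag without taking the folio lock, so
		 * that folios which are already dirty (the common case with
		 * huge pages, where many bios land in the same folio) don't
		 * get their flags cache line written to again.
		 */
		if (folio_test_dirty(fi.folio))
			continue;
		folio_lock(fi.folio);
		folio_mark_dirty(fi.folio);
		folio_unlock(fi.folio);
	}
}

With 1GB pages nearly every bio of a sequential read hits the same folio,
so after the first bio the check turns a lock-and-write into a plain read,
which is presumably where the ~18GB/s -> ~20GB/s improvement above comes
from.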