On Mon, Dec 2, 2024 at 10:37 AM Bharata B Rao <bharata@xxxxxxx> wrote:
>
> On 28-Nov-24 10:01 AM, Mateusz Guzik wrote:
> >
> > Willy mentioned the folio wait queue hash table could be grown; you
> > can find it in mm/filemap.c (around line 1062):
> >
> > #define PAGE_WAIT_TABLE_BITS 8
> > #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
> > static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;
> >
> > static wait_queue_head_t *folio_waitqueue(struct folio *folio)
> > {
> >         return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)];
> > }
> >
> > Can you collect off-CPU time? offcputime-bpfcc -K > /tmp/out
>
> The flamegraph for "perf record --off-cpu -F 99 -a -g --all-kernel
> --kernel-callchains -- sleep 120" is attached.
>
> Off-CPU samples were collected for 120s, starting at around the 45th
> minute of the FIO benchmark, which runs for 1 hour in total. This run
> was on a kernel that had your inode_lock fix but no changes to
> PAGE_WAIT_TABLE_BITS.
>
> Hopefully this captures a representative sample of the folio lock
> scalability issue.

I'm not familiar with the --off-cpu option; fwiw, it does not look like
any of that time actually got graphed. The tool I know to work is
offcputime-bpfcc.

Regardless, per your own graph, over half of the *on*-CPU time is spent
spinning on the folio hash table locks.

If bumping the table size (an untested sketch is at the bottom of this
mail) does not resolve the problem, contention most likely shifts to
something else, so what we need next is profiling data from that state.

--
Mateusz Guzik <mjguzik gmail.com>
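
For reference, the bump under discussion amounts to a one-liner. This is
an untested sketch; 12 bits (4096 buckets, up from 256) is an arbitrary
pick for illustration, the right value would have to be found
empirically:

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1062,2 +1062,2 @@
-#define PAGE_WAIT_TABLE_BITS 8
+#define PAGE_WAIT_TABLE_BITS 12
 #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)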
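
And for collecting the off-CPU data from that state, something along
these lines should do (assumes bcc and Brendan Gregg's FlameGraph
scripts are installed; the duration and paths are illustrative):

  # 120 seconds of kernel-only off-CPU stacks, folded for flamegraph.pl
  offcputime-bpfcc -K -f 120 > /tmp/offcpu.folded
  # render an off-CPU flamegraph; counts are in microseconds
  ./FlameGraph/flamegraph.pl --color=io --countname=us \
          /tmp/offcpu.folded > /tmp/offcpu.svg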