* Nicholas Piggin <npiggin@xxxxxxxxx> wrote:

> The page waitqueue hash is a bit small (256 entries) on very big systems. A
> 16 socket 1536 thread POWER9 system was found to encounter hash collisions
> and excessive time in waitqueue locking at times. This was intermittent and
> hard to reproduce easily with the setup we had (very little real IO
> capacity). The theory is that sometimes (depending on allocation luck)
> important pages would happen to collide a lot in the hash, slowing down page
> locking, causing the problem to snowball.
>
> A small test case was made where threads would write and fsync different
> pages, generating just a small amount of contention across many pages.
>
> Increasing page waitqueue hash size to 262144 entries increased throughput
> by 182% while also reducing standard deviation 3x. perf before the increase:
>
>   36.23%  [k] _raw_spin_lock_irqsave                -      -
>           |
>           |--34.60%--wake_up_page_bit
>           |          0
>           |          iomap_write_end.isra.38
>           |          iomap_write_actor
>           |          iomap_apply
>           |          iomap_file_buffered_write
>           |          xfs_file_buffered_aio_write
>           |          new_sync_write
>
>   17.93%  [k] native_queued_spin_lock_slowpath      -      -
>           |
>           |--16.74%--_raw_spin_lock_irqsave
>           |          |
>           |           --16.44%--wake_up_page_bit
>           |                     iomap_write_end.isra.38
>           |                     iomap_write_actor
>           |                     iomap_apply
>           |                     iomap_file_buffered_write
>           |                     xfs_file_buffered_aio_write
>
> This patch uses alloc_large_system_hash to allocate a bigger system hash
> that scales somewhat with memory size. The bit/var wait-queue is also
> changed to keep code matching, albeit with a smaller scale factor.
>
> A very small CONFIG_BASE_SMALL option is also added because these are two
> of the biggest static objects in the image on very small systems.
>
> This hash could be made per-node, which may help reduce remote accesses
> on well localised workloads, but that adds some complexity with indexing
> and hotplug, so until we get a less artificial workload to test with,
> keep it simple.
>
> Signed-off-by: Nicholas Piggin <npiggin@xxxxxxxxx>
> ---
>  kernel/sched/wait_bit.c | 30 +++++++++++++++++++++++-------
>  mm/filemap.c            | 24 +++++++++++++++++++++---
>  2 files changed, 44 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
> index 02ce292b9bc0..dba73dec17c4 100644
> --- a/kernel/sched/wait_bit.c
> +++ b/kernel/sched/wait_bit.c
> @@ -2,19 +2,24 @@
>  /*
>   * The implementation of the wait_bit*() and related waiting APIs:
>   */
> +#include <linux/memblock.h>
>  #include "sched.h"
>
> -#define WAIT_TABLE_BITS 8
> -#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)

Ugh, 256 entries is almost embarrassingly small indeed.

I've put your patch into sched/core, unless Andrew is objecting.

> -	for (i = 0; i < WAIT_TABLE_SIZE; i++)
> +	if (!CONFIG_BASE_SMALL) {
> +		bit_wait_table = alloc_large_system_hash("bit waitqueue hash",
> +							sizeof(wait_queue_head_t),
> +							0,
> +							22,
> +							0,
> +							&bit_wait_table_bits,
> +							NULL,
> +							0,
> +							0);
> +	}
> +	for (i = 0; i < BIT_WAIT_TABLE_SIZE; i++)
>  		init_waitqueue_head(bit_wait_table + i);

Meta suggestion: maybe the CONFIG_BASE_SMALL ugliness could be folded into
alloc_large_system_hash() itself?

> --- a/mm/filemap.c
> +++ b/mm/filemap.c
>
>  static wait_queue_head_t *page_waitqueue(struct page *page)
>  {
> -	return &page_wait_table[hash_ptr(page, PAGE_WAIT_TABLE_BITS)];
> +	return &page_wait_table[hash_ptr(page, page_wait_table_bits)];
>  }

I'm wondering whether you've tried to make this NUMA aware through
page->node?

Seems like another useful step when having a global hash ...

Thanks,

	Ingo
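
For readers following the collision argument above, here is a minimal
stand-alone sketch (plain user-space C, not the kernel's code) of the
multiplicative hashing that page_waitqueue() relies on via hash_ptr().
The golden-ratio constant matches the kernel's hash_64(), but the
simulated struct page base address, the 64-byte stride and the page
counts below are invented purely for illustration:

/*
 * Stand-alone user-space sketch -- NOT kernel code.  It mimics the shape
 * of the kernel's hash_64()/hash_ptr() (multiply by the 64-bit
 * golden-ratio constant, keep the top 'bits' bits) and compares how deep
 * the buckets get when a set of simulated struct page addresses is folded
 * into a 2^8-entry table (the old WAIT_TABLE_BITS) versus a 2^18-entry
 * one (262144 entries, the size quoted in the changelog).  The base
 * address and the 64-byte stride are made up.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define GOLDEN_RATIO_64 0x61C8864680B583EBull

static unsigned int hash_val(uint64_t val, unsigned int bits)
{
	/* Keep the top 'bits' bits of the product, like hash_64(). */
	return (unsigned int)((val * GOLDEN_RATIO_64) >> (64 - bits));
}

static unsigned long max_bucket_depth(unsigned int bits, unsigned long npages)
{
	unsigned long size = 1UL << bits;
	unsigned long *bucket = calloc(size, sizeof(*bucket));
	unsigned long i, max = 0;
	uint64_t page_base = 0xffff000100000000ull;	/* invented base */

	for (i = 0; i < npages; i++) {
		/* Pretend struct page is 64 bytes and pages are contiguous. */
		unsigned int idx = hash_val(page_base + i * 64, bits);

		if (++bucket[idx] > max)
			max = bucket[idx];
	}
	free(bucket);
	return max;
}

int main(void)
{
	unsigned long hot_pages = 16384;	/* arbitrary contended pages */

	printf(" 8-bit table: worst bucket holds %lu pages\n",
	       max_bucket_depth(8, hot_pages));
	printf("18-bit table: worst bucket holds %lu pages\n",
	       max_bucket_depth(18, hot_pages));
	return 0;
}

Compiled with something like "cc -O2 sketch.c", this should show the
worst-case bucket depth dropping from dozens of simulated pages in the
256-entry table to one or two in the 262144-entry table -- roughly the
collision pressure the changelog describes.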