Reviewed-by: Mikulas Patocka <mpatocka@xxxxxxxxxx>

On Sun, 8 Aug 2021, Arne Welzel wrote:

> On many-core systems using dm-crypt, heavy spinlock contention in
> percpu_counter_compare() can be observed when the dmcrypt page allocation
> limit for a given device is reached or close to being reached. This is due
> to percpu_counter_compare() taking a spinlock to compute an exact
> result on potentially many CPUs at the same time.
>
> Switch to non-exact comparison of allocated and allowed pages by using
> the value returned by percpu_counter_read_positive().
>
> This may over/under estimate the actual number of allocated pages by at
> most (batch-1) * num_online_cpus() (assuming my understanding of the
> percpu_counter logic is proper).
>
> Currently, batch is bounded by 32. The system on which this issue was
> first observed has 256 CPUs and 512G of RAM. With a 4k page size, this
> change may over/under estimate by 31MB. With ~10G (2%) allowed for dmcrypt
> allocations, this seems an acceptable error. Certainly preferred over
> running into the spinlock contention.
>
> This behavior was separately/artificially reproduced on an EC2 c5.24xlarge
> instance system with 96 CPUs and 192GB RAM as follows, but can be
> provoked on systems with fewer available CPUs.
>
>  * Disable swap
>  * Tune vm settings to promote regular writeback
>      $ echo 50 > /proc/sys/vm/dirty_expire_centisecs
>      $ echo 25 > /proc/sys/vm/dirty_writeback_centisecs
>      $ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
>
>  * Create 8 dmcrypt devices based on files on a tmpfs
>  * Create and mount an ext4 filesystem on each crypt device
>  * Run stress-ng --hdd 8 within one of the above filesystems
>
> Total %system usage shown via sysstat goes to ~35%, write throughput on the
> underlying loop device is ~2GB/s.
> perf profiling an individual kworker kcryptd thread shows the following,
> indicating that it hits heavy spinlock contention in
> percpu_counter_compare():
>
>     99.98%     0.00%  kworker/u193:46  [kernel.kallsyms]  [k] ret_from_fork
>       |
>       ---ret_from_fork
>          kthread
>          worker_thread
>          |
>           --99.92%--process_one_work
>             |
>             |--80.52%--kcryptd_crypt
>             |  |
>             |  |--62.58%--mempool_alloc
>             |  |  |
>             |  |   --62.24%--crypt_page_alloc
>             |  |     |
>             |  |      --61.51%--__percpu_counter_compare
>             |  |        |
>             |  |         --61.34%--__percpu_counter_sum
>             |  |           |
>             |  |           |--58.68%--_raw_spin_lock_irqsave
>             |  |           |  |
>             |  |           |   --58.30%--native_queued_spin_lock_slowpath
>             |  |           |
>             |  |            --0.69%--cpumask_next
>             |  |              |
>             |  |               --0.51%--_find_next_bit
>             |  |
>             |  |--10.61%--crypt_convert
>             |  |  |
>             |  |  |--6.05%--xts_crypt
>     ...
>
> After applying this change, %system usage is lowered to ~7% and
> write throughput on the underlying loop device increases to 2.7GB/s.
> The profile shows mempool_alloc() at ~8% rather than ~62% and no longer
> hitting the percpu_counter spinlock.
>
>     |--8.15%--mempool_alloc
>     |  |
>     |  |--3.93%--crypt_page_alloc
>     |  |  |
>     |  |   --3.75%--__alloc_pages
>     |  |     |
>     |  |      --3.62%--get_page_from_freelist
>     |  |        |
>     |  |         --3.22%--rmqueue_bulk
>     |  |           |
>     |  |            --2.59%--_raw_spin_lock
>     |  |
>     |  |               --2.57%--native_queued_spin_lock_slowpath
>     |  |
>     |   --3.05%--_raw_spin_lock_irqsave
>     |     |
>     |      --2.49%--native_queued_spin_lock_slowpath
>
> Suggested-by: DJ Gregor <dj@xxxxxxxxxxxxx>
> Signed-off-by: Arne Welzel <arne.welzel@xxxxxxxxxxxxx>
> ---
>  drivers/md/dm-crypt.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 50f4cbd600d5..2ae481610f12 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -2661,7 +2661,12 @@ static void *crypt_page_alloc(gfp_t gfp_mask, void *pool_data)
>  	struct crypt_config *cc = pool_data;
>  	struct page *page;
>  
> -	if (unlikely(percpu_counter_compare(&cc->n_allocated_pages, dm_crypt_pages_per_client) >= 0) &&
> +	/*
> +	 * Note, percpu_counter_read_positive() may over (and under) estimate
> +	 * the current usage by at most (batch - 1) * num_online_cpus() pages,
> +	 * but avoids potential spinlock contention of an exact result.
> +	 */
> +	if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) > dm_crypt_pages_per_client) &&
>  	    likely(gfp_mask & __GFP_NORETRY))
>  		return NULL;
>  
> -- 
> 2.20.1
>

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/dm-devel