Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible

Yosry Ahmed <yosry.ahmed@xxxxxxxxx> · Wed, 12 Feb 2025 15:35:59 +0000

On Wed, Feb 12, 2025 at 02:00:26PM +0900, Sergey Senozhatsky wrote:
> On (25/02/07 21:09), Yosry Ahmed wrote:
> > Can we do some perf testing to make sure this custom locking is not
> > regressing performance (selfishly I'd like some zswap testing too)?
> 
> So for zsmalloc I (usually) write some simple testing code which is
> triggered via sysfs (device attr) and that is completely reproducible,
> so that I compares apples to apples.  In this particular case I just
> have a loop that creates objects (we don't need to compress or decompress
> anything, zsmalloc doesn't really care)
> 
> -	echo 1 > /sys/ ... / test_prepare
> 
> 	for (sz = 32; sz < PAGE_SIZE; sz += 64) {
> 		for (i = 0; i < 4096; i++) {
> 			ent->handle = zs_malloc(zram->mem_pool, sz)
> 			list_add(ent)
> 		}
> 	}
> 
> 
> And now I just `perf stat` writes:
> 
> -	perf stat echo 1 > /sys/ ... / test_exec_old
> 
> 	list_for_each_entry
> 		zs_map_object(ent->handle, ZS_MM_RO);
> 		zs_unmap_object(ent->handle)
> 
> 	list_for_each_entry
> 		dst = zs_map_object(ent->handle, ZS_MM_WO);
> 		memcpy(dst, tmpbuf, ent->sz)
> 		zs_unmap_object(ent->handle)
> 
> 
> 
> -	perf stat echo 1 > /sys/ ... / test_exec_new
> 
> 	list_for_each_entry
> 		dst = zs_obj_read_begin(ent->handle, loc);
> 		zs_obj_read_end(ent->handle, dst);
> 
> 	list_for_each_entry
> 		zs_obj_write(ent->handle, tmpbuf, ent->sz);
> 
> 
> -	echo 1 > /sys/ ... / test_finish
> 
> 	free all handles and ent-s
> 
> 
> The nice part is that we don't depend on any of the upper layers, we
> don't even need to compress/decompress anything; we allocate objects
> of required sizes and memcpy static data there (zsmalloc doesn't have
> any opinion on that) and that's pretty much it.
> 
> 
> OLD API
> =======
> 
> 10 runs
> 
>        369,205,778      instructions                     #    0.80  insn per cycle            
>         40,467,926      branches                         #  113.732 M/sec                     
> 
>        369,002,122      instructions                     #    0.62  insn per cycle            
>         40,426,145      branches                         #  189.361 M/sec                     
> 
>        369,051,170      instructions                     #    0.45  insn per cycle            
>         40,434,677      branches                         #  157.574 M/sec                     
> 
>        369,014,522      instructions                     #    0.63  insn per cycle            
>         40,427,754      branches                         #  201.464 M/sec                     
> 
>        369,019,179      instructions                     #    0.64  insn per cycle            
>         40,429,327      branches                         #  198.321 M/sec                     
> 
>        368,973,095      instructions                     #    0.64  insn per cycle            
>         40,419,245      branches                         #  234.210 M/sec                     
> 
>        368,950,705      instructions                     #    0.64  insn per cycle            
>         40,414,305      branches                         #  231.460 M/sec                     
> 
>        369,041,288      instructions                     #    0.46  insn per cycle            
>         40,432,599      branches                         #  155.576 M/sec                     
> 
>        368,964,080      instructions                     #    0.67  insn per cycle            
>         40,417,025      branches                         #  245.665 M/sec                     
> 
>        369,036,706      instructions                     #    0.63  insn per cycle            
>         40,430,860      branches                         #  204.105 M/sec                     
> 
> 
> NEW API
> =======
> 
> 10 runs
> 
>        265,799,293      instructions                     #    0.51  insn per cycle            
>         29,834,567      branches                         #  170.281 M/sec                     
> 
>        265,765,970      instructions                     #    0.55  insn per cycle            
>         29,829,019      branches                         #  161.602 M/sec                     
> 
>        265,764,702      instructions                     #    0.51  insn per cycle            
>         29,828,015      branches                         #  189.677 M/sec                     
> 
>        265,836,506      instructions                     #    0.38  insn per cycle            
>         29,840,650      branches                         #  124.237 M/sec                     
> 
>        265,836,061      instructions                     #    0.36  insn per cycle            
>         29,842,285      branches                         #  137.670 M/sec                     
> 
>        265,887,080      instructions                     #    0.37  insn per cycle            
>         29,852,881      branches                         #  126.060 M/sec                     
> 
>        265,769,869      instructions                     #    0.57  insn per cycle            
>         29,829,873      branches                         #  210.157 M/sec                     
> 
>        265,803,732      instructions                     #    0.58  insn per cycle            
>         29,835,391      branches                         #  186.940 M/sec                     
> 
>        265,766,624      instructions                     #    0.58  insn per cycle            
>         29,827,537      branches                         #  212.609 M/sec                     
> 
>        265,843,597      instructions                     #    0.57  insn per cycle            
>         29,843,650      branches                         #  171.877 M/sec                     
> 
> 
> x old-api-insn
> + new-api-insn
> +-------------------------------------------------------------------------------------+
> |+                                                                                   x|
> |+                                                                                   x|
> |+                                                                                   x|
> |+                                                                                   x|
> |+                                                                                   x|
> |+                                                                                   x|
> |+                                                                                   x|
> |+                                                                                   x|
> |+                                                                                   x|
> |+                                                                                   x|
> |A                                                                                   A|
> +-------------------------------------------------------------------------------------+
>     N           Min           Max        Median           Avg        Stddev
> x  10  3.689507e+08 3.6920578e+08 3.6901918e+08 3.6902586e+08     71765.519
> +  10  2.657647e+08 2.6588708e+08 2.6580373e+08 2.6580734e+08     42187.024
> Difference at 95.0% confidence
> 	-1.03219e+08 +/- 55308.7
> 	-27.9705% +/- 0.0149878%
> 	(Student's t, pooled s = 58864.4)

Thanks for sharing these results, but I wonder if this will capture
regressions from locking changes (e.g. a lock being preemtible)? IIUC
this is counting the instructions executed in these paths, and that
won't change if the task gets preempted. Lock contention may be captured
as extra instructions, but I am not sure we'll directly see its effect
in terms of serialization and delays.

I think we also need some high level testing (e.g. concurrent
swapins/swapouts) to find that out. I think that's what Kairui's testing
covers.