* Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:

> On a 2-socket Cascade Lake test machine, the time to complete the
> workload is as follows;
>
>                                            6.6.0-rc2              6.6.0-rc2
>                                  sched-numabtrace-v1 sched-numabselective-v1
> Min       elsp-NUMA01_THREADLOCAL   174.22 (   0.00%)      117.64 (  32.48%)
> Amean     elsp-NUMA01_THREADLOCAL   175.68 (   0.00%)      123.34 *  29.79%*
> Stddev    elsp-NUMA01_THREADLOCAL     1.20 (   0.00%)        4.06 (-238.20%)
> CoeffVar  elsp-NUMA01_THREADLOCAL     0.68 (   0.00%)        3.29 (-381.70%)
> Max       elsp-NUMA01_THREADLOCAL   177.18 (   0.00%)      128.03 (  27.74%)
>
> The time to complete the workload is reduced by almost 30%
>
>                             6.6.0-rc2              6.6.0-rc2
>                   sched-numabtrace-v1 sched-numabselective-v1
> Duration User        91201.80               63506.64
> Duration System       2015.53                1819.78
> Duration Elapsed      1234.77                 868.37
>
> In this specific case, system CPU time was not increased but it's not
> universally true.
>
> From vmstat, the NUMA scanning and fault activity is as follows;
>
>                                        6.6.0-rc2              6.6.0-rc2
>                              sched-numabtrace-v1 sched-numabselective-v1
> Ops NUMA base-page range updates    64272.00            26374386.00
> Ops NUMA PTE updates                36624.00               55538.00
> Ops NUMA PMD updates                   54.00               51404.00
> Ops NUMA hint faults                15504.00               75786.00
> Ops NUMA hint local faults %        14860.00               56763.00
> Ops NUMA hint local percent            95.85                  74.90
> Ops NUMA pages migrated              1629.00             6469222.00
>
> Both the number of PTE updates and hint faults is dramatically
> increased. While this is superficially unfortunate, it represents
> ranges that were simply skipped without the patch. As a result
> of the scanning and hinting faults, many more pages were also
> migrated but as the time to completion is reduced, the overhead
> is offset by the gain.

Nice! I've applied your series to tip:sched/core with a few
non-functional edits to comment/changelog formatting/clarity.

Btw., was any previous analysis done on the size of the pids_active[]
hash and the hash collision rate?
64 (BITS_PER_LONG) feels a bit small, especially on larger machines
running threaded workloads, and the kmalloc() of numab_state likely
allocates a full cacheline anyway, so we could double the hash size
from 16 bytes (2x1 longs) to 32 bytes (2x2 longs) with very little
real cost, and still have a long field left to spare?

Thanks,

	Ingo