On Tue, Sep 11, 2018 at 01:36:07PM +0800, Aaron Lu wrote:
> Daniel Jordan and others proposed an innovative technique to make
> multiple threads concurrently use list_del() at any position of the
> list and list_add() at head position of the list without taking a lock
> in this year's MM summit[0].
>
> People think this technique may be useful to improve zone lock
> scalability so here is my try.

Nice, this uses the smp_list_* functions well in spite of the limitations you
encountered with them here.

> Performance wise on 56 cores/112 threads Intel Skylake 2 sockets server
> using will-it-scale/page_fault1 process mode(higher is better):
>
> kernel        performance      zone lock contention
> patch1         9219349         76.99%
> patch7         2461133 -73.3%  54.46%(another 34.66% on smp_list_add())
> patch8        11712766 +27.0%  68.14%
> patch9        11386980 +23.5%  67.18%

Is "zone lock contention" the percentage that readers and writers combined
spent waiting?  I'm curious to see read and write wait time broken out, since
it seems there are writers (very likely on the allocation side) spoiling the
parallelism we get with the read lock.

If the contention is from allocation, I wonder whether it's feasible to make
that path use SMP list functions.  Something like smp_list_cut_position
combined with using page clusters from [*] to cut off a chunk of list.  Many
details to keep in mind there, though, like having to unset PageBuddy in that
list chunk when other tasks can be concurrently merging pages that are part of
it.

Or maybe what's needed is a more scalable data structure than an array of
lists, since contention on the heads seems to be the limiting factor.  A simple
list that keeps the pages in most-recently-used order (except when adding to
the list tail) is good for cache warmth, but I wonder how helpful that is when
all CPUs can allocate from the front.  Having multiple places to put pages of a
given order/mt would ease the contention.
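
As a rough illustration of the "multiple places to put pages of a given
order/mt" idea, here is a minimal userspace sketch, not kernel code.
Everything in it is an assumption made for illustration: the names
(page_stub, free_shard, sharded_free_area), the shard count, the
test-and-set lock per shard, and selecting a shard by CPU id.  It ignores
buddy merging entirely, which would be the hard part once a page's buddy
can sit in a different shard.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SHARDS 4                  /* hypothetical; not from the series */

struct page_stub {                   /* stand-in for struct page */
	struct page_stub *next;
};

struct free_shard {
	atomic_bool locked;          /* simple test-and-set spinlock */
	struct page_stub *head;      /* LIFO: most recently freed page first */
	unsigned long nr_free;
};

struct sharded_free_area {           /* one per order/migratetype */
	struct free_shard shard[NR_SHARDS];
};

static void shard_lock(struct free_shard *s)
{
	while (atomic_exchange_explicit(&s->locked, true, memory_order_acquire))
		;                    /* spin until the previous value was false */
}

static void shard_unlock(struct free_shard *s)
{
	atomic_store_explicit(&s->locked, false, memory_order_release);
}

static void sharded_init(struct sharded_free_area *a)
{
	for (int i = 0; i < NR_SHARDS; i++) {
		atomic_init(&a->shard[i].locked, false);
		a->shard[i].head = NULL;
		a->shard[i].nr_free = 0;
	}
}

/* Free: push onto the shard selected by the caller's CPU id. */
static void sharded_free(struct sharded_free_area *a, unsigned int cpu,
			 struct page_stub *page)
{
	struct free_shard *s = &a->shard[cpu % NR_SHARDS];

	shard_lock(s);
	page->next = s->head;
	s->head = page;
	s->nr_free++;
	shard_unlock(s);
}

/* Alloc: try the local shard first, then walk the others in order. */
static struct page_stub *sharded_alloc(struct sharded_free_area *a,
				       unsigned int cpu)
{
	for (unsigned int i = 0; i < NR_SHARDS; i++) {
		struct free_shard *s = &a->shard[(cpu + i) % NR_SHARDS];
		struct page_stub *page;

		shard_lock(s);
		page = s->head;
		if (page) {
			s->head = page->next;
			s->nr_free--;
		}
		shard_unlock(s);
		if (page)
			return page;
	}
	return NULL;                 /* every shard was empty */
}
```

With this layout, CPUs that map to different shards touch different list
heads on both free and alloc, so they only share a cacheline when a local
shard runs dry and the allocator has to take from a neighbour.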

> Though lock contention reduced a lot for patch7, the performance dropped
> considerably due to severe cache bouncing on free list head among
> multiple threads doing page free at the same time, because every page free
> will need to add the page to the free list head.

Could be beneficial to take an MCS-style approach in smp_list_splice/add so
that multiple waiters aren't bouncing the same cacheline around.  This is
something I'm planning to try on lru_lock.

Daniel

[*] https://lkml.kernel.org/r/20180509085450.3524-1-aaron.lu@xxxxxxxxx
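
For readers unfamiliar with the MCS-style approach mentioned above, the
following is a minimal userspace sketch of a generic MCS queue lock using
C11 atomics.  It is not the kernel's implementation and not code from this
series; the names mcs_node, mcs_lock, mcs_acquire and mcs_release are
chosen for illustration, and how such queueing would be wired into
smp_list_splice/add is left open.  The point it shows is that each waiter
spins on a flag in its own node, so the shared tail pointer is touched only
once per acquire.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
	_Atomic(struct mcs_node *) next;   /* successor in the wait queue */
	atomic_bool locked;                /* true while this waiter must spin */
};

struct mcs_lock {
	_Atomic(struct mcs_node *) tail;   /* NULL when the lock is free */
};

static void mcs_lock_init(struct mcs_lock *l)
{
	atomic_init(&l->tail, NULL);
}

static void mcs_acquire(struct mcs_lock *l, struct mcs_node *me)
{
	struct mcs_node *prev;

	atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
	atomic_store_explicit(&me->locked, true, memory_order_relaxed);

	/* Join the queue; this exchange is the only write to shared state. */
	prev = atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
	if (!prev)
		return;                    /* queue was empty: lock acquired */

	/* Link behind the predecessor, then spin on our own node only. */
	atomic_store_explicit(&prev->next, me, memory_order_release);
	while (atomic_load_explicit(&me->locked, memory_order_acquire))
		;
}

static void mcs_release(struct mcs_lock *l, struct mcs_node *me)
{
	struct mcs_node *next =
		atomic_load_explicit(&me->next, memory_order_acquire);

	if (!next) {
		struct mcs_node *expected = me;

		/* No visible successor: try to mark the lock free. */
		if (atomic_compare_exchange_strong_explicit(&l->tail,
				&expected, NULL,
				memory_order_acq_rel, memory_order_acquire))
			return;

		/* A waiter swapped the tail but has not linked in yet. */
		while (!(next = atomic_load_explicit(&me->next,
						     memory_order_acquire)))
			;
	}

	/* Hand the lock to the successor by clearing its private flag. */
	atomic_store_explicit(&next->locked, false, memory_order_release);
}
```

Each caller passes its own mcs_node (typically on the stack or per-CPU),
which is what keeps the spinning local instead of on one shared cacheline.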