On Wed, Jul 07, 2021 at 11:40:45AM -0700, Linus Torvalds wrote: > On Wed, Jul 7, 2021 at 6:00 AM Dennis Zhou <dennis@xxxxxxxxxx> wrote: > > > > This is just a single change to fix percpu depopulation. The code relied > > on depopulation code written specifically for the free path and relied > > on vmalloc to do the tlb flush lazily. As we're modifying the backing > > pages during the lifetime of a chunk, we need to also flush the tlb > > accordingly. > > I pulled this, but I ended up unpulling after looking at the fix. > > The fix may be perfectly correct, but I'm looking at that > pcpu_reclaim_populated() function, and I want somebody to explain to > me what it's ok to drop and re-take the 'pcpu_lock' and just continue. > > Because whatever it was protecting is now not protected any more. > > It *looks* like it's intended to protect the pcpu_chunk_lists[] > content, and some other functions that do this look ok. So for > example, pcpu_balance_free() at least removes the 'chunk' from the > pcpu_chunk_lists[] before it drops the lock and then works on the > chunk contents. > > But pcpu_reclaim_populated() seems to *leave* chunk on the > pcpu_chunk_lists[], drop the lock, and then continue to use 'chunk'. > > That odd "release lock and continue to use the data it's supposed to > protect" seems to be pre-existing, but > > (a) this is the code that caused problems to begin with > > and > > (b) it seems to now happen even more. > > So maybe this code is right. But it looks very odd to me, and I'd like > to get more explanations of _why_ it would be ok before I pull this > fix, since there seems to be a deeper underlying problem in the code > that this tries to fix. > Yeah I agree. I think I've inadvertently made this more complex than it needs to be and intend to clean it up. Percpu has a mutex lock and spinlock. The mutex lock is responsible for lifetime of a chunk and correctness of the backing pages, chunk->populated[]. The spinlock protects the other members of the chunk as well as pcpu_chunk_lists[]. The pcpu_balance_workfn() is used to serialize the freeing of chunks and maintenance of a floating # of pages to suffice atomic allocations. This is where depopulation is being bolted onto. It uses the serialization of: mutex_lock(&pcpu_alloc_mutex); pcpu_balance_free() pcpu_reclaim_populated() pcpu_balance_populated() mutex_unlock(&pcpu_alloc_mutex); while holding the mutex lock to know that the chunk we are modifying will not be freed. It's this that makes it okay to just pick up the lock and continue. Depopulation adds complexity to the lifecycle of a chunk through others freeing back percpu allocations. So, unlike pcpu_balance_free(), taking the chunk off of pcpu_chunk_lists[] doesn't guarantee someone else isn't accessing the chunk because they are freeing an object back. To handle this, a notion of isolated chunks is introduced and these chunks become effectively hidden to the allocation path. We are also holding the mutex lock throughout this process which prevents the allocation path from simultaenously adding backing pages to a chunk. I hope this explanation makes sense. Thanks, Dennis