On Tue, Feb 04, 2025 at 03:59:42PM +0900, Sergey Senozhatsky wrote:
> On (25/02/03 21:11), Yosry Ahmed wrote:
> > > > We also lose some debugging capabilities as Hilf pointed out in another
> > > > patch.
> > >
> > > So that zspage lock should have not been a lock, I think, it's a ref-counter
> > > and it's being used as one
> > >
> > > map()
> > > {
> > >         page->users++;
> > > }
> > >
> > > unmap()
> > > {
> > >         page->users--;
> > > }
> > >
> > > migrate()
> > > {
> > >         if (!page->users)
> > >                 migrate_page();
> > > }
> >
> > Hmm, but in this case we want migration to block new map/unmap
> > operations. So a vanilla refcount won't work.
>
> Yeah, correct - migration needs negative values so that map would
> wait until it's positive (or zero).
>
> > > Just my 2c.
> > >
> > > Perhaps we can sprinkle some lockdep on it. For instance:
> >
> > Honestly this looks like more reason to use existing lock primitives to
> > me. What are the candidates? I assume rw_semaphore, anything else?
>
> Right, rwsem "was" the first choice.
>
> > I guess the main reason you didn't use a rw_semaphore is the extra
> > memory usage.
>
> sizeof(struct zs_page) change is one thing. Another thing is that
> zspage->lock is taken from atomic sections, pretty much everywhere.
> compaction/migration write-lock it under pool rwlock and class spinlock,
> but both compaction and migration now EAGAIN if the lock is locked
> already, so that is sorted out.
>
> The remaining problem is map(), which takes zspage read-lock under pool
> rwlock. RFC series (which you hated with passion :P) converted all zsmalloc
> into preemptible ones because of this - zspage->lock is a nested leaf-lock,
> so it cannot schedule unless locks it's nested under permit it (needless to
> say neither rwlock nor spinlock permit it).

Hmm, so we want the lock to be preemptible, but we don't want to use an
existing preemptible lock because it may be held from atomic context.

I think one problem here is that the lock you are introducing is a
spinning lock, but the lock holder can be preempted. This is why
spinning locks do not allow preemption: other waiters can end up
spinning on a process that has been scheduled out. For example, the
compaction/migration code could be sleeping while holding the write
lock, and a map() call would spin waiting for that sleeping task.

I wonder if there's a way to rework the locking instead to avoid the
nesting. It seems like sometimes we lock the zspage with the pool lock
held, sometimes with the class lock held, and sometimes with no lock
held. What are the rules here for acquiring the zspage lock? Do we need
to hold another lock just to make sure the zspage does not go away from
under us? Can we use RCU or something similar to do that instead?

> > Seems like it uses ~32 bytes more than rwlock_t on x86_64.
> > That's per zspage. Depending on how many compressed pages we have
> > per-zspage this may not be too bad.
>
> So on a 16GB laptop our memory pressure test at peak used approx 1M zspages.
> That is 32 bytes * 1M ~ 32MB of extra memory use. Not alarmingly a lot,
> less than what a single browser tab needs nowadays. I suppose on 4GB/8GB
> that will be even smaller (because those device generate less zspages).
> Numbers are not the main issue, however.
>
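
FWIW, to make the discussion concrete, here is a rough sketch of the
counter-style zspage lock being discussed. This is illustrative only,
not the actual zsmalloc code; the type, function, and macro names are
made up. Positive values are readers (map), a negative value marks the
writer (migration/compaction), and the reader path spins, which is
exactly the problem when the write holder can be preempted or sleep:

	#include <linux/atomic.h>
	#include <linux/processor.h>	/* cpu_relax() */

	#define ZSPL_UNLOCKED	0
	#define ZSPL_WRLOCKED	-1	/* negative value == writer */

	struct zspage_lock {
		atomic_t state;		/* ZSPL_UNLOCKED initially */
	};

	/* map(): wait for the writer to finish, then add a reader. */
	static void zspl_read_lock(struct zspage_lock *zl)
	{
		int old = atomic_read(&zl->state);

		for (;;) {
			if (old == ZSPL_WRLOCKED) {
				/* spins even if the write holder is preempted or asleep */
				cpu_relax();
				old = atomic_read(&zl->state);
				continue;
			}
			/* go from 'old' readers to 'old + 1' readers */
			if (atomic_try_cmpxchg(&zl->state, &old, old + 1))
				return;
			/* atomic_try_cmpxchg() updated 'old' on failure; retry */
		}
	}

	static void zspl_read_unlock(struct zspage_lock *zl)
	{
		atomic_dec(&zl->state);
	}

	/* migration/compaction: never wait, caller returns -EAGAIN on failure. */
	static bool zspl_try_write_lock(struct zspage_lock *zl)
	{
		int old = ZSPL_UNLOCKED;

		/* succeeds only when there are no readers and no writer */
		return atomic_try_cmpxchg(&zl->state, &old, ZSPL_WRLOCKED);
	}

	static void zspl_write_unlock(struct zspage_lock *zl)
	{
		atomic_set(&zl->state, ZSPL_UNLOCKED);
	}

With something along these lines, zspl_read_lock() is precisely the
spin-on-a-sleeping-writer scenario described above, which is why an
existing primitive (e.g. rwsem, if the atomic-context callers can be
sorted out) or a reworked locking scheme seems preferable to a
hand-rolled hybrid.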