On Wed, Feb 05, 2025 at 11:43:16AM +0900, Sergey Senozhatsky wrote:
> On (25/02/04 17:19), Yosry Ahmed wrote:
> > > sizeof(struct zs_page) change is one thing. Another thing is that
> > > zspage->lock is taken from atomic sections, pretty much everywhere.
> > > compaction/migration write-lock it under pool rwlock and class spinlock,
> > > but both compaction and migration now EAGAIN if the lock is locked
> > > already, so that is sorted out.
> > >
> > > The remaining problem is map(), which takes zspage read-lock under pool
> > > rwlock. RFC series (which you hated with passion :P) converted all zsmalloc
> > > locks into preemptible ones because of this - zspage->lock is a nested
> > > leaf-lock, so it cannot schedule unless the locks it's nested under permit
> > > it (needless to say neither rwlock nor spinlock permit it).
> >
> > Hmm, so we want the lock to be preemptible, but we don't want to use an
> > existing preemptible lock because it may be held from atomic context.
> >
> > I think one problem here is that the lock you are introducing is a
> > spinning lock but the lock holder can be preempted. This is why spinning
> > locks do not allow preemption. Others waiting for the lock can spin
> > waiting for a process that is scheduled out.
> >
> > For example, the compaction/migration code could be sleeping holding the
> > write lock, and a map() call would spin waiting for that sleeping task.
>
> write-lock holders cannot sleep, that's the key part.
>
> So the rules are:
>
> 1) writer cannot sleep
>    - migration/compaction runs in atomic context and grabs
>      write-lock only from atomic context
>    - write-locking function disables preemption before lock(), just to be
>      safe, and enables it after unlock()
>
> 2) writer does not spin waiting
>    - that's why there is only write_try_lock function
>    - compaction and migration bail out when they cannot lock the
>      zspage
>
> 3) readers can sleep and can spin waiting for a lock
>    - other (even preempted) readers don't block new readers
>    - writers don't sleep, they always unlock

That's useful, thanks. If we go with custom locking we need to document this
clearly and add debug checks where possible. (There is a rough sketch of how
I read these rules at the end of this mail.)

>
> > I wonder if there's a way to rework the locking instead to avoid the
> > nesting. It seems like sometimes we lock the zspage with the pool lock
> > held, sometimes with the class lock held, and sometimes with no lock
> > held.
> >
> > What are the rules here for acquiring the zspage lock?
>
> Most of that code is not written by me, but I think the rule is to disable
> "migration" be it via pool lock or class lock.

It seems like we're not holding either of these locks in async_free_zspage()
when we call lock_zspage(). Is it safe for a different reason?

>
> > Do we need to hold another lock just to make sure the zspage does not go
> > away from under us?
>
> Yes, the page cannot go away via "normal" path:
>    zs_free(last object) -> zspage becomes empty -> free zspage
>
> so when we have active mapping() it's only migration and compaction
> that can free zspage (its content is migrated and so it becomes empty).
>
> > Can we use RCU or something similar to do that instead?
>
> Hmm, I don't know... zsmalloc is not "read-mostly", it's whatever data
> patterns the clients have. I suspect we'd need to synchronize RCU every
> time a zspage is freed: zs_free() [this one is complicated], or migration,
> or compaction? Sounds like an anti-pattern for RCU?

Can't we use kfree_rcu() instead of synchronizing? Not sure if this would
still be an antipattern tbh.
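
Something like the below is what I had in mind -- only a rough sketch, not
against the actual code. struct zspage comes from a kmem_cache, so it would
presumably be call_rcu() with a callback doing kmem_cache_free() rather than
kfree_rcu() proper; the rcu_head field and all helper names here are made up
for illustration:

#include <linux/container_of.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* hypothetical: an rcu_head embedded in struct zspage */
struct zspage {
	/* ... existing fields ... */
	struct rcu_head rcu;
};

/* stand-in for pool->zspage_cachep */
static struct kmem_cache *zspage_cachep;

static void zspage_rcu_free(struct rcu_head *head)
{
	struct zspage *zspage = container_of(head, struct zspage, rcu);

	kmem_cache_free(zspage_cachep, zspage);
}

/*
 * Instead of a synchronize_rcu() on every free, defer the actual free
 * past a grace period; anyone who looked the zspage up under
 * rcu_read_lock() can keep using it until they unlock.
 */
static void free_zspage_deferred(struct zspage *zspage)
{
	call_rcu(&zspage->rcu, zspage_rcu_free);
}

That would avoid waiting for a grace period on every zspage free, at the cost
of keeping the zspage memory around a little longer.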
It just seems like the current locking scheme is really complicated :/
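
For reference, here is the rough sketch I mentioned above of how I read rules
1)-3) -- not the actual patch, all names (zspage_lock, zspage_write_trylock,
etc.) are made up for illustration:

#include <linux/atomic.h>
#include <linux/preempt.h>
#include <linux/processor.h>	/* cpu_relax() */

/* negative ->state means write-locked, non-negative is the reader count */
#define ZSPL_WLOCKED	(-1)

struct zspage_lock {
	atomic_t state;
};

/* rules 1) and 2): writers never sleep and never spin, they only try-lock */
static bool zspage_write_trylock(struct zspage_lock *zl)
{
	preempt_disable();
	if (atomic_cmpxchg(&zl->state, 0, ZSPL_WLOCKED) == 0)
		return true;
	preempt_enable();
	return false;
}

static void zspage_write_unlock(struct zspage_lock *zl)
{
	atomic_set_release(&zl->state, 0);
	preempt_enable();
}

/*
 * rule 3): readers may spin, but only on a writer that runs with
 * preemption disabled and never sleeps, so the wait is bounded; a reader
 * that is itself preempted or sleeps does not block other readers.
 */
static void zspage_read_lock(struct zspage_lock *zl)
{
	int old;

	for (;;) {
		old = atomic_read(&zl->state);
		if (old >= 0 &&
		    atomic_cmpxchg(&zl->state, old, old + 1) == old)
			return;
		cpu_relax();
	}
}

static void zspage_read_unlock(struct zspage_lock *zl)
{
	atomic_dec(&zl->state);
}

The point being: the only thing a reader can ever spin on is a non-sleeping,
non-preemptible writer, while readers themselves are free to be preempted
without blocking other readers. Is that the intent?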