On (25/02/05 19:06), Yosry Ahmed wrote:
> > > For example, the compaction/migration code could be sleeping holding the
> > > write lock, and a map() call would spin waiting for that sleeping task.
> >
> > write-lock holders cannot sleep, that's the key part.
> >
> > So the rules are:
> >
> > 1) writer cannot sleep
> >    - migration/compaction runs in atomic context and grabs the
> >      write-lock only from atomic context
> >    - the write-locking function disables preemption before lock(), just
> >      to be safe, and enables it after unlock()
> >
> > 2) writer does not spin waiting
> >    - that's why there is only a write_try_lock function
> >    - compaction and migration bail out when they cannot lock the
> >      zspage
> >
> > 3) readers can sleep and can spin waiting for a lock
> >    - other (even preempted) readers don't block new readers
> >    - writers don't sleep, they always unlock
>
> That's useful, thanks. If we go with custom locking we need to document
> this clearly and add debug checks where possible.

Sure. That's what it currently looks like (can always improve):

---
/*
 * zspage lock permits preemption on the reader-side (there can be multiple
 * readers). Writers (exclusive zspage ownership), on the other hand, are
 * always run in atomic context and cannot spin waiting for a (potentially
 * preempted) reader to unlock zspage. This, basically, means that writers
 * can only call write-try-lock and must bail out if it didn't succeed.
 *
 * At the same time, writers cannot reschedule under zspage write-lock,
 * so readers can spin waiting for the writer to unlock zspage.
 */
static void zspage_read_lock(struct zspage *zspage)
{
	atomic_t *lock = &zspage->lock;
	int old = atomic_read_acquire(lock);

	do {
		if (old == ZS_PAGE_WRLOCKED) {
			cpu_relax();
			old = atomic_read_acquire(lock);
			continue;
		}
	} while (!atomic_try_cmpxchg_acquire(lock, &old, old + 1));

#ifdef CONFIG_DEBUG_LOCK_ALLOC
	rwsem_acquire_read(&zspage->lockdep_map, 0, 0, _RET_IP_);
#endif
}

static void zspage_read_unlock(struct zspage *zspage)
{
	atomic_dec_return_release(&zspage->lock);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	rwsem_release(&zspage->lockdep_map, _RET_IP_);
#endif
}

static bool zspage_try_write_lock(struct zspage *zspage)
{
	atomic_t *lock = &zspage->lock;
	int old = ZS_PAGE_UNLOCKED;

	preempt_disable();
	if (atomic_try_cmpxchg_acquire(lock, &old, ZS_PAGE_WRLOCKED)) {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
		rwsem_acquire(&zspage->lockdep_map, 0, 0, _RET_IP_);
#endif
		return true;
	}

	preempt_enable();
	return false;
}

static void zspage_write_unlock(struct zspage *zspage)
{
	atomic_set_release(&zspage->lock, ZS_PAGE_UNLOCKED);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	rwsem_release(&zspage->lockdep_map, _RET_IP_);
#endif
	preempt_enable();
}
---

Maybe I'll just copy-paste the locking rules list; a list is always cleaner.
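Just to show how the writer side is meant to be used (an illustrative sketch,
not the actual migration code; the function name and error handling here are
made up), rule 2 boils down to:

---
/*
 * Illustrative sketch only: the expected writer-side pattern for
 * migration/compaction. zs_try_migrate_zspage() is a made-up name,
 * not a real caller.
 */
static int zs_try_migrate_zspage(struct zspage *zspage)
{
	/* Writers never spin: try-lock and bail out on failure. */
	if (!zspage_try_write_lock(zspage))
		return -EBUSY;	/* caller skips or retries this zspage */

	/*
	 * zspage_try_write_lock() left preemption disabled, so nothing
	 * below is allowed to sleep.
	 */

	/* ... move objects, fix up zspage state ... */

	zspage_write_unlock(zspage);
	return 0;
}
---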
> > > I wonder if there's a way to rework the locking instead to avoid the
> > > nesting. It seems like sometimes we lock the zspage with the pool lock
> > > held, sometimes with the class lock held, and sometimes with no lock
> > > held.
> > >
> > > What are the rules here for acquiring the zspage lock?
> >
> > Most of that code is not written by me, but I think the rule is to disable
> > "migration", be it via the pool lock or the class lock.
>
> It seems like we're not holding either of these locks in
> async_free_zspage() when we call lock_zspage(). Is it safe for a
> different reason?

I think we hold the size class lock there.

async-free is only for zspages that reached a 0 usage ratio (the empty
fullness group), so they don't hold any objects any more; from there such
zspages either get freed, or find_get_zspage() recovers them from fullness 0
and allocates an object. Both paths are synchronized by the size class lock.

> > Hmm, I don't know... zsmalloc is not "read-mostly", it's whatever data
> > patterns the clients have. I suspect we'd need to synchronize RCU every
> > time a zspage is freed: zs_free() [this one is complicated], or migration,
> > or compaction? Sounds like an anti-pattern for RCU?
>
> Can't we use kfree_rcu() instead of synchronizing? Not sure if this
> would still be an antipattern tbh.

Yeah, I don't know. The last time I wrongly used kfree_rcu() it caused a 27%
performance drop (some internal code). This zspage case may well be better,
but it still has the potential to generate a high number of RCU calls,
depending on the clients. The chances of that are probably too high.

Apart from that, kvfree_rcu() can sleep, as far as I understand, so zram might
have some extra things to deal with, namely slot-free notifications, which can
be called from softirq and are always called under a spinlock:

	mm slot-free -> zram slot-free -> zs_free -> empty zspage -> kfree_rcu

> It just seems like the current locking scheme is really complicated :/

That's very true.
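PS: the slot-free chain above in (made-up) code form, just to illustrate the
atomic-context constraint; this is a sketch with hypothetical device/field
names, not actual zram code:

---
/*
 * Sketch only, made-up names: the slot-free notification runs under a
 * spinlock, possibly from softirq, so zs_free() and the empty-zspage
 * freeing underneath it must not sleep.
 */
struct example_dev {
	spinlock_t slot_lock;
	struct zs_pool *mem_pool;
	struct { unsigned long handle; } *table;
};

static void example_slot_free_notify(struct example_dev *dev,
				     unsigned long index)
{
	spin_lock(&dev->slot_lock);	/* atomic context from here on */
	zs_free(dev->mem_pool, dev->table[index].handle);
	/* an empty zspage gets freed on this path as well */
	spin_unlock(&dev->slot_lock);
}
---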