On Wed, 13 Mar 2024, Jianfeng Wang wrote:
I am not sure that the RCU change will solve the lockup problem. The reason is that iterating over a super long list of partial slabs is a problem in itself: on a non-preemptive kernel, count_partial() can be stuck in the loop for a while, which can cause problems. Also, even if we check the list ownership for slabs, we may spend too much time in the loop if no updater shows up, or fail and redo the loop many times if several updates happen. The latter can exacerbate the lockup issue. So, in the end, reading /proc/slabinfo can take a super long time just for a counter that may be changing all the time.
Well, we could also cache the values somehow to avoid the scans? Invalidate the counter if something significant happens.
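To make the caching idea concrete, here is a rough user-space sketch, not kernel code. Every name in it (struct slab_model, struct partial_list_model, list_add_slab(), count_free_cached()) is made up for illustration, and it assumes the simplest possible policy: any list update invalidates the cached total, and the next reader pays for one full scan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a partial slab; not the kernel's struct slab. */
struct slab_model {
	struct slab_model *next;
	unsigned long free_objects;
};

/* Hypothetical stand-in for a per-node partial list with a cached total. */
struct partial_list_model {
	struct slab_model *head;
	unsigned long cached_free;	/* last computed total */
	bool cache_valid;		/* cleared on every list update */
	unsigned long scans;		/* number of full scans, for illustration */
};

/* Updater side: changing the list invalidates the cached counter. */
static void list_add_slab(struct partial_list_model *l, struct slab_model *s)
{
	assert(l != NULL && s != NULL);
	s->next = l->head;
	l->head = s;
	l->cache_valid = false;
}

/* Reader side: walk the list only when the cached value is stale. */
static unsigned long count_free_cached(struct partial_list_model *l)
{
	assert(l != NULL);
	if (!l->cache_valid) {
		unsigned long total = 0;

		for (struct slab_model *s = l->head; s; s = s->next)
			total += s->free_objects;
		l->cached_free = total;
		l->cache_valid = true;
		l->scans++;
	}
	return l->cached_free;
}
```

The sketch ignores locking entirely, and it does not say what counts as "something significant". With frequent updates the cache would be invalid most of the time, so a real version would need a coarser invalidation rule.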
Thus, I prefer the "guesstimate" approach, even if the number is inaccurate or biased. Let me know if this makes sense.
Come up with a patch and then let's see how well it works.
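For reference while discussing, one possible shape of the "guesstimate" is sketched below as plain user-space C. This is only an assumption about what such a patch might do, not the proposed patch itself: it scans at most max_scan slabs from the head of the list and scales the partial sum by the total number of slabs on the list. The names struct pslab, estimate_free(), nr_partial and max_scan are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a partial slab; not the kernel's struct slab. */
struct pslab {
	struct pslab *next;
	unsigned long free_objects;
};

/*
 * Estimate the number of free objects on a list of nr_partial slabs
 * while visiting at most max_scan of them. If the whole list fits in
 * the scan budget the result is exact; otherwise the sampled sum is
 * extrapolated linearly, so it is biased when the head of the list is
 * not representative of the rest.
 */
static unsigned long estimate_free(const struct pslab *head,
				   unsigned long nr_partial,
				   unsigned long max_scan)
{
	unsigned long scanned = 0, free_sum = 0;

	assert(max_scan > 0);
	for (const struct pslab *s = head; s && scanned < max_scan; s = s->next) {
		free_sum += s->free_objects;
		scanned++;
	}

	if (scanned == 0)
		return 0;		/* empty list */
	if (scanned >= nr_partial)
		return free_sum;	/* saw every slab: exact count */

	/* Scale the sample up to the full list length. */
	return free_sum * nr_partial / scanned;
}
```

The point of the bound is that the time spent in the loop no longer depends on the list length, which is what matters for the lockup. The price is accuracy: the extrapolation only looks at the first max_scan slabs.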