On Thu, Oct 21, 2021 at 04:50:55PM -0700, Luis Chamberlain wrote: > On Wed, Oct 20, 2021 at 09:55:46AM +0800, Ming Lei wrote: > > When zram module is started to be unloaded, no one should use all zram > > disks at that time. But disk's show() or store() attribute method may be > > running, this way is expected because device_del()(called from > > del_gendisk) will drain all pending show()/store(). > > How about: > > ---------- > When the zram module is being unloaded, no one should be using the > zram disks. However even while being unloaded the zram module's > sysfs attributes might be poked at to re-configure the zram module. 're-configure the zram module' is confused, and it should be zram device. > This is expected, and kernfs ensures that these operations complete > before device_del() completes. Otherwise, the above is pretty good, and I will take that in V3. > ---------- > > > But reset_store() may set ->claim which will fail zram_remove(), when > > this happens, the zram device will be leaked and the warning of 'Error: > > Removing state 63 which has instances left.' is triggered during > > unloading module. > > Note: the "Removing state 63 which has instances left" does not > necessarily mean the struct zram is leaked. It just means that the per > cpu struct zcomp instances are leaked, given the CPU hotplug removal > callback is in charge of cleaning them up. That this gave us a hint that > we do in fact leak a struct zram device as well is separate from what > the message means, but I do agree it should be *that* because as you > noted, LTP does not yet try to make spaghetti with hot_add_show(). > > So, how about: > > ---------- > When the reset sysfs op (reset_store()) gets called the zram->claim will > be set, and this prevents zram_remove() from removing a zram device. If one > is unloading the module and one races to run the reset sysfs op we end > up leaking the struct zram device. We learned about this issue through > the error "Error: Removing state 63 which has instances left". While > this just means the any of the per cpu struct zcomp instances are > leaked when we're removing the CPU hotplug multistate callback, the > test case (running LTP zram02.sh twice in different terminals) that > allows us to reproduce this issue only mucks with reseting the devices, > not hot_add_show(). Setting zram->claim will prevent zram_remove() from > completing successfully, and so on module removal zram_remove_cb() does > not tell us when it failed to remove the full struct zram device and > it leaks. > ---------- commit log is for helping people to understand the change, and too long or exposing too much details may not serve that purpose since we also talk with code, and I believe the following words are clear enough for explaining the problem, and it is short & straightforward, and won't make people terrified, :-) But reset_store() may set ->claim which will fail zram_remove(), when this happens, zram_reset_device() is bypassed, and zram->comp can't be destroyed, so the warning of 'Error: Removing state 63 which has instances left.' is triggered during unloading module. > > This begs the question, should we not then also make zram_remove_cb() > chatty on failure? When zram_remove() is run from module unloading path, it shouldn't be failed cause no one owns any zram disk since unloading module, that is why I add WARN_ON_ONCE() in zram_remove_cb() for document benefit because zram_remove() is called from two contexts. But it can be failed in case of hot removing. > > > Fixes the issue by not failing zram_remove() if ->claim is set, and > > we actually need to do nothing in case that zram_reset() is running > > since del_gendisk() will wait until zram_reset() is done. > > Sure this all looks like a reasonable alternative to the issue without > using a generic lock. It does beg the questions if this suffices for > the other oddball sysfs ops, but we can take that up with time. ->claim is only set in reset_store(), so it is enough for avoiding zram_remove failure during unloading module. > > > Reported-by: Luis Chamberlain <mcgrof@xxxxxxxxxx> > > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx> > > In so far as the code is concerned: > > Reviewed-by: Luis Chamberlain <mcgrof@xxxxxxxxxx> Thanks for the review! -- Ming