Re: [PATCH v2] mm: Make failslab writable again

Alexander Atanasov <alexander.atanasov@xxxxxxxxxxxxx> · Tue, 27 Sep 2022 10:44:20 +0300

Hello,

On 27.09.22 3:49, Hyeonggon Yoo wrote:
On Fri, Sep 23, 2022 at 10:34:28AM +0300, Alexander Atanasov wrote:
Hello,

On 21.09.22 14:30, Hyeonggon Yoo wrote:
On Tue, Sep 20, 2022 at 03:11:11PM +0300, Alexander Atanasov wrote:
In (060807f841ac mm, slub: make remaining slub_debug related attributes
read-only) failslab was made read-only.
I think it became a collateral victim to the two other options for which
the reasons are perfectly valid.
Here is why:
   - sanity_checks and trace are slab internal debug options,
     failslab is used for fault injection.
   - for fault injections, which by presumption are random, it
     does not matter if it is not set atomically. And you need to
     set atleast one more option to trigger fault injection.
   - in a testing scenario you may need to change it at runtime
     example: module loading - you test all allocations limited
     by the space option. Then you move to test only your module's
     own slabs.
   - when set by command line flags it effectively disables all
     cache merges.

Maybe we can make failslab= boot parameter to consider cache filtering?

With that, just pass something like this:
	failslab=X,X,X,X,cache_filter slub_debug=A,<cache-name>>

Users should pass slub_debug=A,<cache-name> anyway to prevent cache merging.

It will be good to have this in case you want to test cache that is used
early. But why push something to command line option only when it can be
changed at runtime?

Hmm okay. I'm not against changing it writable. (it looks okay to me.)

Okay. Good to know that.

Just wanted to understand your use case!
Can you please elaborate why booting with slub_debug=A,<your cache name>
and enabling cache_filter after boot does not work?

I didn't say it does not work - it does work but requires reboot. You 
may want to test variations of caches for example. Cache A, Cache B ... 
C and so on one by one. Reboots might be fast these days with VMs but 
you may not be able to test everything in a VM. And ... reboots used to 
be the signature move of one Other OS.

Or is it trying to changnig these steps,

FROM
	1. booting with slub_debug=A,<cache name>
	2. write to cache_filter to enable cache filtering
	3. setup probability, interval, times, size

TO

	1. write to failslab attribute of <cache name> (may fail it has alias)
	2. write to cache_filter to enable cache filtering
	3. setup probability, interval, times, size
?

as you may know, SLAB_FAILSLAB does nothing whens
cache_filter is disabled, and you should pass slub_debug=A,<cache name> anyway

Okay , i think there awaits another problem:
bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags)
{
...

        if (failslab.cache_filter && !(s->flags & SLAB_FAILSLAB))
                return false;
...
	return should_fail(&failslab.attr, s->object_size);
}

So if you do not have cache_filter set ... you go to should_fail for all 
slabs.
I've been hit by that and spend a lot of time trying to understand why i 
got crashes at random places. And the reason was that i read an old 
documentation that said cache_filter is writable and i blindly wrote 1 
to it. If the intent is to only work with cache filter set - then i will 
update the patch to do so. This is the only place where SLAB_FAILSLAB is 
explicitly tested, other places check it as part of SLAB_NEVER_MERGE.

But even for all caches it is kind of possible to test with size(space) 
which is in turn useful because you need to figure out how you handle 
failures from external caches - external to your code under test and you 
don't want to keep track for all of them (same goes for too much options 
in command line).

to prevent doing cache merging with <cache name>.

Or you can pass SLAB_FAILSLAB from your module when creating the cache 
to prevent merge when under test.

--
Regards,
Alexander Atanasov