kernel behaviour, was Re: dash behaviour

Finn Thain <fthain@xxxxxxxxxxxxxx> · Mon, 10 Apr 2023 19:39:22 +1000 (AEST)

On Mon, 10 Apr 2023, Michael Schmitz wrote:

So I guess this bug has more to do with timing and little to do with 
state, contrary to my guesswork above. And no doubt I will have to

What may still vary is physical mapping - I remember you had used some 
tool before to parse /proc/<pid>/pagemap to determine the physical 
addresses for task stack areas? Or am I misremembering that from some 
other bug?

You're right, back in September 2021 when I was chasing a different bug we 
did discuss tools to look at physical mappings. I don't think that would 
help here though. We know the failure is not bad RAM because multiple Macs 
fail in the same way. Also, there's no DMA taking place on these 
particular machines.
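
For reference, a lookup of that kind can be sketched in Python as below. 
This is not the tool from 2021, only an illustration based on the pagemap 
layout described in the kernel documentation: one 64-bit entry per virtual 
page, PFN in bits 0-54, bit 62 for "swapped", bit 63 for "present". The 
function names are made up for this example. On kernels since 4.2 the PFN 
field reads as zero without CAP_SYS_ADMIN, so the sketch treats a zero PFN 
as "unknown".

```python
import os
import struct

PAGEMAP_ENTRY_SIZE = 8          # one 64-bit entry per virtual page
PFN_MASK = (1 << 55) - 1        # bits 0-54 hold the page frame number
PAGE_SWAPPED = 1 << 62          # bit 62: page is in swap
PAGE_PRESENT = 1 << 63          # bit 63: page is present in RAM


def decode_pagemap_entry(entry):
    """Split a raw pagemap entry into (present, swapped, pfn or None)."""
    present = bool(entry & PAGE_PRESENT)
    swapped = bool(entry & PAGE_SWAPPED)
    # The PFN field is only meaningful for a present, non-swapped page.
    pfn = entry & PFN_MASK if present and not swapped else None
    return present, swapped, pfn


def virt_to_phys(pid, vaddr, page_size=None):
    """Return the physical address backing vaddr in process pid, or None."""
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")
    offset = (vaddr // page_size) * PAGEMAP_ENTRY_SIZE
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek(offset)
        data = f.read(PAGEMAP_ENTRY_SIZE)
    if len(data) != PAGEMAP_ENTRY_SIZE:
        return None                      # address outside the mapped range
    (entry,) = struct.unpack("=Q", data) # entries use native byte order
    _present, _swapped, pfn = decode_pagemap_entry(entry)
    if not pfn:
        return None                      # not resident, or PFN hidden
    return pfn * page_size + (vaddr % page_size)
```

The stack address to feed into virt_to_phys() would come from the 
"[stack]" line of /proc/<pid>/maps.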

contradict myself again if/when it turns out that uninitialized memory 
is a factor :-/

I haven't found a config option to initialize memory returned by the 
kernel page allocators, so not sure how to test that ...

I was able to find some command line options (init_on_alloc, init_on_free) 
and the related Kconfig symbols (CONFIG_INIT_ON_ALLOC_DEFAULT_ON, 
CONFIG_INIT_ON_FREE_DEFAULT_ON).

Provided the compiler supports -fzero-call-used-regs=used-gpr, there's also 
CONFIG_ZERO_CALL_USED_REGS. There's also CONFIG_INIT_STACK_ALL_ZERO 
(-ftrivial-auto-var-init=zero).
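
Of the options above, only init_on_alloc and init_on_free can be switched 
at boot time, so it is worth confirming what a given test kernel was 
actually booted with. Here is a rough sketch in Python, assuming the 
documented "0 | 1" format for both parameters and assuming that the last 
occurrence on the command line wins. The helper names are hypothetical, 
and the defaults have to be supplied by hand from the kernel's .config 
(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, CONFIG_INIT_ON_FREE_DEFAULT_ON).

```python
def boot_option(cmdline, name, default):
    """Return the effective boolean for name=0|1, else the Kconfig default."""
    value = default
    for token in cmdline.split():
        key, sep, arg = token.partition("=")
        # Only the documented values 0 and 1 are recognised here; anything
        # else leaves the current value alone (an assumption of this sketch).
        if key == name and sep and arg in ("0", "1"):
            value = (arg == "1")
    return value


def report(cmdline, alloc_default=False, free_default=False):
    """Summarise both memory-initialisation options for one command line."""
    return {
        "init_on_alloc": boot_option(cmdline, "init_on_alloc", alloc_default),
        "init_on_free": boot_option(cmdline, "init_on_free", free_default),
    }
```

On the test machine this would be called as 
report(open("/proc/cmdline").read()), with the two defaults set to match 
the kernel configuration.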

The problem with these options is that they may produce a large effect on 
the timing of events but they should still have no effect on the behaviour 
of a correct userspace program.

Since we are dealing with a suspect userspace program, what could we learn 
from such a test? E.g. if the crashing stopped one could simply attribute 
that to the timing change. I suppose, if the crashing became more 
frequent, perhaps that would help debug the userspace program. So maybe 
it's worth a try...