Re: kernel behaviour, was Re: dash behaviour

Hi Finn,

On 10.04.2023 at 21:39, Finn Thain wrote:
On Mon, 10 Apr 2023, Michael Schmitz wrote:


So I guess this bug has more to do with timing and little to do with
state, contrary to my guesswork above. And no doubt I will have to

What may still vary is physical mapping - I remember you had used some
tool before to parse /proc/<pid>/pagemap to determine the physical
addresses for task stack areas? Or am I misremembering that from some
other bug?
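
For reference, a minimal sketch of such a pagemap lookup, not the tool that was used back then. It assumes the documented pagemap layout (one native-endian 64-bit entry per virtual page, bit 63 = present, bits 0-54 = PFN). The function names are made up for illustration. Reading real PFNs needs root (CAP_SYS_ADMIN); otherwise the kernel reports the PFN as zero, which this sketch treats as "unknown".

```python
import os
import struct

PAGEMAP_ENTRY_SIZE = 8          # one 64-bit entry per virtual page
PFN_MASK = (1 << 55) - 1        # bits 0-54 hold the page frame number
PRESENT_BIT = 1 << 63           # bit 63: page is present in RAM


def decode_pagemap_entry(entry, page_size, vaddr):
    """Return the physical address for vaddr, or None if it is not known."""
    if not entry & PRESENT_BIT:
        return None             # swapped out or not mapped
    pfn = entry & PFN_MASK
    if pfn == 0:
        return None             # assumption: zero PFN means hidden from us
    return pfn * page_size + (vaddr % page_size)


def virt_to_phys(pid, vaddr):
    """Hypothetical helper: look up vaddr in /proc/<pid>/pagemap."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    offset = (vaddr // page_size) * PAGEMAP_ENTRY_SIZE
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek(offset)
        data = f.read(PAGEMAP_ENTRY_SIZE)
    if len(data) != PAGEMAP_ENTRY_SIZE:
        return None
    # "=Q": native byte order, so this also works on big-endian m68k
    (entry,) = struct.unpack("=Q", data)
    return decode_pagemap_entry(entry, page_size, vaddr)
```

The stack's virtual address range would come from the [stack] line in /proc/<pid>/maps.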


You're right, back in September 2021 when I was chasing a different bug we
did discuss tools to look at physical mappings. I don't think that would
help here though. We know the failure is not bad RAM because multiple Macs
fail in the same way. Also, there's no DMA taking place on these
particular machines.

contradict myself again if/when it turns out that uninitialized memory
is a factor :-/

I haven't found a config option to initialize memory returned by the
kernel page allocators, so not sure how to test that ...


I was able to find some command line options (init_on_alloc, init_on_free)
and the related Kconfig symbols (CONFIG_INIT_ON_ALLOC_DEFAULT_ON,
CONFIG_INIT_ON_FREE_DEFAULT_ON).

Right - not sure how I managed to miss those.

init_on_free might delay the boot process a while! But I would guess init_on_alloc should be OK in the first instance.
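
To double-check which setting a given test boot actually ran with, something like the following sketch could be used. It assumes that the command line option overrides the Kconfig default and that the last occurrence wins. The function name is made up for illustration, and the set of accepted values is a simplified subset of what the kernel's boolean parser takes.

```python
TRUE_VALUES = {"1", "y", "Y", "on"}     # simplified subset, an assumption
FALSE_VALUES = {"0", "n", "N", "off"}


def effective_init_setting(cmdline, name, kconfig_default):
    """Return True/False for e.g. name='init_on_alloc'.

    cmdline is the contents of /proc/cmdline; kconfig_default is whether
    the matching CONFIG_*_DEFAULT_ON symbol was enabled in the build.
    """
    value = kconfig_default
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key != name or not sep:
            continue
        if val in TRUE_VALUES:
            value = True
        elif val in FALSE_VALUES:
            value = False
        # unrecognised values leave the current setting untouched
    return value
```

For example, pass open("/proc/cmdline").read() as cmdline on the running system.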


Given the compiler supports -fzero-call-used-regs=used-gpr there's also
CONFIG_ZERO_CALL_USED_REGS. Also CONFIG_INIT_STACK_ALL_ZERO
(-ftrivial-auto-var-init=zero).

The problem with these options is that they may produce a large effect on
the timing of events but they should still have no effect on the behaviour
of a correct userspace program.

Since we are dealing with a suspect userspace program, what could we learn
from such a test? E.g. if the crashing stopped one could simply attribute

We don't know for certain that we are dealing with a suspect userspace program. It might just be a change in a previously fine program that now exposes a subtle kernel bug (undetected for quite a long time, but we've seen a few of those now...).

that to the timing change. I suppose, if the crashing became more
frequent, perhaps that would help debug the userspace program. So maybe
it's worth a try...

We'd then have to try and minimize the impact on timing, by instead initializing a 'shadow' page reserved for that purpose. Though I suspect the loop over the pages might be optimized away in that case. See include/linux/highmem.h:clear_highpage_kasan_tagged() and mm/page_alloc.c:kernel_init_pages() ...

Cheers,

	Michael




