Re: Oops or bad page in page_alloc.c

Yimin Deng <yimin11.deng@xxxxxxxxx> · Thu, 12 May 2022 11:55:49 +0800

Hi Sebastian,

Thanks a lot for your quick reply!

CONFIG_HAVE_PREEMPT_LAZY=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT__LL is not set
# CONFIG_PREEMPT_RTB is not set
# CONFIG_PREEMPT_RT_FULL is not set

CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set

Neither CONFIG_PREEMPT_RT_FULL nor CONFIG_SLUB is enabled, so I think
this is not related to the issue fixed in f1aca90802af9 ("Revert "slub:
delay ctor until the object is requested""). We share the kernel
source code across products but use a different configuration on each.
The applications on this product are non-RT applications.

This issue was reported on different nodes, so it does not seem to be
caused by bad RAM. I'm checking whether other CPUs in the AMP setup
could be overwriting the memory.

I will consider your suggestions of disabling memory compaction and
enabling list debugging.
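
A note on why list debugging looks relevant: the faulting address
0x00100104 in the oops quoted below matches LIST_POISON1 (0x00100100 in
include/linux/poison.h) plus the 4-byte offset of `prev` in a 32-bit
struct list_head. That is the first location a second list_del() on an
already deleted entry would write to. The following is only a minimal
userspace sketch of that idea, not kernel code. The names
list_del_checked(), unchecked_fault_address(), self_check() and
TARGET_PTR_SIZE are made up for illustration, and it assumes
POISON_POINTER_DELTA is 0 on this target.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Poison values as in include/linux/poison.h, assuming POISON_POINTER_DELTA == 0. */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0x00100100)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0x00200200)

/* Pointer size on the 32-bit e500 target (assumption), not on the build host. */
#define TARGET_PTR_SIZE 4u

void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/*
 * Delete with sanity checks in the spirit of list debugging (hypothetical
 * helper). Returns 0 on success and -1 if the entry looks already deleted
 * or its neighbours do not point back at it.
 */
int list_del_checked(struct list_head *entry)
{
    if (entry->next == LIST_POISON1 || entry->prev == LIST_POISON2)
        return -1;
    if (entry->prev->next != entry || entry->next->prev != entry)
        return -1;
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    /* Poison the entry so that a later reuse is recognisable. */
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
    return 0;
}

/*
 * Target address of the first store an unchecked second delete would do:
 * next->prev = prev, with next == LIST_POISON1 and prev at offset 4.
 */
uint32_t unchecked_fault_address(void)
{
    return 0x00100100u + TARGET_PTR_SIZE;
}

/* The first delete succeeds; the second one is rejected instead of faulting. */
void self_check(void)
{
    struct list_head head, page_lru;

    list_init(&head);
    list_add(&page_lru, &head);
    assert(list_del_checked(&page_lru) == 0);
    assert(list_del_checked(&page_lru) == -1);
    assert(unchecked_fault_address() == 0x00100104u);
}
```

If this reading is right, a list entry is being deleted twice (or its
memory is being reused after deletion), and list debugging should
report that at the point of the second delete rather than as a fault.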

Sincerely appreciate your support!

B.R.
Yimin

Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx> wrote on Thu, 12 May 2022 at 00:18:
>
> On 2022-05-09 15:40:43 [+0800], Yimin Deng wrote:
> > Hi
> Hi,
>
> > I encountered an oops in isolate_pcp_pages() and a bad page in
> > get_page_from_freelist().
> >
> > linux: 3.12.37-rt51  (CONFIG_PREEMPT_RT_BASE not enabled)
> > arch: PowerPC (e500)
> …
> What do you mean by CONFIG_PREEMPT_RT_BASE is not enabled? Is
> CONFIG_PREEMPT_RT_FULL enabled or none of those options?
>
> > Any suggestions will be appreciated!
> >
> > [18857088.953420] Unable to handle kernel paging request for data at
> > address 0x00100104
> > [18857089.046143] Faulting instruction address: 0xc0075624
> …
> > [18857090.073578] NIP [c0075624] isolate_pcp_pages+0x84/0xc4
> > [18857090.138173] LR [c0078f24] free_hot_cold_page+0x124/0x174
> …
>
> I can't even tell whether I have seen a report like yours before. I do
> remember seeing "bad page state" reports earlier, but I don't
> remember how they went away. I know that I had two 8572DS systems and
> one started to report all kinds of different errors (including "bad page
> state"), but this was (probably) due to bad RAM since the other system
> never had this error even though both had the same configuration.
>
> Your kernel is kind of old. The latest v3.12 is v3.12.74-rt99 which
> contains a few bug fixes including commit
>     f1aca90802af9 ("Revert "slub: delay ctor until the object is requested"")
>
> which is probably not what you are seeing, but it is a possible crash.
> You could disable memory compaction and so on, but as far as I remember
> that could lead to higher latencies in some cases, not to a crash.
> You could enable list debugging in case an entry is added/removed
> multiple times.
> The e500 support is quite good upstream so you could upgrade to a later
> kernel (one of the current LTS kernels).
>
> > B.R.
> > Yimin
>
> Sebastian