[BUG REPORT][ARM] Compacting glibc code pages causes random process crashes in user space (SIGILL, SIGSEGV)

"Schmitz, Christoph" <christoph.schmitz@xxxxxxx> · Thu, 14 Dec 2023 01:07:09 +0000

Hi, I am part of a team of Linux developers at HPE who work on various embedded boards, one of which is based on Freescale's IMX6 SoC design.
Our IMX6DL is a dual-core ARM solution (ARMv7 / ARM Cortex-A9) which runs at 1.2 GHz. We have 1 GB of main memory.

We have puzzled over this problem for many months now, so we are desperate to develop a "clean" solution (not based on avoidance strategies). We are not 100% sure that the problem does not reside elsewhere, so we must consult with the subject matter experts.

It seems that we have uncovered a race in the mm/compaction.c and mm/migrate.c code which can cause random crashes in user space applications, rather indiscriminately and with measurable probability.

Due to a heightened focus on security concerns, we started upgrading the kernel
from our previous builds (based on 4.9.11 - https://github.com/Freescale/linux-fslc/tree/4.9-2.3.x-imx)
to a much more contemporary build (based on 5.15.77 - https://github.com/Freescale/linux-fslc/tree/5.15.x+fslc).

This is where the trouble started: During weekly regression tests, we observed at least 2 core dumps in every run. Over many weeks, it became apparent that these were no ordinary stability issues:
* Core dumps affected LOTS of different processes. Open source processes such as gawk, python or apache (normally super stable) were affected.
* Core dumps were due to either SIGSEGV (80%) or SIGILL (20%).
* Crashes tended to affect processes that were scheduled often ("CPU hogs") and seemed to prefer the ones that were scheduled with elevated priorities (e.g. corosync - nice -20). Some processes even use SCHED_FIFO, prio 90 at times (e.g. a proprietary Broadcom daemon).
* The core dumps were hard to analyze:
                - Stack content was often corrupted, making unwinding impossible. Including the frame pointer helped a little bit, but not always.
                - SIGILL never revealed any truly illegal instructions - the code always looked OK in the core files (and in line with what we compiled).
                - Crash sites were varied. The only common denominator was that they appeared in library code and tended to cluster around blocking synchronization primitives (e.g. pthread_mutex_lock, pthread_cond_wait, etc.)

We had noticed before that turning kernel tracing ON would aggravate the problem, but tracing also gave us new insight into what was going on right before the fatal signal was generated. We noticed that kcompactd0 was ALWAYS running right before a core dump was observed.
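For anyone who wants to correlate crashes with compaction activity without full tracing, here is a minimal sketch that snapshots the compaction and migration counters in /proc/vmstat and reports which ones moved. It is not what we used (we used kernel tracing); the function names are made up for illustration, and it assumes the kernel exposes the usual compact_* and pgmigrate_* counters.

```python
# Counter name prefixes in /proc/vmstat related to compaction and page migration.
PREFIXES = ("compact_", "pgmigrate_")


def parse_vmstat(text):
    """Return {name: value} for the compaction/migration counters in vmstat text."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith(PREFIXES):
            counters[parts[0]] = int(parts[1])
    return counters


def counter_delta(before, after):
    """Return only the counters that changed between two snapshots."""
    return {
        name: value - before.get(name, 0)
        for name, value in after.items()
        if value != before.get(name, 0)
    }


def read_vmstat(path="/proc/vmstat"):
    """Read and parse the live counters (hypothetical helper for this sketch)."""
    with open(path) as f:
        return parse_vmstat(f.read())
```

Taking one read_vmstat() snapshot before a stress-test iteration and one after, then passing both to counter_delta(), shows whether any compaction or migration happened during that iteration.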

As an experiment, we turned compaction off (CONFIG_COMPACTION=n) and that FIXED the issue. Our stress test (firmware upgrade) would usually reproduce a core dump every 20 iterations or so, but with compaction disabled we ran 1500+ iterations without any issues. This, of course, is not recommended ... (Avoidance Strategy 1)
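To put the 1500 clean iterations in perspective, here is a rough back-of-the-envelope check. It assumes (and this is only an assumption) that iterations are independent and that each one crashes with probability 1/20, matching the "every 20 iterations or so" figure above.

```python
def prob_no_crash(per_iteration_rate, iterations):
    """Probability of seeing zero crashes in `iterations` independent runs."""
    return (1.0 - per_iteration_rate) ** iterations


# Assumed rate: one core dump per 20 iterations, as observed with compaction on.
p = prob_no_crash(1 / 20, 1500)
print(f"P(no crash in 1500 iterations by chance) = {p:.1e}")
```

Under those assumptions the probability is vanishingly small (far below 1e-30), so the clean run is very unlikely to be luck.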

We started investigating why this was never an issue in 4.9.11 (our previous kernel). We noticed two areas that had changed.
* New in Kernel 5.15: "proactive compaction"
* New in Kernel 4.20: "watermark_boost_factor". This feature seemed to always provoke a huge compaction step of order 13 (pageblock_order) on our architecture.
                mm/vmscan.c, balance_pgdat(), line 4065: wakeup_kcompactd(pgdat, pageblock_order, highest_zoneidx);
                commit 1c30844d2dfe272d58c8fc000960b835d13aa2ac
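To show why an order-13 request counts as "huge" on this board, here is the arithmetic as a small sketch. The 4 KiB page size is an assumption (the usual ARMv7 base page size), and the helper name is made up.

```python
def order_to_bytes(order, page_size=4096):
    """Size in bytes of one physically contiguous block of the given order."""
    return (1 << order) * page_size


size = order_to_bytes(13)
print(size // (1024 * 1024), "MiB")  # prints "32 MiB"
```

With 1 GB of main memory, a single order-13 block is therefore 1/32 of all RAM.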

We were able to prove that setting compaction_proactiveness=0 and watermark_boost_factor=0 fixes the issue for us as well (Avoidance Strategy 2).
Neither of these features existed in 4.9.11! This explains why compaction was never an issue before (even though it was enabled in 4.9.11).
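For completeness, a minimal sketch of how Avoidance Strategy 2 can be applied at runtime by writing both tunables under /proc/sys/vm. The function name and the vm_dir parameter are illustrative only; writing to the real files requires root.

```python
import os

# Values from the report: both features switched off.
SETTINGS = {
    "compaction_proactiveness": "0",
    "watermark_boost_factor": "0",
}


def apply_vm_settings(settings=SETTINGS, vm_dir="/proc/sys/vm"):
    """Write each value to its sysctl file; return the names that could not be set."""
    failed = []
    for name, value in settings.items():
        try:
            with open(os.path.join(vm_dir, name), "w") as f:
                f.write(value + "\n")
        except OSError:
            # Missing tunable (older kernel) or insufficient privileges.
            failed.append(name)
    return failed
```

Calling apply_vm_settings() as root should return an empty list on a kernel that has both tunables; the settings do not persist across reboots.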

We also tried to root cause the migration process:
* The dying process always ran in parallel with kcompactd (sometimes on the same core (context switch), but most often on the other core).
* The dying process was always executing code pages in glibc which had been migrated only a fraction of a second and a few migration steps earlier.

Locking glibc memory (via mlockall()) and forbidding migration of locked pages fixes the issue as well (Avoidance Strategy 3)
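As an illustration of the locking half of Avoidance Strategy 3, here is a minimal sketch that calls mlockall() from a process via ctypes. The flag values are the Linux ones for ARM and x86 (an assumption for other architectures), and the function name is made up. How migration of locked pages is forbidden is a separate step not shown here; the vm.compact_unevictable_allowed sysctl is one existing knob for that, but treating it as the method used is an assumption.

```python
import ctypes

# Linux flag values on ARM and x86; some other architectures use different ones.
MCL_CURRENT = 1  # lock all pages currently mapped (including shared libraries)
MCL_FUTURE = 2   # also lock pages mapped later


def lock_all_memory():
    """Call mlockall(MCL_CURRENT | MCL_FUTURE); return 0 on success, else errno."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        return ctypes.get_errno()
    return 0
```

A non-zero return is typically ENOMEM or EPERM, meaning RLIMIT_MEMLOCK is too low or the process lacks CAP_IPC_LOCK.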

Additional experiments:
* Tried to disable core 1 temporarily (cpu_remove/cpu_add) and pin kcompactd0 to boot core 0. This is unfortunately not viable for us, since we have real-time processes running with tight scheduling constraints (corosync). Letting kcompactd0 run exclusively for many hundreds of milliseconds is not possible.
* Tried to invalidate the cache page/TLB very explicitly - I noticed that for our architecture update_mmu_cache is a no-op. Added flush_cache_page() ... flush_tlb_page() for each "remove_migration_pte" step. This did not help, so it is possibly not a cache coherency issue (that was my pet theory, based on https://gitlab.eclipse.org/eclipse/oniro-core/linux/-/commit/4774a369518091f46435e0539de6a45bf0681c74).
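The pinning part of the first experiment can be sketched from user space roughly as follows. This is only an illustration, not the exact procedure we used: the helper names are made up, changing the affinity of a kernel thread needs root, and taking core 1 offline is not shown.

```python
import os


def find_pids_by_comm(name, proc_root="/proc"):
    """Return the PIDs whose /proc/<pid>/comm equals `name`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # task exited in the meantime or is not readable
    return sorted(pids)


def pin_to_cpu(pid, cpu):
    """Restrict `pid` to a single CPU; raises OSError without sufficient privileges."""
    os.sched_setaffinity(pid, {cpu})
```

For example, calling pin_to_cpu(pid, 0) for each pid returned by find_pids_by_comm("kcompactd0") restricts the compaction thread to the boot core.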

Any help, reply or tip would be greatly appreciated!

Christoph Schmitz
Firmware Engineer
Hewlett Packard Enterprise