On Wed, May 24, 2023 at 6:01 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Wed, May 17, 2023 at 11:11:51AM -0700, Nick Desaulniers wrote:
> > On Wed, May 17, 2023 at 8:21 AM Naresh Kamboju
> > <naresh.kamboju@xxxxxxxxxx> wrote:
> > >
> > > Linux next-20230517 build with clang nightly for i386 boot fails intermittently.
> >
> > Keyword: intermittently. That will make tracking this down fun.
> >
> > Our CI also hit a boot failure on tip/master with the same splat:
> > https://github.com/ClangBuiltLinux/continuous-integration2/actions/runs/4998374271/jobs/8957285746
> > Though the CI pulled down a SHA
> > 0932447780e1f9a43bf68ef7fe3d9b41b46d58fc
> > which looks weird on
> > https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=0932447780e1f9a43bf68ef7fe3d9b41b46d58fc
> > >> Notice: this object is not reachable from any branch.
>
> Github isn't willing to show me content unless I log in or somesuch
> nonsense.

Ah, sorry about that.
https://paste.debian.net/1281050/
should be the log of ours.
https://storage.tuxsuite.com/public/clangbuiltlinux/continuous-integration2/builds/2QEtkwi60Mn3NLX4U0sDCAH0qqp/bzImage
is the corresponding build artifact.

There's ongoing discussion in #x86 on LinuxNet. I suspect that a few
of Naresh's recent reports are all perhaps one single issue.

Arnd mentioned
https://lore.kernel.org/all/CA+G9fYvVZ9WF-2zfrYeo3xnWNra0QGxLzei+b4yANZwEvr5CYw@xxxxxxxxxxxxxx/
which looks similar but is with GCC.

Either way, we're seeing this in mainline.

>
> > That this failed in -next and -tip in the same way makes me wonder if
> > something affecting this is coming in via -tip? Maybe the splat looks
> > familiar to x86 folks?
> >
> > I haven't been able to reproduce locally when my machine is relatively
> > load-less. If I do a kernel build in the background, I was able to
> > get QEMU to hang, but without any splat. That was using tip/master @
> > f81d8f759e7f.
> >
> > Naresh, when you say "intermittent" do you have any data on the
> > relative frequency of this boot failure? (Also, please make sure to
> > use llvm@xxxxxxxxxxxxxxx in the future; we moved mailing lists years
> > ago).
> >
> > Looks like our CI report linked above has an additional splat though
> > via apply_alternatives and optimize_nops.
> >
> > >> [ 0.166742] Code: Unable to access opcode bytes at 0x36.
> >
> > Peter, that smells like perhaps either:
> > commit b6c881b248ef ("x86/alternative: Complicate optimize_nops() some more")
> > commit 6c480f222128 ("x86/alternative: Rewrite optimize_nops() some")
>
> So I did find me a 'funny' there, but nothing that explains boot fail.
>
> It would think that 'PAUSE' is a 2 byte NOP and replace it with NOP2;
> which is not quite the same thing. The below seems to cure that.
>
> Let me continue poking at things...
>
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 93aa95afd005..bb0a7b03e52f 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -159,9 +160,12 @@ void text_poke_early(void *addr, const void *opcode, size_t len);
>   */
>  static bool insn_is_nop(struct insn *insn)
>  {
> -	if (insn->opcode.bytes[0] == 0x90)
> +	/* Anything NOP, but not REP NOP. */
> +	if (insn->opcode.bytes[0] == 0x90 &&
> +	    (!insn->prefixes.nbytes || insn->prefixes.bytes[0] != 0xF3))
>  		return true;
>
> +	/* NOPL */
>  	if (insn->opcode.bytes[0] == 0x0F && insn->opcode.bytes[1] == 0x1F)
>  		return true;
>

-- 
Thanks,
~Nick Desaulniers
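
For anyone following the quoted diff: PAUSE is encoded as F3 90, i.e. a REP
prefix in front of the plain NOP opcode 0x90, which is why an opcode-only
check takes it for a NOP. The snippet below is a minimal userspace sketch of
that difference, not the kernel code. The demo_field/demo_insn structs and
the is_nop_old()/is_nop_new() names are hypothetical stand-ins that model
only the fields the diff touches, and the trailing "return false" is an
assumption, since the diff does not show the end of insn_is_nop().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the decoder's instruction record;
 * only the fields used by the quoted diff are modelled here. */
struct demo_field {
	uint8_t bytes[4];
	uint8_t nbytes;
};

struct demo_insn {
	struct demo_field prefixes;
	struct demo_field opcode;
};

/* Check as it stood before the diff: any 0x90 opcode counts as a NOP. */
static bool is_nop_old(const struct demo_insn *insn)
{
	if (insn->opcode.bytes[0] == 0x90)
		return true;

	/* NOPL */
	if (insn->opcode.bytes[0] == 0x0F && insn->opcode.bytes[1] == 0x1F)
		return true;

	return false;	/* assumed tail, not shown in the diff */
}

/* Check as proposed in the diff: 0x90 with a leading F3 prefix (PAUSE)
 * is no longer treated as a NOP. */
static bool is_nop_new(const struct demo_insn *insn)
{
	/* Anything NOP, but not REP NOP. */
	if (insn->opcode.bytes[0] == 0x90 &&
	    (!insn->prefixes.nbytes || insn->prefixes.bytes[0] != 0xF3))
		return true;

	/* NOPL */
	if (insn->opcode.bytes[0] == 0x0F && insn->opcode.bytes[1] == 0x1F)
		return true;

	return false;	/* assumed tail, not shown in the diff */
}

/* Exercises both checks on NOP (90), PAUSE (F3 90) and NOPL (0F 1F). */
void demo_self_check(void)
{
	struct demo_insn nop   = { .opcode = { { 0x90 }, 1 } };
	struct demo_insn pause = { .prefixes = { { 0xF3 }, 1 },
				   .opcode = { { 0x90 }, 1 } };
	struct demo_insn nopl  = { .opcode = { { 0x0F, 0x1F }, 2 } };

	assert(is_nop_old(&nop) && is_nop_new(&nop));
	assert(is_nop_old(&pause));	/* old check misclassifies PAUSE */
	assert(!is_nop_new(&pause));	/* new check leaves PAUSE alone */
	assert(is_nop_old(&nopl) && is_nop_new(&nopl));
}
```

Calling demo_self_check() from a small test program runs all three cases;
the only instruction on which the two checks disagree is PAUSE.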