On Mon, Oct 23, 2023 at 02:15:01PM +0200, Jan Kara wrote:
> On Mon 23-10-23 14:45:05, Andy Shevchenko wrote:
> > On Sat, Oct 21, 2023 at 04:36:19PM -0700, Kees Cook wrote:
> > > On October 20, 2023 1:36:36 PM PDT, andy.shevchenko@xxxxxxxxx wrote:
> > > >That said, if you or anyone has ideas how to debug further, I'm all ears!
> > >
> > > I don't think this has been tried yet:
> > >
> > > When I've had these kind of hard-to-find glitches I've used manual
> > > built-binary bisection. Assuming you have a source tree that works when built
> > > with Clang and not with GCC:
> > > - build the tree with Clang with, say, O=build-clang
> > > - build the tree with GCC, O=build-gcc
> > > - make a new tree for testing: cp -a build-clang build-test
> > > - pick a suspect .o file (or files) to copy from build-gcc into build-test
> > > - perform a relink: "make O=build-test" should DTRT since the copied-in .o
> > >   files should be newer than the .a and other targets
> > > - test for failure, repeat
> > >
> > > Once you've isolated it to (hopefully) a single .o file, then comes the
> > > byte-by-byte analysis or something similar...
> > >
> > > I hope that helps! These kinds of bugs are super frustrating.
> >
> > I'm sorry, but I can't see how this is not an error prone approach.
> > If it's a timing issue then the arbitrary object change may help and it doesn't
> > prove anything. As earlier I tried to comment out the error message, and it
> > worked with GCC as well. The difference is so little (according to Linus) that
> > it may not be suspectible. Maybe I am missing the point...
>
> Given how reliably you can hit the problem with some kernels while you
> cannot hit them with others (only slightly different in a code that doesn't
> even get executed on your system) I suspect this is really more a code
> placement issue than a timing issue. Like if during the linking phase of
> vmlinux some code ends up at some position, the kernel fails, otherwise it
> boots fine. Not sure how to debug such a thing though. Maybe some playing
> with the linker and the order of object files linked could reveal something
> but I'm just guessing.

Right -- in theory there will be some minimum subset of "from GCC" objects that
when used together in the otherwise "known good" build will trip the failure.

-- 
Kees Cook
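
For illustration only, the "minimum subset" search described above could be
automated roughly as follows. This is a sketch, not something posted in the
thread: the function name minimize_failing_set and the fails() callback are
hypothetical. In real use fails() would have to copy the given .o files from
build-gcc into a fresh copy of build-clang, relink, boot the result and report
whether the failure shows up; here it is only simulated. The sketch also
assumes the failure is deterministic and that the all-Clang build is good.

```python
from typing import Callable, List, Sequence


def minimize_failing_set(
    objects: Sequence[str],
    fails: Callable[[List[str]], bool],
) -> List[str]:
    """Shrink `objects` to a subset that still fails.

    The result is 1-minimal: dropping any single remaining object makes the
    failure disappear. `fails(subset)` is a hypothetical callback that tests
    a build in which exactly `subset` comes from the GCC tree.
    """
    current = list(objects)
    if not fails(current):
        raise ValueError("the full set of GCC objects does not reproduce the failure")

    chunk = max(len(current) // 2, 1)
    while True:
        removed_any = False
        i = 0
        while i < len(current):
            # Try the build without the chunk starting at index i.
            candidate = current[:i] + current[i + chunk:]
            # An empty candidate is the all-Clang build, assumed good.
            if candidate and fails(candidate):
                current = candidate      # chunk was not needed, drop it
                removed_any = True
            else:
                i += chunk               # chunk is needed, keep it
        if chunk == 1 and not removed_any:
            return current
        chunk = max(chunk // 2, 1)


if __name__ == "__main__":
    # Simulated failure: it only triggers when both c.o and f.o come from GCC.
    all_objects = [f"{name}.o" for name in "abcdefgh"]
    culprits = {"c.o", "f.o"}
    print(minimize_failing_set(all_objects, lambda s: culprits <= set(s)))
```

Note that each fails() call stands for a full relink and boot test, so the
number of calls matters; removing objects in halving chunks first keeps that
count well below testing every object one by one.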