Re: next/master boot bisection: next-20190215 on beaglebone-black

Dan Williams <dan.j.williams@xxxxxxxxx> · Fri, 1 Mar 2019 15:23:58 -0800

On Fri, Mar 1, 2019 at 1:05 PM Guillaume Tucker
<guillaume.tucker@xxxxxxxxxxxxx> wrote:
>
> On 01/03/2019 20:41, Andrew Morton wrote:
> > On Fri, 1 Mar 2019 09:25:24 +0100 Guillaume Tucker <guillaume.tucker@xxxxxxxxxxxxx> wrote:
> >
> >>>>> Michal had asked if the free space accounting fix up addressed this
> >>>>> boot regression? I was awaiting word on that.
> >>>>
> >>>> hm, does bot@xxxxxxxxxxxx actually read emails?  Let's try info@ as well..
> >>
> >> bot@xxxxxxxxxxxx is not a person, it's a send-only account for
> >> automated reports.  So no, it doesn't read emails.
> >>
> >> I guess the tricky point here is that the authors of the commits
> >> found by bisections may not always have the hardware needed to
> >> reproduce the problem.  So it needs to be dealt with on a
> >> case-by-case basis: sometimes they do have the hardware,
> >> sometimes someone else on the list or on CC does, and sometimes
> >> it's better for the people who have access to the test lab which
> >> ran the KernelCI test to deal with it.
> >>
> >> This case seems to fall into the last category.  As I have access
> >> to the Collabora lab, I can do some quick checks to confirm
> >> whether the proposed patch does fix the issue.  I hadn't realised
> >> that someone was waiting for this to happen, especially as the
> >> BeagleBone Black is a very common platform.  Sorry about that,
> >> I'll take a look today.
> >>
> >> It may be a nice feature to be able to give access to the
> >> KernelCI test infrastructure to anyone who wants to debug an
> >> issue reported by KernelCI or verify a fix, so they won't need to
> >> have the hardware locally.  Something to think about for the
> >> future.
> >
> > Thanks, that all sounds good.
> >
> >>>> Is it possible to determine whether this regression is still present in
> >>>> current linux-next?
> >>
> >> I'll try to re-apply the patch that caused the issue, then see if
> >> the suggested change fixes it.  As far as the current linux-next
> >> master branch is concerned, KernelCI boot tests are passing fine
> >> on that platform.
> >
> > They would, because I dropped
> > mm-shuffle-default-enable-all-shuffling.patch, so your tests presumably
> > now have shuffling disabled.
> >
> > Is it possible to add the below to linux-next and try again?
>
> I've actually already done that, and essentially the issue can
> still be reproduced by applying that patch.  See this branch:
>
>   https://gitlab.collabora.com/gtucker/linux/commits/next-20190301-beaglebone-black-debug
>
> next-20190301 boots fine but the head of that branch fails, using
> multi_v7_defconfig + SMP=n in both cases, with
> SHUFFLE_PAGE_ALLOCATOR=y enabled in the 2nd case as a result
> of the change in the default value.
>
> The change suggested by Michal Hocko on Feb 15th has now been
> applied in linux-next as part of this commit but, as
> explained above, it does not actually resolve the boot failure:
>
>   98cf198ee8ce mm: move buddy list manipulations into helpers
>
> I can send more details on Monday and do a bit of debugging to
> help narrow down the problem.  Please let me know if
> there's anything in particular that would seem to be worth
> trying.
>

Thanks for taking a look!

Some questions when you get a chance:

Is there an early-printk facility that can be turned on to see how far
we get in the boot?

Do any of the QEMU machine types [1] approximate this board? If so, I
might be able to debug this independently.

Were there any boot *successes* on ARM with shuffling enabled? That
might give clues about what's different about the specific memory setup
on the BeagleBone Black.

Thanks for the help!

[1]: https://wiki.qemu.org/Documentation/Platforms/ARM