On Thu, Jan 28, 2021 at 10:45:49AM +0800, Baoquan He wrote:
> On 01/27/21 at 08:26pm, Mike Rapoport wrote:
> > Hi Lukasz,
> >
> > On Wed, Jan 27, 2021 at 02:15:53PM +0100, Łukasz Majczak wrote:
> > > Hi Mike,
> > >
> > > I have started bisecting your patch and I have figured out that there
> > > might be something wrong with clamping - with these lines commented out
> > > it started to work.
> > > The full log (with logs from below patch) can be found here:
> > > https://gist.github.com/semihalf-majczak-lukasz/3cecbab0ddc59a6c3ce11ddc29645725
> > > it's fresh - I haven't analyzed it yet, just sharing with hope it will help.
> >
> > Thanks, that helps!
> >
> > The first page is never considered by the kernel as memory and so
> > arch_zone_lowest_possible_pfn[ZONE_DMA] is set to 0x1000. As a result,
> > init_unavailable_mem() skips pfn 0 and then __SetPageReserved(page) in
> > reserve_bootmem_region() panics because the struct page for pfn 0 remains
> > poisoned.
>
> It's a great finding and quick fix.

Unfortunately it's only a partial fix as it does not address the problem of
having pfn 0 outside any zone. The struct page for pfn 0 gets its ZONE_DMA
link in init_unavailable_mem(), but zones[ZONE_DMA]->zone_start_pfn is 1.

I'm looking now at how to fix this as well, hopefully I'll have a patch Real
Soon (tm) :)

> Previously I tested my cleanup patches based on Mike's commit
> 9ebeee59af4cdd4d ("mm: fix initialization of struct page for holes in
> memory layout") on a hardware system, didn't meet this crash. But this
> crash seems to be an always reproduced issue, wondering why I didn't
> reproduce it.

This crash is reproducible on systems that do not report pfn 0 as usable,
e.g. for the Chromebook Lukasz is using it is 'type 16':

[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] type 16
[    0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000009ffff] usable

And on my laptop and on a bunch of other systems I have it is usable:

[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009cfff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009d000-0x000000000009ffff] reserved

> >
> > Can you please try the below patch on top of v5.11-rc5?
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 783913e41f65..3ce9ef238dfc 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -7083,10 +7083,11 @@ void __init free_area_init_memoryless_node(int nid)
> >  static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn,
> >  					 int zone, int nid)
> >  {
> > -	unsigned long pfn, zone_spfn, zone_epfn;
> > +	unsigned long pfn, zone_spfn = 0, zone_epfn;
> >  	u64 pgcnt = 0;
> >
> > -	zone_spfn = arch_zone_lowest_possible_pfn[zone];
> > +	if (zone > 0)
> > +		zone_spfn = arch_zone_highest_possible_pfn[zone - 1];
> >  	zone_epfn = arch_zone_highest_possible_pfn[zone];
> >
> >  	spfn = clamp(spfn, zone_spfn, zone_epfn);
> >
> >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index eed54ce26ad1..9f4468c413a1 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -7093,9 +7093,11 @@ static u64 __init
> > > init_unavailable_range(unsigned long spfn, unsigned long epfn,
> > >  	zone_spfn = arch_zone_lowest_possible_pfn[zone];
> > >  	zone_epfn = arch_zone_highest_possible_pfn[zone];
> > >
> > > -	spfn = clamp(spfn, zone_spfn, zone_epfn);
> > > -	epfn = clamp(epfn, zone_spfn, zone_epfn);
> > > -
> > > +	//spfn = clamp(spfn, zone_spfn, zone_epfn);
> > > +	//epfn = clamp(epfn, zone_spfn, zone_epfn);
> > > +	pr_info("LMA DBG: zone_spfn: %llx, zone_epfn %llx\n",
> > > zone_spfn, zone_epfn);
> > > +	pr_info("LMA DBG: spfn: %llx, epfn %llx\n", spfn, epfn);
> > > +	pr_info("LMA DBG: clamp_spfn: %llx, clamp_epfn %llx\n",
> > > clamp(spfn, zone_spfn, zone_epfn), clamp(epfn, zone_spfn, zone_epfn));
> > >  	for (pfn = spfn; pfn < epfn; pfn++) {
> > >  		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
> > >  			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
> > >
> > > Best regards,
> > > Lukasz
> > >
> > >
> > > On Wed, 27 Jan 2021 at 13:15, Łukasz Majczak <lma@xxxxxxxxxxxx> wrote:
> > > >
> > > > Unfortunately nothing :( my current kernel command line contains:
> > > > console=ttyS0,115200n8 debug earlyprintk=serial loglevel=7
> > > >
> > > > I was thinking about using earlycon, but it seems to be blocked.
> > > > (I think the lack of earlycon might be related to Chromebook HW
> > > > security design. There is an EC controller which is a part of AP ->
> > > > serial chain as kernel messages are considered sensitive from a
> > > > security standpoint.)
> > > >
> > > > Best regards,
> > > > Lukasz
> > > >
> > > > On Wed, 27 Jan 2021 at 12:19, Mike Rapoport <rppt@xxxxxxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, Jan 27, 2021 at 11:08:17AM +0100, Łukasz Majczak wrote:
> > > > > > Hi Mike,
> > > > > >
> > > > > > Actually I have a serial console attached (via servo device), but
> > > > > > there is no output :( and also the reboot/crash is very fast/immediate
> > > > > > after power on.
> > > > >
> > > > > If you boot with earlyprintk=serial are there any messages?
> > > > >
> > > > > > Best regards
> > > > > > Lukasz
> > > > > >
> > > > > > On Wed, 27 Jan 2021 at 11:05, Mike Rapoport <rppt@xxxxxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > Hi Lukasz,
> > > > > > >
> > > > > > > On Wed, Jan 27, 2021 at 10:22:29AM +0100, Łukasz Majczak wrote:
> > > > > > > > Crash after mm: fix initialization of struct page for holes in memory layout
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > > I was trying to run v5.11-rc5 on my Samsung Chromebook Pro (Caroline),
> > > > > > > > but I've noticed it has crashed - unfortunately it seems to happen at
> > > > > > > > a very early stage - No output to the console nor to the screen, so I
> > > > > > > > have started a bisect (between 5.11-rc4 - which works just fine - and
> > > > > > > > 5.11-rc5),
> > > > > > > > bisect results point to:
> > > > > > > >
> > > > > > > > d3921cb8be29 mm: fix initialization of struct page for holes in memory layout
> > > > > > > >
> > > > > > > > Reproduction is just to build and load the kernel.
> > > > > > > >
> > > > > > > > If it will help anyhow I am attaching:
> > > > > > > > - /proc/cpuinfo (from healthy system):
> > > > > > > > https://gist.github.com/semihalf-majczak-lukasz/3517867bf39f07377c1a785b64a97066
> > > > > > > > - my .config file (for a broken system):
> > > > > > > > https://gist.github.com/semihalf-majczak-lukasz/584b329f1bf3e43b53efe8e18b5da33c
> > > > > > > >
> > > > > > > > If there is anything I could add/do/test to help fix this please let me know.
> > > > > > >
> > > > > > > Chris Wilson also reported boot failures on several Chromebooks:
> > > > > > >
> > > > > > > https://lore.kernel.org/lkml/161160687463.28991.354987542182281928@xxxxxxxxxxxxxxxxxxxxx
> > > > > > >
> > > > > > > I presume serial console is not an option, so if you could boot with
> > > > > > > earlyprintk=vga and see if there is anything useful printed on the screen
> > > > > > > it would be really helpful.
> > > > > > >
> > > > > > > > Best regards
> > > > > > > > Lukasz
> > > > > > >
> > > > > > > --
> > > > > > > Sincerely yours,
> > > > > > > Mike.
> > > > >
> > > > > --
> > > > > Sincerely yours,
> > > > > Mike.
> >
> > --
> > Sincerely yours,
> > Mike.
> >
>

--
Sincerely yours,
Mike.
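
For anyone following the thread, the effect of the clamping on pfn 0 can be
reproduced outside the kernel. What follows is only a user-space sketch, not
the kernel code: clamp_ul() and count_initialized() are hypothetical helpers
standing in for clamp() and the loop in init_unavailable_range(), and the zone
bounds used in the note after the code (start pfn 1, end pfn 4096) are
illustrative assumptions matching a ZONE_DMA that starts after the first page.

```c
#include <assert.h>

/* Hypothetical user-space stand-in for the kernel's clamp() macro. */
static unsigned long clamp_ul(unsigned long v, unsigned long lo,
			      unsigned long hi)
{
	assert(lo <= hi);
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/*
 * Hypothetical model of the loop in init_unavailable_range(): count the
 * pfns in [spfn, epfn) that are still visited after both ends have been
 * clamped to the zone boundaries [zone_spfn, zone_epfn].
 */
static unsigned long count_initialized(unsigned long spfn, unsigned long epfn,
				       unsigned long zone_spfn,
				       unsigned long zone_epfn)
{
	unsigned long pfn, pgcnt = 0;

	spfn = clamp_ul(spfn, zone_spfn, zone_epfn);
	epfn = clamp_ul(epfn, zone_spfn, zone_epfn);

	for (pfn = spfn; pfn < epfn; pfn++)
		pgcnt++;	/* the kernel would initialize the struct page here */

	return pgcnt;
}
```

With these assumed bounds, count_initialized(0, 1, 1, 4096) returns 0: the
hole covering pfn 0 is clamped to an empty range, so its struct page is never
initialized. With the lower bound of the first zone forced to 0, as in the
patch quoted above, count_initialized(0, 1, 0, 4096) returns 1.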