Hello Ard,

Thanks for the reply; please see my comments inline.

On 2019/6/10 21:16, Ard Biesheuvel wrote:
> On Sat, 8 Jun 2019 at 06:22, Hanjun Guo <guohanjun@xxxxxxxxxx> wrote:
>>
>> Hi Ard, Will,
>>
>> This week we were debugging an issue of excessive time spent in mem_init(),
>> which led us to a solution similar to Jia He's earlier series, so I would
>> like to bring this thread back. Please see my detailed test results below.
>>
>> On 2018/9/7 22:44, Will Deacon wrote:
>>> On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
>>>> On 22 August 2018 at 05:07, Jia He <hejianet@xxxxxxxxx> wrote:
>>>>> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
>>>>> where possible") optimized the loop in memmap_init_zone(), but it could
>>>>> cause a panic, so Daniel Vacek later reverted it.
>>>>>
>>>>> However, as suggested by Daniel Vacek, it is fine to use memblock to skip
>>>>> gaps and find the next valid frame when CONFIG_HAVE_ARCH_PFN_VALID is set.
>>>>>
>>>>> More from what Daniel said:
>>>>> "On arm and arm64, memblock is used by default. But generic version of
>>>>> pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
>>>>> not always return the next valid one but skips more resulting in some
>>>>> valid frames to be skipped (as if they were invalid). And that's why
>>>>> kernel was eventually crashing on some !arm machines."
>>>>>
>>>>> About the performance consideration:
>>>>> As said by James in b92df1de5,
>>>>> "I have tested this patch on a virtual model of a Samurai CPU with a
>>>>> sparse memory map. The kernel boot time drops from 109 to 62 seconds."
>>>>> Thus it would be better to retain memblock_next_valid_pfn() on arm/arm64.
>>>>>
>>>>> Besides retaining memblock_next_valid_pfn(), there is still some room
>>>>> for improvement. After this set, I can see the time overhead of
>>>>> memmap_init() reduced from 27956us to 13537us on my ARMv8-A server
>>>>> (QDF2400 with 96 GB of memory, 64 KB page size). I believe ARM servers
>>>>> will benefit even more once memory reaches the TB range.
>>>>>
>>>>
>>>> OK, so we can summarize the benefits of this series as follows:
>>>> - boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
>>>> - boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
>>>> *milliseconds*
>>>>
>>>> Google was not very helpful in figuring out what a Samurai CPU is and
>>>> why we should care about the boot time of Linux running on a virtual
>>>> model of it, and the 15 ms speedup is not that compelling either.
>>
>> Testing this patch set on a Kunpeng 920 based ARM64 server with 384 GB of
>> memory in total, we measured the following:
>>
>>               without this patch set    with this patch set
>> mem_init()    13310ms                   1415ms
>>
>> So we get roughly a 9x speedup on this machine, which is very impressive.
>>
>
> Yes, this is impressive. But does it matter in the grand scheme of
> things?

It matters for this machine: it is a storage system with a watchdog, and the
time spent here is long enough to trigger the watchdog.

> How much time does this system take to arrive at this point
> from power on?

Sorry, I don't have that data. The arch timer is not initialized at this
point, so I don't see timestamps yet; instead I read the cycle count from the
arch timer before and after the expensive function to work out how much time
it took.
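
The instrumentation was along the following lines (a minimal sketch, not the
exact debug code; it assumes arm64, where get_cycles() reads the raw counter
and arch_timer_get_cntfrq() returns its frequency, both usable this early):

#include <linux/init.h>
#include <linux/math64.h>
#include <linux/mm.h>
#include <linux/printk.h>
#include <linux/timex.h>
#include <asm/arch_timer.h>

/* Sketch: time mem_init() with the raw arch counter. */
static void __init timed_mem_init(void)
{
	u64 freq = arch_timer_get_cntfrq();	/* CNTFRQ_EL0, in Hz */
	cycles_t start = get_cycles();

	mem_init();

	pr_info("mem_init() took %llu us\n",
		div64_u64((get_cycles() - start) * 1000000ULL, freq));
}

The conversion is just cycles * 10^6 / CNTFRQ, which is how the numbers above
were obtained.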

>
>> The time consumed is related to the memory DIMM size and to where the DIMMs
>> are placed in the slots. In the case above we are using 16 GB memory DIMMs.
>> We also tested 1 TB of memory using 64 GB DIMMs on another ARM64 machine,
>> and the time consumed dropped from 20s to 2s (I think the difference between
>> machines is related to the firmware implementation).
>>
>
> I agree that this optimization looks good in isolation, but the fact
> that you spotted a bug justifies my skepticism at the time. On the
> other hand, now that we have several independent reports (from you,
> but also from the Renesas folks) that the speedup is worthwhile for
> real world use cases, I think it does make sense to revisit it.

Thank you very much for taking care of this :)

>
> So what I would like to see is the patch set being proposed again,
> with the new data points added for documentation. Also, the commit
> logs need to be crystal clear about how the meaning of PFN validity
> differs between ARM and other architectures, and why the assumptions
> that the optimization is based on are guaranteed to hold.

I think Jia He no longer works for HXT; if you don't mind, I can repost this
patch set with Jia He's authorship unchanged.

Thanks
Hanjun
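
For context on the PFN-validity difference mentioned above: with generic
sparsemem, pfn_valid() only checks that the pfn's mem_section has a memmap,
so it is section-granular, while arm64 selects CONFIG_HAVE_ARCH_PFN_VALID and
checks the memblock map at page granularity. Roughly (paraphrased, not
verbatim kernel code of any particular release):

/*
 * Generic sparsemem version (include/linux/mmzone.h): section-granular,
 * so a pfn inside a hole is still "valid" if its section has a memmap.
 */
static inline int pfn_valid(unsigned long pfn)
{
	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
		return 0;
	return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
}

/*
 * arm64 version (arch/arm64/mm/init.c): page-granular, only pfns that
 * memblock maps as memory are valid.
 */
int pfn_valid(unsigned long pfn)
{
	return memblock_is_map_memory(__pfn_to_phys(pfn));
}

This is why skipping holes via memblock matches arm64's notion of a valid
pfn, but can skip pfns that the generic definition still considers valid.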