On Thu, Nov 26, 2020 at 09:44:26PM +0200, Mike Rapoport wrote: > TBH, the whole interaction between e820 and memblock keeps me puzzled > and I can only make educated guesses why some ranges here are > memblock_reserve()'d and some memblock_add()ed. The mixed usage in that interaction between memblock.reserve and memblock.memory where sometime it's used to reserve overlapping memblock.memory ranges (clearly ok), and sometimes is used on ranges with no overlap (not clear even why it's being called), is what makes the current code messy. We're basically passing down the exact same information (inverted), through two different APIs, if there is no overlap. > I think what should be there is that e820 entries that are essentially > RAM, used by BIOS or not, should be listed in memblock.memory. Then > using memblock_reserve() for parts that BIOS claimed for itself would > have the same semantics as for memory allocated by kernel. > > I.e. if there is a DIMM from 0 to, say 512M, memblock.memory will have a > range [0, 512M]. And areas such as 0x000-0xfff, 0x9d000-0x9ffff will be > in memblock.reserved. > > Than in page_alloc.c we'll know that we have a physical memory bank from > 0 to 512M but there are some ranges that we cannot use. > > I suggested it back then when the issue with compaction was reported at > the first time, but Baoquan mentioned that there are systems that cannot > even tolerate having BIOS reserved areas in the page tables and I didn't > continue to pursue this. That explains why we can't add the e820 non-RAM regions to memblock_add, what's not clear is why it should be required to call memblock_reserve on a region that has no overlap with any memblock_add. Instead of the patch that adds a walk to the memblock.reserve and that requires adding even more memblock_reserve to e820__memblock_setup for type 20, we can add a walk for the memblock.memory holes and then we can remove the memblock_reserve for E820_TYPE_SOFT_RESERVED too. Thanks, Andrea