On Thu, Mar 15, 2018 at 12:50 PM, Michal Hocko <mhocko@xxxxxxxxxx> wrote: > On Thu 15-03-18 02:30:41, Daniel Vacek wrote: >> On Wed, Mar 14, 2018 at 3:17 PM, Michal Hocko <mhocko@xxxxxxxxxx> wrote: >> > On Tue 13-03-18 23:42:40, Daniel Vacek wrote: >> >> On some architectures (reported on arm64) commit 864b75f9d6b01 ("mm/page_alloc: fix memmap_init_zone pageblock alignment") >> >> causes a boot hang. This patch fixes the hang making sure the alignment >> >> never steps back. >> > >> > I am sorry to be complaining again, but the code is so obscure that I >> >> No worries, I'm glad for any review. Which code exactly you do find >> obscure? This patch or my former fix or the original commit >> introducing memblock_next_valid_pfn()? Coz I'd agree the original >> commit looks pretty obscure... > > As mentioned in the other email, the whole going back and forth in the > same loop is just too ugly to live. It's not really supposed to go back, but I guess you understand. >> > would _really_ appreciate some more information about what is going >> > on here. memblock_next_valid_pfn will most likely return a pfn within >> > the same memblock and the alignment will move it before the old pfn >> > which is not valid - so the block has some holes. Is that correct? >> >> I do not understand what you mean by 'pfn within the same memblock'? > > Sorry, I should have said in the same pageblock > >> And by 'the block has some holes'? > > memblock_next_valid_pfn clearly returns pfn which is within a pageblock > and that is why we do not initialize pages in the begining of the block > while move_freepages_block does really expect the full pageblock to be > initialized properly. That is the fundamental problem, right? Yes, that's correct. >> memblock has types 'memory' (as usable memory) and 'reserved' (for >> unusable mem), if I understand correctly. > > We might not have struct pages for invalid pfns. That really depends on > the memory mode. Sure sparse mem model will usually allocate struct > pages for whole memory sections but that is not universally true and > adding such a suble assumption is simply wrong. This is gray area for me. But if I understand correctly this assumption comes from the code. It was already there and got broken hence I was trying to keep it. If anything needs redesigning I'm all for it. But I was just calming the fire here. I only didn't test on arm, which seems to be the only one different. > I suspect you are making strong assumptions based on a very specific > implementation which might be not true in general. That was the feeling > I've had since the patch was proposed for the first time. This is such a > cluttered area that I am not really sure myself, thoug. I understand. And again this is likely correct. I'll be glad for any assistance here. My limited knowledge is the primary cause for lack of relevant details I guess. What I checked looks like pfn_valid is a generic function used by all arches but arm, which seems to be the only one to implement CONFIG_HAVE_ARCH_PFN_VALID if I didn't miss anything. So if this config is enabled on arm, it uses it's own version of pfn_valid(). If not, I'd expect all arches behave the same. That's where my assumption comes from. > -- > Michal Hocko > SUSE Labs