On Thu, Jan 7, 2021 at 1:16 AM David Hildenbrand <david@xxxxxxxxxx> wrote: > > [...] > > >>> Well, I would love to have no surprises either. So far there was not > >>> actual argument why the pmem reserved space cannot be fully initialized. > >> > >> Yes, I'm still hoping Dan can clarify that. > > > > Complexity and effective utility (once pfn_to_online_page() is fixed) > > are the roadblocks in my mind. The altmap is there to allow for PMEM > > capacity to be used as memmap space, so there would need to be code to > > break that circular dependency and allocate a memmap for the metadata > > space from DRAM and the rest of the memmap space for the data capacity > > from pmem itself. That memmap-for-pmem-metadata will still represent > > offline pages. So once pfn_to_online_page() is fixed, what pfn-walker > > is going to be doing pfn_to_page() on PMEM metadata? Secondly, there > > Assume I do > > pgmap = get_dev_pagemap(pfn, NULL); > if (pgmap) > return pfn_to_page(pfn); > return NULL; > > on a random pfn because I want to inspect ZONE_DEVICE PFNs. I keep getting hung up on the motivation to do random pfn inspection? The problems we have found to date have required different solutions. The KVM bug didn't use get_dev_pagemap() to inspect the pfn because it could rely on the fact that the page already had an elevated reference count. The get_user_pages() path only looks up ZONE_DEVICE pfns when it see {pte,pmd,pud}_devmap set in the page table entry. pfn walkers have been a problem, but with pfn_to_online_page() fixed what is the remaining motivation to inspect ZONE_DEVICE pfns? > IIUC, the memmap I get might usually be initialized, except we're > hitting a PFN in the reserved altmap space. Correct? The pagemap currently returns true for every pfn in the range including those in the altmap. > > Do we expect this handling to not be valid - i.e., initialization to be > of no utility? (to make it fly we would have to explicitly check for > PFNs in the altmap reserved space, which wouldn't be too problematic) > > > is a PMEM namespace mode called "raw" that eschews DAX and 'struct > > page' for pmem and just behaves like a memory-backed block device. The > > end result is still that pfn walkers need to discover if a PMEM pfn > > has a page, so I don't see what "sometimes there's an > > memmap-for-pmem-metadata" buys us? > > Right, but that's as easy as doing a pfn_valid() first. > > > Let me try to express what I (and I think Michal) mean: > > In pagemap_range(), we > > 1. move_pfn_range_to_zone()->memmap_init_zone(): initialize the memmap > of the PMEM device used as memmap itself ("self host", confusing). We > skip initializing the memmap for the the reserved altmap space. > > 2. memmap_init_zone_device(): initialize the memmap of everything > outside of the altmap space. > > IIUC, the memmap of the reserved altmap space remains uninitialized. > What speaks against just initializing that part in e.g., 2. or similarly > after 1.? > > > I'm planning on documenting the result of this discussion in the code, > so people don't have to scratch their head whenever stumbling over the > altmap reserved space. > > > > >> > >>> On the other hand making sure that pfn_to_online_page sounds like the > >>> right thing to do. And having an explicit check for zone device there in > >>> a slow path makes sense to me. > >> > >> As I said, I'd favor to simplify and just get rid of the special case, > >> instead of coming up with increasingly complex ways to deal with it. > >> pfn_to_online_page() used to be simple, essentially checking a single > >> flag was sufficient in most setups. > > > > I think the logic to throw away System RAM that might collide with > > PMEM and soft-reserved memory within a section is on the order of the > > same code complexity as the patch proposed here, no? Certainly the > > throw-away concept itself is easier to grasp, but I don't think that > > would be reflected in the code patch to achieve it... willing to be > > proved wrong with a patch. > > Well, at least it sounds easier to go over memblock holes and > align-up/down some relevant PFNs to section boundaries, ending up with > no affect to runtime performance later (e.g., pfn_to_online_page()). But > I agree that most probably the devil is in the detail - e.g., handling > different kind of holes (different flavors of "reserved") and syncing up > other data structures (e.g., resource tree). > > I don't have time to look into that right now, but might look into it in > the future. For now I'm fine with this approach, assuming we don't > discover other corner cases that turn it even more complex. I'm happy > that we finally talk about it and fix it! > > > -- > Thanks, > > David / dhildenb >