On 2021-10-19 at 10:20 a.m., Jason Gunthorpe wrote:
> On Mon, Oct 18, 2021 at 09:26:24PM -0700, Dan Williams wrote:
>> On Mon, Oct 18, 2021 at 4:31 PM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>>> On Fri, Oct 15, 2021 at 01:22:41AM +0100, Joao Martins wrote:
>>>
>>>> dev_pagemap_mapping_shift() does a lookup to figure out
>>>> which order the page table entry represents. is_zone_device_page()
>>>> is already used to gate usage of dev_pagemap_mapping_shift(). I think
>>>> this might be an artifact of the same issue as 3) in which PMDs/PUDs
>>>> are represented with base pages and hence you can't do what the rest
>>>> of the world does with:
>>> This code looks broken as written.
>>>
>>> vma_address() relies on certain properties that maybe DAX (maybe
>>> even only FSDAX?) sets on its ZONE_DEVICE pages, and
>>> dev_pagemap_mapping_shift() does not handle the -EFAULT return. It
>>> will crash if a memory failure hits any other kind of ZONE_DEVICE
>>> area.
>> That case is gated with a TODO in memory_failure_dev_pagemap(). I
>> never got any response to queries about what to do about memory
>> failure vs HMM.
> Unfortunately neither Logan nor Felix noticed that TODO conditional
> when adding new types..

You mean this?

        if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
                /*
                 * TODO: Handle HMM pages which may need coordination
                 * with device-side memory.
                 */
                goto unlock;
        }

Yeah, I never looked at that. Alex, we'll need to add
|| pgmap->type == MEMORY_DEVICE_COHERENT here.

Or should we change this into a test that looks for the pgmap->types
that are actually handled by memory_failure_dev_pagemap? E.g.

        if (pgmap->type != MEMORY_DEVICE_FS_DAX)
                goto unlock;

I think in case of a real HW error, our driver should be calling
memory_failure. But then a callback from here back into the driver
wouldn't make sense. For MADV_HWPOISON we may need a callback to the
driver, if we want the driver to treat it like an actual HW error and
retire the page.
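
To make the two options concrete, here is a minimal userspace sketch (not
kernel code). The enum values and struct layout are stand-ins chosen for
illustration, MEMORY_DEVICE_COHERENT is the proposed type mentioned above,
and skip_denylist()/skip_allowlist() are hypothetical helper names. Which
types actually belong in the allow-list is left open here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace stand-ins for the kernel types. The names mirror the thread;
 * the numeric values and the struct layout are illustrative only.
 */
enum memory_type {
	MEMORY_DEVICE_PRIVATE = 1,
	MEMORY_DEVICE_FS_DAX,
	MEMORY_DEVICE_GENERIC,
	MEMORY_DEVICE_PCI_P2PDMA,
	MEMORY_DEVICE_COHERENT,	/* proposed type, assumed for this sketch */
};

struct dev_pagemap {
	enum memory_type type;
};

/* Deny-list form: each unhandled type has to be listed by hand. */
bool skip_denylist(const struct dev_pagemap *pgmap)
{
	assert(pgmap != NULL);
	return pgmap->type == MEMORY_DEVICE_PRIVATE ||
	       pgmap->type == MEMORY_DEVICE_COHERENT;
}

/* Allow-list form: everything that is not explicitly handled bails out. */
bool skip_allowlist(const struct dev_pagemap *pgmap)
{
	assert(pgmap != NULL);
	return pgmap->type != MEMORY_DEVICE_FS_DAX;
}
```

With these stand-ins, a MEMORY_DEVICE_PCI_P2PDMA pagemap is skipped only by
the allow-list form. That is the practical difference: a newly added type
defaults to "not handled" instead of falling through to the DAX path.
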
>
> But maybe it is dead code anyhow as it already has this:
>
>         cookie = dax_lock_page(page);
>         if (!cookie)
>                 goto out;
>
> Right before? Doesn't that already always fail for anything that isn't
> a DAX?

I guess the check for the pgmap->type should come before this.

Regards,
  Felix

>
>>> I'm not sure the comment is correct anyhow:
>>>
>>>         /*
>>>          * Unmap the largest mapping to avoid breaking up
>>>          * device-dax mappings which are constant size. The
>>>          * actual size of the mapping being torn down is
>>>          * communicated in siginfo, see kill_proc()
>>>          */
>>>         unmap_mapping_range(page->mapping, start, size, 0);
>>>
>>> Because for non PageAnon unmap_mapping_range() does either
>>> zap_huge_pud(), __split_huge_pmd(), or zap_huge_pmd().
>>>
>>> Despite its name __split_huge_pmd() does not actually split, it will
>>> call __split_huge_pmd_locked:
>>>
>>>         } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
>>>                 goto out;
>>>         __split_huge_pmd_locked(vma, pmd, range.start, freeze);
>>>
>>> Which does
>>>         if (!vma_is_anonymous(vma)) {
>>>                 old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
>>>
>>> Which is a zap, not split.
>>>
>>> So I wonder if there is a reason to use anything other than 4k here
>>> for DAX?
>>>
>>>> tk->size_shift = page_shift(compound_head(p));
>>>>
>>>> ... as page_shift() would just return PAGE_SHIFT (as compound_order() is 0).
>>> And what would be so wrong with memory failure doing this as a 4k
>>> page?
>> device-dax does not support misaligned mappings. It makes hard
>> guarantees for applications that can not afford the page table
>> allocation overhead of sub-1GB mappings.
> memory-failure is the wrong layer to enforce this anyhow - if someday
> unmap_mapping_range() did learn to break up the 1GB pages then we'd
> want to put the condition to preserve device-dax mappings there, not
> way up in memory-failure.
>
> So we can just delete the detection of the page size and rely on the
> zap code to wipe out the entire level, not split it. Which is what we
> have today already.
>
> Jason
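
The range handed to unmap_mapping_range() in the hunk quoted above depends
only on the page offset and the chosen size shift. A minimal userspace
sketch of that arithmetic, assuming 4 KiB base pages and assuming that
start is the byte offset rounded down to the mapping size (struct
unmap_range and range_for_shift() are hypothetical names, not kernel API):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB base pages */

/* Hypothetical container for the (start, size) pair passed to the unmap. */
struct unmap_range {
	uint64_t start;	/* byte offset into the mapping, size-aligned */
	uint64_t size;	/* bytes torn down */
};

/*
 * Compute the range covering page offset pgoff for a mapping of
 * 1 << size_shift bytes: 12 for a PTE, 21 for a PMD, 30 for a PUD
 * on x86-64 with 4 KiB pages.
 */
struct unmap_range range_for_shift(uint64_t pgoff, unsigned int size_shift)
{
	assert(size_shift >= PAGE_SHIFT && size_shift < 64);

	uint64_t size = UINT64_C(1) << size_shift;
	struct unmap_range r = {
		.start = (pgoff << PAGE_SHIFT) & ~(size - 1),
		.size = size,
	};
	return r;
}
```

For page offset 0x201 this yields a 4 KiB range at 0x201000 with shift 12,
but the whole 2 MiB block at 0x200000 with shift 21. The second case is
the "largest mapping" behaviour the comment describes, and the first is
what passing a plain 4k size would request.
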