On Wed, Aug 28, 2024 at 7:41 AM Peter Xu <peterx@xxxxxxxxxx> wrote: > > On Tue, Aug 27, 2024 at 05:42:21PM -0700, Jiaqi Yan wrote: > > On Tue, Aug 27, 2024 at 3:57 PM Peter Xu <peterx@xxxxxxxxxx> wrote: > > > > > > On Tue, Aug 27, 2024 at 03:36:07PM -0700, Jiaqi Yan wrote: > > > > Hi Peter, > > > > > > Hi, Jiaqi, > > > > > > > I am curious if there is any work needed for unmap_mapping_range? If a > > > > driver hugely remap_pfn_range()ed at 1G granularity, can the driver > > > > unmap at PAGE_SIZE granularity? For example, when handling a PFN is > > > > > > Yes it can, but it'll invoke the split_huge_pud() which default routes to > > > removal of the whole pud right now (currently only covers either DAX > > > mappings or huge pfnmaps; it won't for anonymous if it comes, for example). > > > > > > In that case it'll rely on the driver providing proper fault() / > > > huge_fault() to refault things back with smaller sizes later when accessed > > > again. > > > > I see, so the driver needs to drive the recovery process, and code > > needs to be in the driver. > > > > But it seems to me the recovery process will be more or less the same > > to different drivers? In that case does it make sense that > > memory_failure do the common things for all drivers? > > > > Instead of removing the whole pud, can driver or memory_failure do > > something similar to non-struct-page-version of split_huge_page? So > > driver doesn't need to re-fault good pages back? > > I think we can, it's just that we don't yet have a valid use case. > > DAX is definitely fault-able. > > While for the new huge pfnmap, currently vfio is the only user, and vfio > only requires to either zap all or map all. In that case there's no real > need to ask for what you described yet. Meanwhile it's also faultable, so > if / when needed it should hopefully still do the work properly. > > I believe it's not usual requirement too for most of the rest drivers, as > most of them don't even support fault() afaiu. remap_pfn_range() can start > to use huge mappings, however I'd expect they're mostly not ready for > random tearing down of any MMIO mappings. > > It sounds doable to me though when there's a need of what you're > describing, but I don't think I know well on the use case yet. > > > > > > > > > > > > poisoned in the 1G mapping, it would be great if the mapping can be > > > > splitted to 2M mappings + 4k mappings, so only the single poisoned PFN > > > > is lost. (Pretty much like the past proposal* to use HGM** to improve > > > > hugetlb's memory failure handling). > > > > > > Note that we're only talking about MMIO mappings here, in which case the > > > PFN doesn't even have a struct page, so the whole poison idea shouldn't > > > apply, afaiu. > > > > Yes, there won't be any struct page. Ankit proposed this patchset* for > > handling poisoning. I wonder if someday the vfio-nvgrace-gpu-pci > > driver adopts your change via new remap_pfn_range (install PMD/PUD > > instead of PTE), and memory_failure_pfn still > > unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT, PAGE_SIZE, > > 0), can it somehow just work and no re-fault needed? > > > > * https://lore.kernel.org/lkml/20231123003513.24292-2-ankita@xxxxxxxxxx/#t > > I see now, interesting.. Thanks for the link. > > In that case of nvgpu usage, one way is to do as what you said; we can > enhance the pmd/pud split for pfnmap, but maybe that's an overkill. Yeah, just want a poke to see if splitting pmd/pud is some low-hanging fruit. > > I saw that the nvgpu will need a fault() anyway so as to detect poisoned > PFNs, then it's also feasible that in the new nvgrace_gpu_vfio_pci_fault() > when it supports huge pfnmaps it'll need to try to detect whether the whole > faulting range contains any poisoned PFNs, then provide FALLBACK if so > (rather than VM_FAULT_HWPOISON). > > E.g., when 4K of 2M is poisoned, we'll erase the 2M completely. When > access happens, as long as the accessed 4K is not on top of the poisoned > 4k, huge_fault() should still detect that there's 4k range poisoned, then > it'll not inject pmd but return FALLBACK, then in the fault() it'll see > the accessed 4k range is not poisoned, then install a pte. Thanks for illustrating the re-fault flow again. I think this should work well for drivers (having large MMIO size) that care about memory errors. We can put the pmd/pud split idea to backlog and see if it is needed in future. > > Thanks, > > -- > Peter Xu >