On 29.08.24 01:46, Jason Gunthorpe wrote:
On Wed, Aug 28, 2024 at 03:45:49PM -0400, Peter Xu wrote:
Meanwhile I'm actually not 100% sure pte_special is only needed in
gup-fast. See vm_normal_page() and for VM_PFNMAP when pte_special bit is
not defined:
} else {
unsigned long off;
off = (addr - vma->vm_start) >> PAGE_SHIFT;
if (pfn == vma->vm_pgoff + off) <------------------ [1]
return NULL;
if (!is_cow_mapping(vma->vm_flags))
return NULL;
}
I suspect things can go wrong when there's assumption on vm_pgoff [1]. At
least vfio-pci isn't storing vm_pgoff for the base PFN, so this check will
go wrong when pte_special is not supported on any arch but when vfio-pci is
present. I suspect more drivers can break it.
Fortunately, we did an excellent job at documenting vm_normal_page():
* There are 2 broad cases. Firstly, an architecture may define a pte_special()
* pte bit, in which case this function is trivial. Secondly, an architecture
* may not have a spare pte bit, which requires a more complicated scheme,
* described below.
*
* A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
* special mapping (even if there are underlying and valid "struct pages").
* COWed pages of a VM_PFNMAP are always normal.
*
* The way we recognize COWed pages within VM_PFNMAP mappings is through the
* rules set up by "remap_pfn_range()": the vma will have the VM_PFNMAP bit
* set, and the vm_pgoff will point to the first PFN mapped: thus every special
* mapping will always honor the rule
*
* pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT)
*
* And for normal mappings this is false.
*
remap_pfn_range_notrack() will currently handle that for us:
if (is_cow_mapping(vma->vm_flags)) {
if (addr != vma->vm_start || end != vma->vm_end)
return -EINVAL;
}
Even if [1] would succeed, the is_cow_mapping() check will return NULL and it will
all work as expected, even without pte_special().
Because VM_PFNMAP is easy: in a !COW mapping, everything is special.
I think that is a very important point.
IIRC this was done magically in one of the ioremap pfns type calls,
and if VFIO is using fault instead it won't do it.
This probably needs more hand holding for the driver somehow..
As long as these drivers don't support COW-mappings. It's all good.
And IIUC, we cannot support COW mappings if we don't use remap_pfn_range().
For this reason, remap_pfn_range() also bails out if not the whole VMA is covered
in a COW mapping.
It would be great if we could detect and fail that. Likely when trying to insert
PFNs (*not* using remap_pfn_range) manually we would have to WARN if we stumble over
a COW mapping.
In the meantime, we should really avoid any new VM_PFNMAP COW users ...
So I wonder if it's really the case in real life that only gup-fast would
need the special bit. It could be that we thought it like that, but nobody
really seriously tried run it without special bit yet to see things broke.
Indeed.
VM_PFNMAP for sure works.
VM_MIXEDMAP, I am not so sure. The s390x introduction of pte_special() [again,
I posted the commit] raised why they need it: because pfn_valid() could have
returned non-refcounted pages. One would have to dig if that is even still the
case as of today, and if other architectures have similar constraints.
What arches even use the whole 'special but not special' system?
Can we start banning some of this stuff on non-special arches?
Again, VM_PFNMAP is not a problem. Only VM_MIXEDMAP, and I would love to
see that go. There are some, but not that many users ... but I'm afraid it's
not that easy :)
--
Cheers,
David / dhildenb