On Thu, Mar 25, 2021 at 8:50 AM Christian König <christian.koenig@xxxxxxx> wrote:
>
> Am 25.03.21 um 00:14 schrieb Jason Gunthorpe:
> > On Wed, Mar 24, 2021 at 09:07:53PM +0100, Thomas Hellström (Intel) wrote:
> >> On 3/24/21 7:31 PM, Christian König wrote:
> >>>
> >>> Am 24.03.21 um 17:38 schrieb Jason Gunthorpe:
> >>>> On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel)
> >>>> wrote:
> >>>>> On 3/24/21 2:48 PM, Jason Gunthorpe wrote:
> >>>>>> On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström
> >>>>>> (Intel) wrote:
> >>>>>>
> >>>>>>>> In an ideal world the creation/destruction of page table
> >>>>>>>> levels would be dynamic at this point, like THP.
> >>>>>>> Hmm, but I'm not sure what problem we're trying to solve by
> >>>>>>> changing the interface in this way?
> >>>>>> We are trying to make a sensible driver API to deal with huge pages.
> >>>>>>> Currently if the core vm requests a huge pud, we give it one,
> >>>>>>> and if we can't or don't want to (because of dirty-tracking,
> >>>>>>> for example, which is always done at the 4K page level) we
> >>>>>>> just return VM_FAULT_FALLBACK, and the fault is retried at a
> >>>>>>> lower level.
> >>>>>> Well, my thought would be to move the pte related stuff into
> >>>>>> vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.
> >>>>>>
> >>>>>> I don't know if the locking works out, but it feels cleaner that
> >>>>>> the driver tells the vmf how big a page it can stuff in, not the
> >>>>>> vm telling the driver to stuff in a certain size page which it
> >>>>>> might not want to do.
> >>>>>>
> >>>>>> Some devices want to work on an in-between page size like 64k so
> >>>>>> they can't form 2M pages but they can stuff 64k of 4K pages in a
> >>>>>> batch on every fault.
> >>>>> Hmm, yes, but we would in that case be limited anyway to inserting
> >>>>> ranges smaller than or equal to the fault size, to avoid extensive
> >>>>> and possibly unnecessary checks for contiguous memory.
> >>>> Why? The insert function is walking the page tables, it just updates
> >>>> things as they are. It learns the arrangement for free while doing
> >>>> the walk.
> >>>>
> >>>> The device has to always provide consistent data; if it overlaps
> >>>> into pages that are already populated that is fine, so long as it
> >>>> isn't changing their addresses.
> >>>>
> >>>>> And then if we can't support the full fault size, we'd need to
> >>>>> either presume a size and alignment of the next level or search
> >>>>> for contiguous memory in both directions around the fault address,
> >>>>> perhaps unnecessarily as well.
> >>>> You don't really need to care about levels, the device should be
> >>>> faulting in the largest memory regions it can within its efficiency.
> >>>>
> >>>> If it works on 4M pages then it should be faulting 4M pages. The
> >>>> page size of the underlying CPU doesn't really matter much other
> >>>> than some tuning to impact how the device's allocator works.
> >> Yes, but then we'd be adding a lot of complexity into this function
> >> that is already provided by the current interface for DAX, for little
> >> or no gain, at least in the drm/ttm setting. Please think of the
> >> following situation: You get a fault, you do an extensive,
> >> time-consuming scan of your VRAM buffer object into which the fault
> >> goes and determine you can fault 1GB. Now you hand it to
> >> vmf_insert_range() and because the user-space address is misaligned,
> >> or already partly populated because of a previous eviction, you can
> >> only fault single pages, and you end up faulting a full GB of single
> >> pages, perhaps for a one-time small update.
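To keep the two models being debated concrete, here's a minimal sketch of
the per-level ->huge_fault contract Thomas describes above. The handler
name and its state are made up for illustration; this is not the actual
TTM fault code:

#include <linux/mm.h>
#include <linux/huge_mm.h>
#include <linux/pfn_t.h>

static vm_fault_t sketch_huge_fault(struct vm_fault *vmf,
				    enum page_entry_size pe_size)
{
	unsigned long pfn = 0;		/* assume: backing pfn already resolved */
	bool contiguous_2m = false;	/* assume: result of the driver's own scan */

	switch (pe_size) {
	case PE_SIZE_PMD:
		/*
		 * Only possible when the backing store really is 2MiB
		 * contiguous and the user-space address is 2MiB aligned.
		 */
		if (contiguous_2m)
			return vmf_insert_pfn_pmd(vmf,
						  __pfn_to_pfn_t(pfn, PFN_DEV),
						  vmf->flags & FAULT_FLAG_WRITE);
		fallthrough;
	default:
		/*
		 * E.g. dirty tracking is done at 4K granularity: decline,
		 * and the core retries the fault one level down.
		 */
		return VM_FAULT_FALLBACK;
	}
}

The question in this thread is whether the driver should keep answering
these per-level questions, or hand the core a pfn range once and let the
core pick the entry sizes.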
> > Why would "you can only fault single pages" ever be true? If you have
> > 1GB of pages then the vmf_insert_range should allocate enough page
> > table entries to consume it, regardless of alignment.
>
> Completely agree with Jason. Filling in the CPU page tables is
> relatively cheap if you fill in a large contiguous range.
>
> In other words, filling in 1GiB as a linear range is *much* less
> overhead than filling in 1<<18 4KiB faults.
>
> I would say that this is always preferable, even if the CPU only wants
> to update a single byte.
>
> > And why shouldn't DAX switch to this kind of interface anyhow? It is
> > basically exactly the same problem. The underlying filesystem block
> > size is *not* necessarily aligned to the CPU page table sizes and DAX
> > would benefit from better handling of this mismatch.
> >
> >> On top of this, unless we want to do the walk trying increasingly
> >> smaller sizes of vmf_insert_xxx(), we'd have to use
> >> apply_to_page_range() and teach it about transhuge page table
> >> entries, because pagewalk.c can't be used (it can't populate page
> >> tables). That also means apply_to_page_range() needs to be
> >> complicated with page table locks, since transhuge pages aren't
> >> stable and can be zapped and refaulted under us while we do the walk.
> > I didn't say it would be simple :) But we also need to stop hacking
> > around the sides of all this huge page stuff and come up with sensible
> > APIs that drivers can actually implement correctly. Exposing drivers
> > to specific kinds of page levels really feels like the wrong level of
> > abstraction.
> >
> > Once we start doing this we should do it everywhere, the io_remap_pfn
> > stuff should be able to create huge special IO pages as well, for
> > instance.
>
> Oh, yes please!
>
> We easily have 16GiB of VRAM which is linearly mapped into the kernel
> space for each GPU instance.
>
> Doing that with 1GiB mappings instead of 4KiB would be quite a win.

io_remap_pfn is for userspace mmaps. Kernel mappings should already be as
big as possible, I think, for everything.
-Daniel

> Regards,
> Christian.
>
> >
> >> On top of this, the user-space address allocator needs to know how
> >> large gpu pages are aligned in buffer objects to have a reasonable
> >> chance of aligning with CPU huge page boundaries, which is a
> >> requirement to be able to insert a huge CPU page table entry, so the
> >> driver would basically need the drm helper that can do this alignment
> >> anyway.
> > Don't you have this problem anyhow?
> >
> > Jason
> _______________________________________________
> dri-devel mailing list
> dri-devel@xxxxxxxxxxxxxxxxxxxxx
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
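P.S. Since vmf_insert_range() keeps coming up: it does not exist in the
kernel today. Purely as a strawman of the shape being argued for, with
every name below (the bo struct, the helper, the function itself) made
up, the driver side could look roughly like this:

#include <linux/mm.h>

struct sketch_bo { unsigned long base_pfn; };	/* hypothetical BO stand-in */

/*
 * Hypothetical driver helper: how many contiguous backing pages surround
 * page offset 'pgoff', and at which bo-relative page 'first' they start.
 */
unsigned long sketch_bo_contig_pages(struct sketch_bo *bo,
				     unsigned long pgoff,
				     unsigned long *first);

/*
 * The proposed (non-existent) core helper: populate 'npages' entries
 * starting at 'pfn' for the part of the vma containing vmf->address,
 * using the largest page-table entries that alignment and the already
 * populated entries permit.
 */
vm_fault_t vmf_insert_range(struct vm_fault *vmf, unsigned long pfn,
			    unsigned long npages);

static vm_fault_t sketch_fault_range(struct vm_fault *vmf)
{
	struct sketch_bo *bo = vmf->vma->vm_private_data;
	unsigned long first, npages;

	/* Report the largest contiguous chunk around the fault... */
	npages = sketch_bo_contig_pages(bo, vmf->pgoff, &first);

	/* ...and let the core pick PUD/PMD/PTE entries as it sees fit. */
	return vmf_insert_range(vmf, bo->base_pfn + first, npages);
}

No VM_FAULT_FALLBACK retries and no per-level special cases in the
driver, which is more or less the point Jason is making above.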