On Tue, Jul 30, 2019 at 05:50:16PM +0000, Jason Gunthorpe wrote:
> The way ODP seems to work is once in hugetlb mode the dma addresses
> must give huge pages or the page fault will be failed. I think that is
> a terrible design, but this is how the driver is ..
>
> So, from this HMM perspective if the caller asked for huge pages then
> the results have to be all huge pages or a hard failure.

Which isn't how the page_shift member works at the moment.  It still
allows non-hugetlb mappings even with the member set.

> It is not negotiated as an optimization like you are thinking.
>
> [note, I haven't yet checked carefully how this works in ODP, every
> time I look at parts of it the thing seems crazy]

This seems pretty crazy.  Especially as hugetlb use in applications
seems to fade in favour of THP, for which this ODP scheme does not seem
to work at all.

> > The best API for mlx4 would of course be to pass a biovec-style
> > variable length structure that hmm_fault could fill out, but that would
> > be a major restructure.
>
> It would work, but the driver has to expand that into a page list
> right awayhow.
>
> We can't even dma map the biovec with today's dma API as it needs the
> ability to remap on a page granularity.

We can do dma_map_page loops over each biovec entry pretty trivially,
and that won't be any worse than the current loop over each page in the
hmm dma helpers.  Once I get around to the work on a better API for
iommu mappings of bio_vecs we could coalesce it similar to how we do
it with scatterlists (but without all the mess of a new structure).
That work is going to take a little longer though, as it needs the AMD
and Intel iommu drivers to be converted to dma-iommu, which isn't making
progress as fast as I had hoped.

Let me know if you want to keep this code for now despite the issues,
or if we'd rather reimplement it once you've made sense of the ODP code.
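
To make the biovec-style idea above a bit more concrete, here is a rough
userspace sketch, not kernel code.  Everything in it is an assumption
made for illustration: struct sketch_seg is a hypothetical stand-in for
a bio_vec-like entry, the 4K page size is hard-coded, and neither helper
exists in hmm or the dma API.  It only shows the two directions
discussed: coalescing a per-page list into variable-length segments, and
expanding segments back into the page list a driver might still need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed 4K pages; purely illustrative. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

/*
 * Hypothetical stand-in for a bio_vec-style entry: one contiguous run,
 * described by its start address and its length in bytes.
 */
struct sketch_seg {
	uint64_t addr;
	uint64_t len;
};

/*
 * Coalesce a per-page address list into variable-length segments.
 * Adjacent pages are merged into the previous segment.
 * Returns the number of segments written, or 0 if segs[] is too small
 * (or if npages is 0).
 */
static size_t sketch_coalesce(const uint64_t *pages, size_t npages,
			      struct sketch_seg *segs, size_t max_segs)
{
	size_t nsegs = 0;

	assert(pages && segs);

	for (size_t i = 0; i < npages; i++) {
		struct sketch_seg *last = nsegs ? &segs[nsegs - 1] : NULL;

		/* Page directly follows the previous run: extend it. */
		if (last && last->addr + last->len == pages[i]) {
			last->len += SKETCH_PAGE_SIZE;
			continue;
		}

		if (nsegs == max_segs)
			return 0;

		segs[nsegs].addr = pages[i];
		segs[nsegs].len = SKETCH_PAGE_SIZE;
		nsegs++;
	}

	return nsegs;
}

/*
 * Expand segments back into a flat page list, one entry per page.
 * Returns the number of pages written, or 0 if pages[] is too small.
 */
static size_t sketch_expand(const struct sketch_seg *segs, size_t nsegs,
			    uint64_t *pages, size_t max_pages)
{
	size_t n = 0;

	assert(segs && pages);

	for (size_t i = 0; i < nsegs; i++) {
		for (uint64_t off = 0; off < segs[i].len;
		     off += SKETCH_PAGE_SIZE) {
			if (n == max_pages)
				return 0;
			pages[n++] = segs[i].addr + off;
		}
	}

	return n;
}
```

In a real driver the per-segment loop is where a dma_map_page call per
entry would go; the sketch leaves that out on purpose as it can't be
exercised outside the kernel.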