On Tue, Jul 30, 2019 at 12:55:17PM +0000, Jason Gunthorpe wrote:
> I suspect this was added for the ODP conversion that does use both
> page sizes. I think the ODP code for this is kind of broken, but I
> haven't delved into that..
>
> The challenge is that the driver needs to know what page size to
> configure the hardware before it does any range stuff.
>
> The other challenge is that the HW is configured to do only one page
> size, and if the underlying CPU page size changes it goes south.
>
> What I would prefer is if the driver could somehow dynamically adjust
> the page size after each dma map, but I don't know if ODP HW can
> do that.
>
> Since this is all driving toward making ODP use this maybe we should
> keep this API?
>
> I'm not sure I can lose the crappy huge page support in ODP.

The problem is that I see no way to use the current API. To know the
huge page size you need to have the vma, and the current API doesn't
require a vma to be passed in.

That's why I suggested an API where we pass a flag into hmm_range_fault
saying that huge pages are ok, and it could then pass the shift out and
limit itself to a single vma (which it normally doesn't, that is an
additional complication). But all this still seems really awkward in
terms of an API.

AFAIK ODP is only used by mlx5, and mlx5 unlike other IB HCAs can use
scatterlist style MRs with variable length per entry, so even if we
pass multiple pages per entry from hmm it could coalesce them. The best
API for mlx4 would of course be to pass a biovec-style variable length
structure that hmm_fault could fill out, but that would be a major
restructure.
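
To make the coalescing idea concrete, here is a rough sketch in plain
userspace C, not kernel code. struct sketch_seg and sketch_coalesce()
are made-up names for illustration only; they are not part of hmm,
mlx5, or the scatterlist API, and the 4 KiB base page size is an
assumption. It merges runs of physically contiguous pages into
variable-length entries:

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB base pages */

/* Hypothetical variable-length entry, loosely modelled on a scatterlist element. */
struct sketch_seg {
	uint64_t addr;	/* start address of the contiguous run */
	uint64_t len;	/* length of the run in bytes */
};

/*
 * Merge physically contiguous pages into variable-length segments.
 * Returns the number of segments written, or -1 if segs[] is too small.
 */
static long sketch_coalesce(const uint64_t *pfns, size_t npages,
			    struct sketch_seg *segs, size_t max_segs)
{
	size_t nsegs = 0;

	for (size_t i = 0; i < npages; i++) {
		uint64_t addr = pfns[i] << SKETCH_PAGE_SHIFT;

		/* Extend the previous segment if this page directly follows it. */
		if (nsegs > 0 &&
		    segs[nsegs - 1].addr + segs[nsegs - 1].len == addr) {
			segs[nsegs - 1].len += 1ULL << SKETCH_PAGE_SHIFT;
			continue;
		}

		/* Otherwise start a new segment, if there is room for one. */
		if (nsegs == max_segs)
			return -1;
		segs[nsegs].addr = addr;
		segs[nsegs].len = 1ULL << SKETCH_PAGE_SHIFT;
		nsegs++;
	}
	return (long)nsegs;
}
```

With something along these lines a huge page shows up as one long
entry and a run of small pages as several short ones, so the consumer
would not need to pick a single page size up front.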