On Fri, May 14, 2021 at 9:21 PM Dan Siemon <dan@xxxxxxxxxxxxx> wrote:
>
> I've been trying to work with large Umem areas and have a few
> questions. I'd appreciate any help or pointers. If it makes any
> difference, my AF_XDP testing is with i40e.

These issues are driver independent, but I appreciate that you reported
this. As you are well aware, some things are driver dependent.

> 1) I use kernel args to reserve huge pages on boot. The application
> mmap call with the huge TLB flag appears to use huge pages as I can
> see the count of used huge pages go up (/proc/meminfo). However, the
> number of pages used by the umem, as shown in ss output, looks to
> still be 4k pages. Are there plans to support huge pages in Umem? How
> hard would this be?

Something similar has been on the todo list for two years, but sadly
neither Björn nor I have had any time to pick this up, and I cannot see
myself having the time in the foreseeable future either. There are at
least 3 problems that would have to be addressed in this area:

1: Using a huge page for the umem kernel mapping. As you have allocated
   the area with a huge page, it will be physically contiguous.
2: Making sure the DMA addresses are contiguous.
3: Using a huge page for the IOMMU and its DMA mappings.

#1 and #3 are hard problems, at least in my mind. I am no mm or iommu
guy, but I do not believe there is support for this in the kernel for
its own mappings; the kernel will break huge pages down into 4K pages
there. If I am incorrect, I hope that someone reading this will correct
me. But we should do some mailing list browsing here to see what the
latest thoughts are and what has been tried before.

As for #2, Björn had some discussions with the iommu maintainer about
this in the past [1]. There is no such interface in the iommu subsystem
today, but components such as graphics drivers use a "hack" to make
sure that this happens, and fail if it does not. We do not have to
fail, as we can always fall back to the method we have today. Today we
have an array (dma_addr_t *dma_pages) that stores the addresses of all
the 4K DMA regions. With such an interface in place, we could replace
the array with a single address pointing to the start of the area,
improving performance. #2 is a prerequisite for #3 too.

Christoph Hellwig submitted an interface proposal about a year ago [1],
but nobody has taken on the challenge to implement it.

[1] https://lkml.org/lkml/2020/7/8/131

> 2) It looks like there is a limit of 2GB on the maximum Umem size?
> I've tried with and without huge pages. Is this fundamental? How hard
> would it be to increase this?

This was news to me. Do you know where in the xdp_umem_reg code it
complains about this? I guess it is xsk_umem__create() that fails,
right? The only limit I see from a basic inspection of the code is that
the number of packet buffers cannot be larger than a u32 (4G), and you
are not close to that limit. Björn, do you know where this limit stems
from?

Thanks: Magnus

> For both of these, I'd like to try to help make them happen. If the
> kernel side changes are deep or large, it may be beyond me but I can
> offer lab equipment and testing.
>
> Thanks.
>
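
P.S. In case it is useful for experimenting with the huge page side,
here is a rough, untested sketch of how I would expect a
huge-page-backed umem area to be set up from user space. The names and
sizes in it (UMEM_SIZE, the default 2 MB huge pages behind MAP_HUGETLB,
libbpf's <bpf/xsk.h>) are just assumptions for illustration, and note
that even with such a mapping the kernel-side and DMA mappings are
still built from 4K pages today, which is what problems #1-#3 above are
about.

/* Rough sketch, untested: back the umem with an explicit huge page
 * mapping and register it with xsk_umem__create(). Assumes libbpf's
 * <bpf/xsk.h> and enough 2 MB huge pages reserved on the kernel
 * command line or via /proc/sys/vm/nr_hugepages. Link with -lbpf.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <bpf/xsk.h>

#define UMEM_SIZE (1ULL << 30) /* 1 GB umem area for illustration */

int main(void)
{
	struct xsk_ring_prod fill;
	struct xsk_ring_cons comp;
	struct xsk_umem *umem;
	void *area;
	int ret;

	/* MAP_HUGETLB only affects the user-space mapping; the kernel
	 * mapping of the umem is still done in 4K pages today, which
	 * is what ss reports. */
	area = mmap(NULL, UMEM_SIZE, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (area == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	/* NULL config means default frame size and ring sizes. */
	ret = xsk_umem__create(&umem, area, UMEM_SIZE, &fill, &comp, NULL);
	if (ret) {
		fprintf(stderr, "xsk_umem__create: %d\n", ret);
		munmap(area, UMEM_SIZE);
		return EXIT_FAILURE;
	}

	printf("registered a umem of %llu bytes\n", UMEM_SIZE);
	xsk_umem__delete(umem);
	munmap(area, UMEM_SIZE);
	return EXIT_SUCCESS;
}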
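
And to make the dma_pages point above concrete, this is roughly (not
the actual kernel code) the difference between the per-4K-page lookup
we do today and what a single contiguous DMA mapping would allow:

/* Illustration only, not the kernel implementation: translating a umem
 * offset to a DMA address with one dma_addr_t per 4K page versus with
 * a single contiguous mapping. */
#include <stdint.h>

#define PG_SHIFT 12
#define PG_SIZE  (1ULL << PG_SHIFT)

typedef uint64_t dma_addr_t;

/* Today: one array lookup per packet to find the page's DMA address. */
static inline dma_addr_t dma_addr_today(const dma_addr_t *dma_pages,
					uint64_t addr)
{
	return dma_pages[addr >> PG_SHIFT] + (addr & (PG_SIZE - 1));
}

/* With contiguous DMA addresses: a single base is enough, removing the
 * extra memory access from the data path. */
static inline dma_addr_t dma_addr_contig(dma_addr_t base, uint64_t addr)
{
	return base + addr;
}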