Re: [PATCH 1/5] io_uring: Adjust mapping wrt architecture aliasing requirements

Helge Deller <deller@xxxxxx> · Wed, 12 Jul 2023 22:30:31 +0200

On 7/12/23 21:05, Helge Deller wrote:
On 7/12/23 19:28, matoro wrote:
On 2023-07-12 12:24, Helge Deller wrote:
Hi Matoro,

* matoro <matoro_mailinglist_kernel@xxxxxxxxx>:
On 2023-03-14 13:16, Jens Axboe wrote:
> From: Helge Deller <deller@xxxxxx>
>
> Some architectures have memory cache aliasing requirements (e.g. parisc)
> if memory is shared between userspace and kernel. This patch fixes the
> kernel to return an aliased address when asked by userspace via mmap().
>
> Signed-off-by: Helge Deller <deller@xxxxxx>
> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
> ---
>  io_uring/io_uring.c | 51 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 722624b6d0dc..3adecebbac71 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -72,6 +72,7 @@
>  #include <linux/io_uring.h>
>  #include <linux/audit.h>
>  #include <linux/security.h>
> +#include <asm/shmparam.h>
>
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/io_uring.h>
> @@ -3317,6 +3318,54 @@ static __cold int io_uring_mmap(struct file
> *file, struct vm_area_struct *vma)
>      return remap_pfn_range(vma, vma->vm_start, pfn, sz,
> vma->vm_page_prot);
>  }
>
> +static unsigned long io_uring_mmu_get_unmapped_area(struct file *filp,
> +            unsigned long addr, unsigned long len,
> +            unsigned long pgoff, unsigned long flags)
> +{
> +    const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
> +    struct vm_unmapped_area_info info;
> +    void *ptr;
> +
> +    /*
> +     * Do not allow to map to user-provided address to avoid breaking the
> +     * aliasing rules. Userspace is not able to guess the offset address
> of
> +     * kernel kmalloc()ed memory area.
> +     */
> +    if (addr)
> +        return -EINVAL;
> +
> +    ptr = io_uring_validate_mmap_request(filp, pgoff, len);
> +    if (IS_ERR(ptr))
> +        return -ENOMEM;
> +
> +    info.flags = VM_UNMAPPED_AREA_TOPDOWN;
> +    info.length = len;
> +    info.low_limit = max(PAGE_SIZE, mmap_min_addr);
> +    info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base);
> +#ifdef SHM_COLOUR
> +    info.align_mask = PAGE_MASK & (SHM_COLOUR - 1UL);
> +#else
> +    info.align_mask = PAGE_MASK & (SHMLBA - 1UL);
> +#endif
> +    info.align_offset = (unsigned long) ptr;
> +
> +    /*
> +     * A failed mmap() very likely causes application failure,
> +     * so fall back to the bottom-up function here. This scenario
> +     * can happen with large stack limits and large mmap()
> +     * allocations.
> +     */
> +    addr = vm_unmapped_area(&info);
> +    if (offset_in_page(addr)) {
> +        info.flags = 0;
> +        info.low_limit = TASK_UNMAPPED_BASE;
> +        info.high_limit = mmap_end;
> +        addr = vm_unmapped_area(&info);
> +    }
> +
> +    return addr;
> +}
> +
>  #else /* !CONFIG_MMU */
>
>  static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
> @@ -3529,6 +3578,8 @@ static const struct file_operations io_uring_fops
> = {
>  #ifndef CONFIG_MMU
>      .get_unmapped_area = io_uring_nommu_get_unmapped_area,
>      .mmap_capabilities = io_uring_nommu_mmap_capabilities,
> +#else
> +    .get_unmapped_area = io_uring_mmu_get_unmapped_area,
>  #endif
>      .poll        = io_uring_poll,
>  #ifdef CONFIG_PROC_FS

Hi Jens, Helge - I've bisected a regression with io_uring on ia64 to this
patch in 6.4.  Unfortunately this breaks userspace programs using io_uring,
the easiest one to test is cmake with an io_uring enabled libuv (i.e., libuv
>= 1.45.0) which will hang.

I am aware that ia64 is in a vulnerable place right now which I why I am
keeping this spread limited.  Since this clearly involves
architecture-specific changes for parisc,

it isn't so much architecture-specific... (just one ifdef)

is there any chance of looking at
what is required to do the same for ia64?  I looked at
0ef36bd2b37815719e31a72d2beecc28ca8ecd26 ("parisc: change value of SHMLBA
from 0x00400000 to PAGE_SIZE") and tried to replicate the SHMLBA ->
SHM_COLOUR change, but it made no difference.

If hardware is necessary for testing, I can provide it, including remote BMC
access for restarts/kernel debugging.  Any takers?

I won't have time to test myself, but maybe you could test?

Basically we should try to find out why io_uring_mmu_get_unmapped_area()
doesn't return valid addresses, while arch_get_unmapped_area()
[in arch/ia64/kernel/sys_ia64.c] does.

You could apply this patch first:
It introduces a memory leak (as it requests memory twice), but maybe we
get an idea?
The ia64 arch_get_unmapped_area() searches for memory from bottom
(flags=0), while io_uring function tries top-down first. Maybe that's
the problem. And I don't understand the offset_in_page() check right
now.

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3bca7a79efda..93b1964d2bbb 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3431,13 +3431,17 @@ static unsigned long io_uring_mmu_get_unmapped_area(struct file *filp,
      * can happen with large stack limits and large mmap()
      * allocations.
      */
+/* compare to arch_get_unmapped_area() in arch/ia64/kernel/sys_ia64.c */
     addr = vm_unmapped_area(&info);
-    if (offset_in_page(addr)) {
+printk("io_uring_mmu_get_unmapped_area() address 1 is: %px\n", addr);
+    addr = NULL;
+    if (!addr) {
         info.flags = 0;
         info.low_limit = TASK_UNMAPPED_BASE;
         info.high_limit = mmap_end;
         addr = vm_unmapped_area(&info);
     }
+printk("io_uring_mmu_get_unmapped_area() returns address %px\n", addr);

     return addr;
 }


Another option is to disable the call to io_uring_nommu_get_unmapped_area())
with the next patch. Maybe you could add printks() to ia64's arch_get_unmapped_area()
and check what it returns there?

@@ -3654,6 +3658,8 @@ static const struct file_operations io_uring_fops = {
 #ifndef CONFIG_MMU
     .get_unmapped_area = io_uring_nommu_get_unmapped_area,
     .mmap_capabilities = io_uring_nommu_mmap_capabilities,
+#elif 0    /* IS_ENABLED(CONFIG_IA64) */
+    .get_unmapped_area = NULL,
 #else
     .get_unmapped_area = io_uring_mmu_get_unmapped_area,
 #endif

Helge

Thanks Helge.  Sample output from that first patch:

[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 is: 1ffffffffff40000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns address 2000000001e40000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 is: 1ffffffffff20000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns address 2000000001f20000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 is: 1ffffffffff30000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns address 2000000001f30000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 is: 1ffffffffff90000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns address 2000000001f90000

This pattern seems to be pretty stable, I tried instead just directly returning the result of a call to arch_get_unmapped_area() at the end of the function and it seems similar:

[Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would return address 1ffffffffffd0000
[Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return address 2000000001f00000
[Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would return address 1ffffffffff00000
[Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return address 1ffffffffff00000
[Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would return address 1fffffffffe20000
[Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return address 2000000002000000
[Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would return address 1fffffffffe30000
[Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return address 2000000002100000

Is that enough of a clue to go on?

SHMLBA on ia64 is 0x100000:
arch/ia64/include/asm/shmparam.h:#define        SHMLBA  (1024*1024)
but the values returned by io_uring_mmu_get_unmapped_area() does not fullfill this.

So, probably ia64's SHMLBA isn't pulled in correctly in io_uring/io_uring.c.
Check value of this line:
     info.align_mask = PAGE_MASK & (SHMLBA - 1UL);

You could also add
#define SHM_COLOUR  0x100000
in front of the
     #ifdef SHM_COLOUR
(define SHM_COLOUR in io_uring/kbuf.c too).

What is the value of PAGE_SIZE and "ptr" on your machine?
For 4k page size I get:
SHMLBA -1   ->        FFFFF
PAGE_MASK   -> FFFFFFFFF000
so,
info.align_mask = PAGE_MASK & (SHMLBA - 1UL) = 0xFF000;
You could try to set nfo.align_mask = 0xfffff;

Helge