On 2023-07-12 12:24, Helge Deller wrote:
> Hi Matoro,
>
> * matoro <matoro_mailinglist_kernel@xxxxxxxxx>:
>> On 2023-03-14 13:16, Jens Axboe wrote:
>>> From: Helge Deller <deller@xxxxxx>
>>>
>>> Some architectures have memory cache aliasing requirements (e.g. parisc)
>>> if memory is shared between userspace and kernel. This patch fixes the
>>> kernel to return an aliased address when asked by userspace via mmap().
>>>
>>> Signed-off-by: Helge Deller <deller@xxxxxx>
>>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>>> ---
>>>  io_uring/io_uring.c | 51 +++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 51 insertions(+)
>>>
>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>> index 722624b6d0dc..3adecebbac71 100644
>>> --- a/io_uring/io_uring.c
>>> +++ b/io_uring/io_uring.c
>>> @@ -72,6 +72,7 @@
>>>  #include <linux/io_uring.h>
>>>  #include <linux/audit.h>
>>>  #include <linux/security.h>
>>> +#include <asm/shmparam.h>
>>>
>>>  #define CREATE_TRACE_POINTS
>>>  #include <trace/events/io_uring.h>
>>> @@ -3317,6 +3318,54 @@ static __cold int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
>>>  	return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot);
>>>  }
>>>
>>> +static unsigned long io_uring_mmu_get_unmapped_area(struct file *filp,
>>> +	unsigned long addr, unsigned long len,
>>> +	unsigned long pgoff, unsigned long flags)
>>> +{
>>> +	const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
>>> +	struct vm_unmapped_area_info info;
>>> +	void *ptr;
>>> +
>>> +	/*
>>> +	 * Do not allow to map to user-provided address to avoid breaking the
>>> +	 * aliasing rules. Userspace is not able to guess the offset address of
>>> +	 * kernel kmalloc()ed memory area.
>>> +	 */
>>> +	if (addr)
>>> +		return -EINVAL;
>>> +
>>> +	ptr = io_uring_validate_mmap_request(filp, pgoff, len);
>>> +	if (IS_ERR(ptr))
>>> +		return -ENOMEM;
>>> +
>>> +	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>>> +	info.length = len;
>>> +	info.low_limit = max(PAGE_SIZE, mmap_min_addr);
>>> +	info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base);
>>> +#ifdef SHM_COLOUR
>>> +	info.align_mask = PAGE_MASK & (SHM_COLOUR - 1UL);
>>> +#else
>>> +	info.align_mask = PAGE_MASK & (SHMLBA - 1UL);
>>> +#endif
>>> +	info.align_offset = (unsigned long) ptr;
>>> +
>>> +	/*
>>> +	 * A failed mmap() very likely causes application failure,
>>> +	 * so fall back to the bottom-up function here. This scenario
>>> +	 * can happen with large stack limits and large mmap()
>>> +	 * allocations.
>>> +	 */
>>> +	addr = vm_unmapped_area(&info);
>>> +	if (offset_in_page(addr)) {
>>> +		info.flags = 0;
>>> +		info.low_limit = TASK_UNMAPPED_BASE;
>>> +		info.high_limit = mmap_end;
>>> +		addr = vm_unmapped_area(&info);
>>> +	}
>>> +
>>> +	return addr;
>>> +}
>>> +
>>>  #else /* !CONFIG_MMU */
>>>
>>>  static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
>>> @@ -3529,6 +3578,8 @@ static const struct file_operations io_uring_fops = {
>>>  #ifndef CONFIG_MMU
>>>  	.get_unmapped_area = io_uring_nommu_get_unmapped_area,
>>>  	.mmap_capabilities = io_uring_nommu_mmap_capabilities,
>>> +#else
>>> +	.get_unmapped_area = io_uring_mmu_get_unmapped_area,
>>>  #endif
>>>  	.poll = io_uring_poll,
>>>  #ifdef CONFIG_PROC_FS
>>
>> Hi Jens, Helge - I've bisected a regression with io_uring on ia64 to
>> this patch in 6.4.  Unfortunately this breaks userspace programs using
>> io_uring; the easiest one to test is cmake with an io_uring-enabled
>> libuv (i.e., libuv >= 1.45.0), which will hang.
>>
>> I am aware that ia64 is in a vulnerable place right now, which is why I
>> am keeping this spread limited.
>>
>> Since this clearly involves architecture-specific changes for parisc,
>
> it isn't so much architecture-specific... (just one ifdef)
>
>> is there any chance of looking at what is required to do the same for
>> ia64?  I looked at 0ef36bd2b37815719e31a72d2beecc28ca8ecd26 ("parisc:
>> change value of SHMLBA from 0x00400000 to PAGE_SIZE") and tried to
>> replicate the SHMLBA -> SHM_COLOUR change, but it made no difference.
>>
>> If hardware is necessary for testing, I can provide it, including
>> remote BMC access for restarts/kernel debugging.  Any takers?
>
> I won't have time to test myself, but maybe you could test?
>
> Basically we should try to find out why io_uring_mmu_get_unmapped_area()
> doesn't return valid addresses, while arch_get_unmapped_area() [in
> arch/ia64/kernel/sys_ia64.c] does.
>
> You could apply this patch first.  It introduces a memory leak (as it
> requests memory twice), but maybe we get an idea?
> The ia64 arch_get_unmapped_area() searches for memory from the bottom
> (flags = 0), while the io_uring function tries top-down first.  Maybe
> that's the problem.  And I don't understand the offset_in_page() check
> right now.
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 3bca7a79efda..93b1964d2bbb 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -3431,13 +3431,17 @@ static unsigned long io_uring_mmu_get_unmapped_area(struct file *filp,
>  	 * can happen with large stack limits and large mmap()
>  	 * allocations.
>  	 */
> +/* compare to arch_get_unmapped_area() in arch/ia64/kernel/sys_ia64.c */
>  	addr = vm_unmapped_area(&info);
> -	if (offset_in_page(addr)) {
> +printk("io_uring_mmu_get_unmapped_area() address 1 is: %px\n", addr);
> +	addr = NULL;
> +	if (!addr) {
>  		info.flags = 0;
>  		info.low_limit = TASK_UNMAPPED_BASE;
>  		info.high_limit = mmap_end;
>  		addr = vm_unmapped_area(&info);
>  	}
> +printk("io_uring_mmu_get_unmapped_area() returns address %px\n", addr);
>
>  	return addr;
>  }
>
> Another option is to disable the call to io_uring_mmu_get_unmapped_area()
> with the next patch.  Maybe you could add printk()s to ia64's
> arch_get_unmapped_area() and check what it returns there?
>
> @@ -3654,6 +3658,8 @@ static const struct file_operations io_uring_fops = {
>  #ifndef CONFIG_MMU
>  	.get_unmapped_area = io_uring_nommu_get_unmapped_area,
>  	.mmap_capabilities = io_uring_nommu_mmap_capabilities,
> +#elif 0 /* IS_ENABLED(CONFIG_IA64) */
> +	.get_unmapped_area = NULL,
>  #else
>  	.get_unmapped_area = io_uring_mmu_get_unmapped_area,
>  #endif
>
> Helge
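(Context for anyone jumping in here: the mapping that hits the new
io_uring_mmu_get_unmapped_area() is just the plain io_uring_setup() + mmap()
sequence, so a standalone test along the lines below should exercise the same
path as the libuv/cmake reproducer.  This is only a rough sketch based on the
io_uring_setup(2) man page; I have not verified this exact program on ia64.)

/* Sketch: set up a tiny ring and mmap the SQ ring the way liburing does,
 * so the kernel's io_uring get_unmapped_area path is exercised.
 */
#include <linux/io_uring.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	struct io_uring_params p;
	memset(&p, 0, sizeof(p));

	/* No glibc wrapper for io_uring_setup(), so go through syscall(). */
	int fd = syscall(__NR_io_uring_setup, 8, &p);
	if (fd < 0) {
		perror("io_uring_setup");
		return 1;
	}

	/* Size of the SQ ring area, as described in io_uring_setup(2). */
	size_t sq_sz = p.sq_off.array + p.sq_entries * sizeof(unsigned int);

	/* This mmap() is the call that ends up in the kernel's
	 * io_uring get_unmapped_area handler. */
	void *sq = mmap(NULL, sq_sz, PROT_READ | PROT_WRITE,
			MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQ_RING);
	if (sq == MAP_FAILED) {
		perror("mmap SQ ring");
		return 1;
	}
	printf("SQ ring mapped at %p\n", sq);
	return 0;
}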
Thanks Helge.  Sample output from that first patch:

[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 is: 1ffffffffff40000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns address 2000000001e40000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 is: 1ffffffffff20000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns address 2000000001f20000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 is: 1ffffffffff30000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns address 2000000001f30000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 is: 1ffffffffff90000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns address 2000000001f90000
This pattern seems to be pretty stable.  I then tried instead directly returning the result of a call to arch_get_unmapped_area() at the end of the function, and the pattern looks similar:
[Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would return address 1ffffffffffd0000
[Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return address 2000000001f00000
[Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would return address 1ffffffffff00000
[Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return address 1ffffffffff00000
[Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would return address 1fffffffffe20000
[Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return address 2000000002000000
[Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would return address 1fffffffffe30000
[Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return address 2000000002100000
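For reference, those lines came from a hack roughly like the following at the
tail of io_uring_mmu_get_unmapped_area() -- a sketch rather than the exact
diff I applied; I believe the arch_get_unmapped_area() prototype is already
available via <linux/sched/mm.h>:

	/* Compute the io_uring candidate as before and print it, then print
	 * and return what arch_get_unmapped_area() would have picked for the
	 * same request instead.
	 */
	addr = vm_unmapped_area(&info);
	if (offset_in_page(addr)) {
		info.flags = 0;
		info.low_limit = TASK_UNMAPPED_BASE;
		info.high_limit = mmap_end;
		addr = vm_unmapped_area(&info);
	}
	printk("io_uring_mmu_get_unmapped_area() would return address %lx\n", addr);

	addr = arch_get_unmapped_area(filp, 0, len, pgoff, flags);
	printk("but arch_get_unmapped_area() would return address %lx\n", addr);
	return addr;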
Is that enough of a clue to go on?