On Wed, Sep 8, 2021 at 8:49 PM Peter Oskolkov <posk@xxxxxxx> wrote:
> Add helper functions to work atomically with userspace 32/64 bit values -
> there are some .*futex.* named helpers, but they are not exactly
> what is needed for UMCG; I haven't found what else I could use, so I
> rolled these.
[...]
> +static inline int fix_pagefault(unsigned long uaddr, bool write_fault)
> +{
> +	struct mm_struct *mm = current->mm;
> +	int ret;
> +
> +	mmap_read_lock(mm);

A minor note: mmap_read_lock() can potentially block for extended amounts of time, so *if* you have a way to safely bail out, it's nice to use mmap_read_lock_killable() to ensure that if the task gets killed, it'll go away quickly instead of continuing to potentially wait on the lock. (But of course, there are situations where you just can't do that, so there you have to use the unconditional mmap_read_lock().)

So here, since you need a bailout path anyway in this function, I'd write:

	if (mmap_read_lock_killable(mm))
		return -EINTR;

> +	ret = fixup_user_fault(mm, uaddr, write_fault ? FAULT_FLAG_WRITE : 0,
> +			       NULL);
> +	mmap_read_unlock(mm);
> +
> +	return ret < 0 ? ret : 0;
> +}

[...]

> +static inline int __try_xchg_user_64(u64 *oval, u64 __user *uaddr, u64 newval)
> +{
> +	u64 oldval = 0;
> +	int ret = 0;
> +
> +	asm volatile("\n"
> +		"1:\txchgq %0, %2\n"
> +		"2:\n"
> +		"\t.section .fixup, \"ax\"\n"
> +		"3:\tmov %3, %0\n"
> +		"\tjmp 2b\n"
> +		"\t.previous\n"
> +		_ASM_EXTABLE_UA(1b, 3b)
> +		: "=r" (oldval), "=r" (ret), "+m" (*uaddr)
> +		: "i" (-EFAULT), "0" (newval), "1" (0)
> +		);
> +
> +	if (ret)
> +		return ret;
> +
> +	*oval = oldval;
> +	return 0;
> +}

[...]
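To spell the fix_pagefault() suggestion out in full, the helper would then read something like this (just a sketch folding in the change described above; otherwise identical to the posted version):

```c
static inline int fix_pagefault(unsigned long uaddr, bool write_fault)
{
	struct mm_struct *mm = current->mm;
	int ret;

	/* Bail out instead of waiting on the lock if the task was killed. */
	if (mmap_read_lock_killable(mm))
		return -EINTR;
	ret = fixup_user_fault(mm, uaddr, write_fault ? FAULT_FLAG_WRITE : 0,
			       NULL);
	mmap_read_unlock(mm);

	return ret < 0 ? ret : 0;
}
```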
> +/**
> + * xchg_64_user - atomically exchange 64-bit values
> + *
> + * Return:
> + * 0 - OK
> + * -EFAULT: memory access error
> + */
> +static inline int xchg_user_64(u64 __user *uaddr, u64 *val)
> +{
> +	int ret = -EFAULT;
> +
> +	if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
> +		return -EFAULT;

For these atomic xchg operations, I think you should probably also check for proper alignment (something like "(unsigned long)uaddr % sizeof(*uaddr) == 0", I guess)?

Otherwise the __try_xchg_user_64() call could hit Split Lock Detection (see <https://lore.kernel.org/all/1555536851-17462-1-git-send-email-fenghua.yu@xxxxxxxxx/>) on newer hardware: the XCHGQ instruction would throw a #AC exception, which would get caught by the kernel because it happened during user access, so you'd land on the fixup path that returns -EFAULT. This function would then assume the failure was caused by a pagefault and invoke pagefault handling; the pagefault handling would say "the page is present, try again now", and you'd end up in an infinite loop...

(Yes, the futex version doesn't have that - the futex code instead does that check further up, in get_futex_key() and handle_futex_death().)

> +	pagefault_disable();
> +
> +	while (true) {
> +		__uaccess_begin_nospec();
> +		ret = __try_xchg_user_64(val, uaddr, *val);
> +		user_access_end();
> +
> +		if (!ret)
> +			break;
> +
> +		if (fix_pagefault((unsigned long)uaddr, true) < 0)
> +			break;
> +	}
> +
> +	pagefault_enable();
> +
> +	return ret;
> +}
> +
> +/**
> + * get_user_nosleep - get user value with inline fixup without sleeping.
> + *
> + * get_user() might sleep and therefore cannot be used in preempt-disabled
> + * regions.
> + */

If this function is not allowed to sleep, as the comment says...
> +#define get_user_nosleep(out, uaddr)					\
> +({									\
> +	int ret = -EFAULT;						\
> +									\
> +	if (access_ok((uaddr), sizeof(*(uaddr)))) {			\
> +		pagefault_disable();					\
> +									\
> +		while (true) {						\
> +			if (!__get_user((out), (uaddr))) {		\
> +				ret = 0;				\
> +				break;					\
> +			}						\
> +									\
> +			if (fix_pagefault((unsigned long)(uaddr), false) < 0) \
> +				break;					\

... then I'm pretty sure you can't call fix_pagefault() here: it acquires the mmap semaphore (which may involve sleeping) and then goes through the pagefault handling path (which can also sleep for various reasons, like allocating memory for pagetables, loading pages from disk / NFS / FUSE, and so on).

If you're in some kind of non-sleepable context and you want to access a userspace address that isn't currently paged in, you have to get out of whatever non-sleepable context you're in before going through the fault-handling path, and only then go back into the context you were in.

Alternatively, if sleeping isn't possible and getting back out of the non-sleepable context temporarily is also too hard, you could try to look up the userspace page ahead of time (e.g. during umcg_ctl()) with pin_user_pages_unlocked() and kmap() it into the kernel. That's probably a lot slower than a direct userspace access, but if you only have to do it during umcg_ctl(), that might not be a problem. It also more or less requires that the userspace struct doesn't cross a page boundary (otherwise you'd have to either vmap it or use helpers for accessing the pages), it means you have 4KiB of extra unswappable memory per thread, and it might worsen memory fragmentation (because pinned pages can't be migrated anymore, even though the kernel thought they'd probably be migratable at page allocation time).

Since it looks like you want to access userspace memory during sched_submit_work() (e.g. for storing the UMCG_TF_PREEMPTED flag), I think the pin_user_pages_unlocked() approach is what you'll have to use there.
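A rough sketch of what that pin-ahead-of-time setup could look like during umcg_ctl() (kernel-side; the struct and helper names here are made up for illustration, and it assumes the userspace struct fits within one page):

```c
/* Hypothetical per-task state holding the pinned page and its mapping. */
struct umcg_pinned {
	struct page *page;
	struct umcg_task *kaddr;	/* kernel mapping of the user struct */
};

static int umcg_pin_task_struct(unsigned long uaddr, struct umcg_pinned *p)
{
	struct page *page;
	long ret;

	/* The struct must not cross a page boundary for this scheme. */
	if (offset_in_page(uaddr) + sizeof(struct umcg_task) > PAGE_SIZE)
		return -EINVAL;

	ret = pin_user_pages_unlocked(uaddr, 1, &page, FOLL_WRITE);
	if (ret != 1)
		return ret < 0 ? ret : -EFAULT;

	p->page = page;
	p->kaddr = kmap(page) + offset_in_page(uaddr);
	return 0;
}

/* Teardown would be kunmap(p->page) followed by unpin_user_page(p->page). */
```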
There you can then essentially access the userspace structure through its kernel mapping, basically like normal kernel memory (except of course that you have to keep in mind that userspace can concurrently read and write that memory, so instead of plain loads/stores you have to use at least READ_ONCE()/WRITE_ONCE()).

You of course won't be able to simply traverse userspace pointers in such a situation, only access the specific userspace object that you've prepared beforehand; but luckily it looks like:

 * idle_server_tid_ptr is only accessed in get_idle_server()
   -> which is used from process_waking_worker()
   -> which runs in sleepable context
 * idle_workers_ptr is accessed from:
   -> enqueue_idle_worker
   -> which is also used from process_waking_worker()

so I guess that's not a problem?

> +		}							\
> +									\
> +		pagefault_enable();					\
> +	}								\
> +	ret;								\
> +})
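For completeness, an access through such a pinned kernel mapping would then look roughly like this (a sketch; `kaddr` is a hypothetical pointer obtained by kmap()ing the pinned page, and the `state` field name is an assumption about the userspace struct layout):

```c
/* e.g. in sched_submit_work(): kaddr points at the pinned, kmap()ed
 * userspace struct. Userspace can write the same memory concurrently,
 * hence READ_ONCE()/WRITE_ONCE() instead of plain loads/stores. */
u64 state = READ_ONCE(kaddr->state);
WRITE_ONCE(kaddr->state, state | UMCG_TF_PREEMPTED);
```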