On Mon, Nov 29, 2021 at 1:22 PM Catalin Marinas <catalin.marinas@xxxxxxx> wrote: > On Sat, Nov 27, 2021 at 07:05:39PM +0100, Andreas Gruenbacher wrote: > > On Sat, Nov 27, 2021 at 4:21 PM Catalin Marinas <catalin.marinas@xxxxxxx> wrote: > > > That's similar, somehow, to the arch-specific probing in one of my > > > patches: [1]. We could do the above if we can guarantee that the maximum > > > error margin in copy_to_user() is smaller than SUBPAGE_FAULT_SIZE. For > > > arm64 copy_to_user(), it is fine, but for copy_from_user(), if we ever > > > need to handle fault_in_readable(), it isn't (on arm64 up to 64 bytes > > > even if aligned: reads of large blocks are done in 4 * 16 loads, and if > > > one of them fails e.g. because of the 16-byte sub-page fault, no write > > > is done, hence such larger than 16 delta). > > > > > > If you want something in the generic fault_in_writeable(), we probably > > > need a loop over UACCESS_MAX_WRITE_ERROR in SUBPAGE_FAULT_SIZE > > > increments. But I thought I'd rather keep this in the arch-specific code. > > > > I see, that's even crazier than I'd thought. The looping / probing is > > still pretty generic, so I'd still consider putting it in the generic > > code. > > In the arm64 probe_subpage_user_writeable(), the loop is skipped if > !system_supports_mte() (static label). It doesn't make much difference > for search_ioctl() in terms of performance but I'd like the arch code to > dynamically decide when to probe. An arch_has_subpage_faults() static > inline function would solve this. > > However, the above assumes that the only way of probing is by doing a > get_user/put_user(). A non-destructive probe with MTE would be to read > the actual tags in memory and compare them with the top byte of the > pointer. > > There's the CHERI architecture as well. Although very early days for > arm64, we do have an incipient port (https://www.morello-project.org/). > The void __user * pointers are propagated inside the kernel as 128-bit > capabilities. A fault_in() would check whether the address (bottom > 64-bit) is within the range and permissions specified in the upper > 64-bit of the capability. There is no notion of sub-page fault > granularity here and no need to do a put_user() as the check is just > done on the pointer/capability. > > Given the above, my preference is to keep the probing arch-specific. > > > We also still have fault_in_safe_writeable which is more difficult to > > fix, and fault_in_readable which we don't want to leave behind broken, > > either. > > fault_in_safe_writeable() can be done by using get_user() instead of > put_user() for arm64 MTE and probably SPARC ADI (an alternative is to > read the in-memory tags and compare them with the pointer). So we'd keep the existing fault_in_safe_writeable() logic for the actual fault-in and use get_user() to check for sub-page faults? If so, then that should probably also be hidden in arch code. > For CHERI, that's different again since the fault_in_safe_writeable capability > encodes the read/write permissions independently. > > However, do we actually want to change the fault_in_safe_writeable() and > fault_in_readable() functions at this stage? I could not get any of them > to live-lock, though I only tried btrfs, ext4 and gfs2. As per the > earlier discussion, normal files accesses are guaranteed to make > progress. The only problematic one was O_DIRECT which seems to be > alright for the above filesystems (the fs either bails out after several > attempts or uses GUP to read which skips the uaccess altogether). Only gfs2 uses fault_in_safe_writeable(). For buffered reads, progress is guaranteed because failures are at a byte granularity. O_DIRECT reads and writes happen in device block size granularity, but the pages are grabbed with get_user_pages() before the copying happens. So by the time the copying happens, the pages are guaranteed to be resident, and we don't need to loop around fault_in_*(). You've mentioned before that copying to/from struct page bypasses sub-page fault checking. If that is the case, then the checking probably needs to happen in iomap_dio_bio_iter and dio_refill_pages instead. > Happy to address them if there is a real concern, I just couldn't trigger it. Hopefully it should now be clear why you couldn't. One way of reproducing with fault_in_safe_writeable() would be to use that in btrfs instead of fault_in_writeable(), of course. We're not doing any chunked reads from user space with page faults disabled as far as I'm aware right now, so we probably don't have a reproducer for fault_in_readable(). It would still be worth fixing fault_in_readable() to prevent things from blowing up very unexpectedly later, though. Thanks, Andreas > > > Of course, the above fault_in_writeable() still needs the btrfs > > > search_ioctl() counterpart toget_user_pages change the probing on the actual fault > > > address or offset. > > > > Yes, but that change is relatively simple and it eliminates the need > > for probing the entire buffer, so it's a good thing. Maybe you want to > > add this though: > > > > --- a/fs/btrfs/ioctl.c > > +++ b/fs/btrfs/ioctl.c > > @@ -2202,3 +2202,3 @@ static noinline int search_ioctl(struct inode *inode, > > unsigned long sk_offset = 0; > > - char __user *fault_in_addr; > > + char __user *fault_in_addr, *end; > > > > @@ -2230,6 +2230,6 @@ static noinline int search_ioctl(struct inode *inode, > > fault_in_addr = ubuf; > > + end = ubuf + *buf_size; > > while (1) { > > ret = -EFAULT; > > - if (fault_in_writeable(fault_in_addr, > > - *buf_size - (fault_in_addr - ubuf))) > > + if (fault_in_writeable(fault_in_addr, end - fault_in_addr)) > > break; > > Thanks, I'll add it. > > -- > Catalin >