On 03/23/23 at 06:44am, Lorenzo Stoakes wrote: > On Thu, Mar 23, 2023 at 10:52:09AM +0800, Baoquan He wrote: > > On 03/22/23 at 06:57pm, Lorenzo Stoakes wrote: > > > Having previously laid the foundation for converting vread() to an iterator > > > function, pull the trigger and do so. > > > > > > This patch attempts to provide minimal refactoring and to reflect the > > > existing logic as best we can, for example we continue to zero portions of > > > memory not read, as before. > > > > > > Overall, there should be no functional difference other than a performance > > > improvement in /proc/kcore access to vmalloc regions. > > > > > > Now we have eliminated the need for a bounce buffer in read_kcore_iter(), > > > we dispense with it, and try to write to user memory optimistically but > > > with faults disabled via copy_page_to_iter_nofault(). We already have > > > preemption disabled by holding a spin lock. We continue faulting in until > > > the operation is complete. > > > > I don't understand the sentences here. In vread_iter(), the actual > > content reading is done in aligned_vread_iter(), otherwise we zero > > filling the region. In aligned_vread_iter(), we will use > > vmalloc_to_page() to get the mapped page and read out, otherwise zero > > fill. While in this patch, fault_in_iov_iter_writeable() fault in memory > > of iter one time and will bail out if failed. I am wondering why we > > continue faulting in until the operation is complete, and how that is done. > > This is refererrring to what's happening in kcore.c, not vread_iter(), > i.e. the looped read/faultin. > > The reason we bail out if failt_in_iov_iter_writeable() is that would > indicate an error had occurred. > > The whole point is to _optimistically_ try to perform the operation > assuming the pages are faulted in. Ultimately we fault in via > copy_to_user_nofault() which will either copy data or fail if the pages are > not faulted in (will discuss this below a bit more in response to your > other point). > > If this fails, then we fault in, and try again. We loop because there could > be some extremely unfortunate timing with a race on e.g. swapping out or > migrating pages between faulting in and trying to write out again. > > This is extremely unlikely, but to avoid any chance of breaking userland we > repeat the operation until it completes. In nearly all real-world > situations it'll either work immediately or loop once. Thanks a lot for these helpful details with patience. I got it now. I was mainly confused by the while(true) loop in KCORE_VMALLOC case of read_kcore_iter. Now is there any chance that the faulted in memory is swapped out or migrated again before vread_iter()? fault_in_iov_iter_writeable() will pin the memory? I didn't find it from code and document. Seems it only falults in memory. If yes, there's window between faluting in and copy_to_user_nofault(). > > > > > If we look into the failing point in vread_iter(), it's mainly coming > > from copy_page_to_iter_nofault(), e.g page_copy_sane() checking failed, > > i->data_source checking failed. If these conditional checking failed, > > should we continue reading again and again? And this is not related to > > memory faulting in. I saw your discussion with David, but I am still a > > little lost. Hope I can learn it, thanks in advance. > > > > Actually neither of these are going to happen. page_copy_sane() checks the > sanity of the _source_ pages, and the 'sanity' is defined by whether your > offset and length sit within the (possibly compound) folio. Since we > control this, we can arrange for it never to happen. > > i->data_source is checking that it's an output iterator, however we would > already have checked this when writing ELF headers at the bare minimum, so > we cannot reach this point with an invalid iterator. > > Therefore it is not possible either cause a failure. What could cause a > failure, and what we are checking for, is specified in copyout_nofault() > (in iov_iter.c) which we pass to the iterate_and_advance() macro. Now we > have a fault-injection should_fail_usercopy() which would just trigger a > redo, or copy_to_user_nofault() returning < 0 (e.g. -EFAULT). > > This code is confusing as this function returns the number of bytes _not > copied_ rather than copied. I have tested this to be sure by the way :) > > Therefore the only way for a failure to occur is for memory to not be > faulted in and thus the loop only triggers in this situation. If we fail to > fault in pages for any reason, the whole operation aborts so this should > cover all angles. > > > ...... > > > diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c > > > index 08b795fd80b4..25b44b303b35 100644 > > > --- a/fs/proc/kcore.c > > > +++ b/fs/proc/kcore.c > > ...... > > > @@ -507,13 +503,30 @@ read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter) > > > > > > switch (m->type) { > > > case KCORE_VMALLOC: > > > - vread(buf, (char *)start, tsz); > > > - /* we have to zero-fill user buffer even if no read */ > > > - if (copy_to_iter(buf, tsz, iter) != tsz) { > > > - ret = -EFAULT; > > > - goto out; > > > + { > > > + const char *src = (char *)start; > > > + size_t read = 0, left = tsz; > > > + > > > + /* > > > + * vmalloc uses spinlocks, so we optimistically try to > > > + * read memory. If this fails, fault pages in and try > > > + * again until we are done. > > > + */ > > > + while (true) { > > > + read += vread_iter(iter, src, left); > > > + if (read == tsz) > > > + break; > > > + > > > + src += read; > > > + left -= read; > > > + > > > + if (fault_in_iov_iter_writeable(iter, left)) { > > > + ret = -EFAULT; > > > + goto out; > > > + } > > > } > > > break; > > > + } > > > case KCORE_USER: > > > /* User page is handled prior to normal kernel page: */ > > > if (copy_to_iter((char *)start, tsz, iter) != tsz) { > > >