On Sat, Oct 07, 2017 at 05:16:09PM +0200, Andrea Arcangeli wrote: > Hello, > > I noticed that the addition of a SIGLARM to the selftest broke the > selftest in how it handles UFFD_EVENT_FORK because of an undocumented > UFFD_EVENT_FORK behavior. The testcase doesn't expect "spurious" fork > events to be received. > > As result two or more child uffds may be received by the monitor > during a single fork() invocation of the parent (only the last uffd > will be for the real child, the previous are all spurious associated > with an mm with mm_users = 0). The monitor thread because it's a > selftest, it expects deterministic behavior so when it receives a > spurious uffd, it thinks it received the real uffd of the child and so > it closes the parent uffd and just listens to such spurious uffd. But > such spurious uffd is not the child uffd and when the parent uffd is > closed the real child runs without userfaultfd monitoring (resulting > in immediate corruption being detected in the child). > > The source of the spurious child uffds is this check in kernel/fork.c: > > recalc_sigpending(); > if (signal_pending(current)) { > retval = -ERESTARTNOINTR; > goto bad_fork_cancel_cgroup; > > In practice this isn't a problem for CRIU or any other real non > cooperative use case, because the parent uffd could never be closed > (the only possible concern about this detail, would be if CRIU could > run out of file descriptors in presence of signal flood, but even that > would be a graceful failure with no memory corruption possible). Indeed in CRIU we don't close the parent uffd and a couple of spurious UFFD_EVENT_FORK won't cause a real problem. Yet, if we'll run out of file descriptors because of signal flood during migration, even with graceful failure we'd loose the migrated process entirely. > We don't have a userfaultfd_exit hook to send a POLLHUP when the mm is > destroyed. If we had that, the spurious uffd could be collected and > release the file handle in the userfaultfd monitor fd space. To send > such POLLHUP we'd need to queue all userfaultfd_ctx in a list in the > mm_struct. > > The other possible solution to the possible concern of running out of > file descriptors in the CRIU userfaultfd monitor, is to simply prevent > the generation of the spurious uffds and in turn removing this > detail. That's not hard but that would move uffd structures way up > into the callers so the UFFD_EVENT_FORK is only delivered after the > above signal_pending check. Currently userfault related code in fork.c neatly fits into dup_mmap() and moving the uffd structures up into the callers would be ugly :( However, calling dup_userfaultfd_complete() in dup_mmap() will cause spurious UFFD_EVENT_FORK if fork() fails at any point after copy_mm(). Maybe we do need to queue all duplicated userfaultfd_ctx in the mm_struct of the child process and call dup_userfaultfd_complete() closer to the end of copy_process(). I'm going to experiment with the list of userfaultfd_ctx in mm_struct, it seems to me that it may have additional value, e.g. to simplify userfaultfd_event_wait_completion(). I'll need a bit of time to see if I'm not talking complete nonsense :) > For now I added an easy reproducer to the testcase and sigprocmask > SIG_BLOCK/UNBLOCK around fork() to verify that such a change restores > all "cooperative" expectations of the "non-cooperative" part of the > testcase. > > This survived fine a 12+ hour load with 4 selftests in parallel > (shmem, hugetlb, hugetlb_shared, anon). > > I found out about this detail the first time with heavy host CPU over > commit that enlarged the guest fork runtime long enough for a SIGALARM > to hit it and it wasn't easy to reproduce until I added the signal > flooder to the selftest. > > I'll cleanup and submit the below selftest fix shortly with a separate > submit, but here the question is if we want to make any change to > UFFD_EVENT_FORK for this signal issue. > > If we change anything in the UFFD_EVENT_FORK processing for this, then > the sigprocmask calls and the two #defines should be dropped from the > selftest so the signal flooder will then validate any kernel change > done for it. > > If we change nothing then this detail should probably be documented > just in case somebody has deterministic expectations for the > non-cooperative features (like the selftest has). > > Thanks, > Andrea > > diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c > index de2f9ec8a87f..3f7b6e91050c 100644 > --- a/tools/testing/selftests/vm/userfaultfd.c > +++ b/tools/testing/selftests/vm/userfaultfd.c > @@ -69,6 +69,9 @@ > #include <setjmp.h> > #include <stdbool.h> > > +//#define REPRODUCE_SPURIOUS_UFFD_EVENT_FORK_IN_SIG_TEST > +//#define REPRODUCE_SPURIOUS_UFFD_EVENT_FORK_IN_EVENTS_TEST > + > #ifdef __NR_userfaultfd > > static unsigned long nr_cpus, nr_pages, nr_pages_per_cpu, page_size; > @@ -159,8 +162,7 @@ static void hugetlb_allocate_area(void **alloc_area) > void *area_alias = NULL; > char **alloc_area_alias; > *alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE, > - (map_shared ? MAP_SHARED : MAP_PRIVATE) | > - MAP_HUGETLB, > + (map_shared ? MAP_SHARED : MAP_PRIVATE), > huge_fd, *alloc_area == area_src ? 0 : > nr_pages * page_size); > if (*alloc_area == MAP_FAILED) { > @@ -170,7 +172,7 @@ static void hugetlb_allocate_area(void **alloc_area) > > if (map_shared) { > area_alias = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE, > - MAP_SHARED | MAP_HUGETLB, > + MAP_SHARED, > huge_fd, *alloc_area == area_src ? 0 : > nr_pages * page_size); > if (area_alias == MAP_FAILED) { > @@ -457,8 +459,11 @@ static void *uffd_poll_thread(void *arg) > ret = poll(pollfd, 2, -1); > if (!ret) > fprintf(stderr, "poll error %d\n", ret), exit(1); > - if (ret < 0) > + if (ret < 0) { > + if (errno == EINTR) > + continue; > perror("poll"), exit(1); > + } > if (pollfd[1].revents & POLLIN) { > if (read(pollfd[1].fd, &tmp_chr, 1) != 1) > fprintf(stderr, "read pipefd error\n"), > @@ -470,8 +475,13 @@ static void *uffd_poll_thread(void *arg) > pollfd[0].revents), exit(1); > ret = read(uffd, &msg, sizeof(msg)); > if (ret < 0) { > - if (errno == EAGAIN) > + if (errno == EINTR) { > + fprintf(stderr, "nonblocking read -EINTR\n"); > continue; > + } > + if (errno == EAGAIN) { > + continue; > + } > perror("nonblocking read error"), exit(1); > } > switch (msg.event) { > @@ -734,14 +744,23 @@ static int faulting_process(int signal_test) > count = *area_count(area_dst, nr); > if (count != count_verify[nr]) { > fprintf(stderr, > - "nr %lu memory corruption %Lu %Lu\n", > + "nr %lu memory corruption %Lu %Lu %d\n", > nr, count, > - count_verify[nr]), exit(1); > + count_verify[nr], signal_test), exit(1); > } > } > > - if (signal_test) > + if (signal_test) { > + /* restore SIGBUS defaults after signal_test completed */ > + sigbuf = NULL; > + memset(&act, 0, sizeof(act)); > + act.sa_handler = SIG_DFL; > + if (sigaction(SIGBUS, &act, 0)) { > + perror("sigaction SIGBUS SIG_DFL"); > + return 1; > + } > return signalled != split_nr_pages; > + } > > if (test_type == TEST_HUGETLB) > return 0; > @@ -920,14 +939,25 @@ static int userfaultfd_events_test(void) > if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, NULL)) > perror("uffd_poll_thread create"), exit(1); > > + /* See the comment in background_signal_flood */ > + sigset_t set; > + sigemptyset(&set); > + sigaddset(&set, SIGUSR1); > + sigaddset(&set, SIGALRM); > +#ifndef REPRODUCE_SPURIOUS_UFFD_EVENT_FORK_IN_EVENTS_TEST > + /* FIXME, should be pthread_sigmask */ > + sigprocmask(SIG_BLOCK, &set, NULL); > +#endif > pid = fork(); > + sigprocmask(SIG_UNBLOCK, &set, NULL); > if (pid < 0) > perror("fork"), exit(1); > > if (!pid) > - return faulting_process(0); > + exit(faulting_process(0)); > > - waitpid(pid, &err, 0); > + if (waitpid(pid, &err, 0) != pid) > + perror("waitpid"), exit(1); > if (err) > fprintf(stderr, "faulting process failed\n"), exit(1); > > @@ -985,14 +1015,25 @@ static int userfaultfd_sig_test(void) > if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, NULL)) > perror("uffd_poll_thread create"), exit(1); > > + /* See the comment in background_signal_flood */ > + sigset_t set; > + sigemptyset(&set); > + sigaddset(&set, SIGUSR1); > + sigaddset(&set, SIGALRM); > +#ifndef REPRODUCE_SPURIOUS_UFFD_EVENT_FORK_IN_SIG_TEST > + /* FIXME, should be pthread_sigmask */ > + sigprocmask(SIG_BLOCK, &set, NULL); > +#endif > pid = fork(); > + sigprocmask(SIG_UNBLOCK, &set, NULL); > if (pid < 0) > perror("fork"), exit(1); > > if (!pid) > exit(faulting_process(2)); > > - waitpid(pid, &err, 0); > + if (waitpid(pid, &err, 0) != pid) > + perror("waitpid"), exit(1); > if (err) > fprintf(stderr, "faulting process failed\n"), exit(1); > > @@ -1267,14 +1308,66 @@ static void sigalrm(int sig) > alarm(ALARM_INTERVAL_SECS); > } > > +static void sigusr1(int sig) > +{ > + if (sig != SIGUSR1) > + abort(); > +} > + > +/* > + * This helps reproduce the UFFD_EVENT_FORK behavior where multiple > + * child uffds are created despite there's only one parent. That is ok > + * as long as they're all tracked. Non cooperative users will work > + * fine even if the mm belonging to the additional uffd are already > + * destroyed, simply no userfault or event will happen there. To > + * optimize those away we'd need to send the UFFD_EVENT_FORK after the > + * signal_pending check (way up into the callers of > + * dup_mmap()). Alternatively we'd need a reliable way to be notified > + * with POLLHUP when the mm exits so even if spurious uffd are > + * initially received by the monitor thread, they can be garbage > + * collected, but that would require to queue up userfaultfd_ctx in > + * the mm_struct. This signal flood makes sure signals are working > + * fine at all times and the only tricky part is the UFFD_EVENT_FORK > + * handling. If you make cooperative assumptions on non cooperative > + * UFFD_EVENT_FORK, masking signals around fork() is necessary with > + * the currrent API (and it will remain backwards compatible even if > + * we lift this requirement). > + */ > +static pid_t background_signal_flood(void) > +{ > + pid_t parent_pid = getpid(), pid; > + int n; > + pid = fork(); > + if (pid < 0) > + perror("fork"), exit(1); > + if (!pid) { > + for (;;) { > + for (n = 0; n < 1000; n++) { > + if (kill(parent_pid, SIGUSR1)) > + exit(0); > + usleep(1000); > + } > + sleep(1); > + } > + } > + return pid; > +} > + > int main(int argc, char **argv) > { > + pid_t signal_flooder; > + int ret; > if (argc < 4) > fprintf(stderr, "Usage: <test type> <MiB> <bounces> [hugetlbfs_file]\n"), > exit(1); > > + /* FIXME: signal not to be used in multithreaded... */ > if (signal(SIGALRM, sigalrm) == SIG_ERR) > fprintf(stderr, "failed to arm SIGALRM"), exit(1); > + if (signal(SIGUSR1, sigusr1) == SIG_ERR) > + fprintf(stderr, "failed to arm SIGUSR1"), exit(1); > + > + signal_flooder = background_signal_flood(); > alarm(ALARM_INTERVAL_SECS); > > set_test_type(argv[1]); > @@ -1312,7 +1405,11 @@ int main(int argc, char **argv) > } > printf("nr_pages: %lu, nr_pages_per_cpu: %lu\n", > nr_pages, nr_pages_per_cpu); > - return userfaultfd_stress(); > + ret = userfaultfd_stress(); > + kill(signal_flooder, SIGTERM); > + if (waitpid(signal_flooder, NULL, 0) != signal_flooder) > + perror("waitpid"), exit(1); > + return ret; > } > > #else /* __NR_userfaultfd */ > -- Sincerely yours, Mike. -- To unsubscribe from this list: send the line "unsubscribe linux-api" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html