On 02.12.22 12:03, David Hildenbrand wrote:
On 01.12.22 23:30, Andrew Morton wrote:
On Thu, 1 Dec 2022 16:42:52 +0100 David Hildenbrand <david@xxxxxxxxxx> wrote:
On 01.12.22 16:28, Peter Xu wrote:
I didn't reply here because I have already replied with the question in
previous version with a few attempts. Quotting myself:
https://lore.kernel.org/all/Y3KgYeMTdTM0FN5W@x1n/
The thing is recovering the pte into its original form is the
safest approach to me, so I think we need justification on why it's
always safe to set the write bit.
I've also got another longer email trying to explain why I think it's the
other way round to be justfied, rather than justifying removal of the write
bit for a read migration entry, here:
And I disagree for this patch that is supposed to fix this hunk:
@@ -243,11 +243,15 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
entry = pte_to_swp_entry(*pvmw.pte);
if (is_write_migration_entry(entry))
pte = maybe_mkwrite(pte, vma);
+ else if (pte_swp_uffd_wp(*pvmw.pte))
+ pte = pte_mkuffd_wp(pte);
if (unlikely(is_zone_device_page(new))) {
if (is_device_private_page(new)) {
entry = make_device_private_entry(new, pte_write(pte));
pte = swp_entry_to_pte(entry);
+ if (pte_swp_uffd_wp(*pvmw.pte))
+ pte = pte_mkuffd_wp(pte);
}
}
David, I'm unclear on what you mean by the above. Can you please
expand?
There is really nothing to justify the other way around here.
If it's broken fix it independently and properly backport it independenty.
But we don't know about any such broken case.
I have no energy to spare to argue further ;)
This is a silent data loss bug, which is about as bad as it gets.
Under obscure conditions, fortunately. But please let's keep working
it. Let's aim for something minimal for backporting purposes. We can
revisit any cleanliness issues later.
Okay, you activated my energy reserves.
David, do you feel that the proposed fix will at least address the bug
without adverse side-effects?
Usually, when I suspect something is dodgy I unconsciously push back
harder than I usually would.
I just looked into the issue once again and realized that this patch
here (and also my alternative proposal) most likely tackles the
more-generic issue from the wrong direction. I found yet another such
bug (most probably two, just too lazy to write another reproducer).
Migration code does the right thing here -- IMHO -- and the issue should
be fixed differently.
I'm testing an alternative patch right now and will share it later
today, along with a reproducer.
mprotect() reproducer attached.
--
Thanks,
David / dhildenb
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <linux/memfd.h>
#include <linux/userfaultfd.h>
size_t pagesize;
int uffd;
static void *uffd_thread_fn(void *arg)
{
static struct uffd_msg msg;
ssize_t nread;
while (1) {
struct pollfd pollfd;
int nready;
pollfd.fd = uffd;
pollfd.events = POLLIN;
nready = poll(&pollfd, 1, -1);
if (nready == -1) {
fprintf(stderr, "poll() failed: %d\n", errno);
exit(1);
}
nread = read(uffd, &msg, sizeof(msg));
if (nread <= 0)
continue;
if (msg.event != UFFD_EVENT_PAGEFAULT ||
!(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
printf("FAIL: wrong uffd-wp event fired\n");
exit(1);
}
printf("PASS: uffd-wp fired\n");
exit(0);
}
}
static int setup_uffd(char *map)
{
struct uffdio_api uffdio_api;
struct uffdio_register uffdio_register;
struct uffdio_range uffd_range;
pthread_t thread;
uffd = syscall(__NR_userfaultfd,
O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
if (uffd < 0) {
fprintf(stderr, "syscall() failed: %d\n", errno);
return -errno;
}
uffdio_api.api = UFFD_API;
uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) {
fprintf(stderr, "UFFDIO_API failed: %d\n", errno);
return -errno;
}
if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) {
fprintf(stderr, "UFFD_FEATURE_WRITEPROTECT missing\n");
return -ENOSYS;
}
uffdio_register.range.start = (unsigned long) map;
uffdio_register.range.len = pagesize;
uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) < 0) {
fprintf(stderr, "UFFDIO_REGISTER failed: %d\n", errno);
return -errno;
}
pthread_create(&thread, NULL, uffd_thread_fn, NULL);
return 0;
}
int main(int argc, char **argv)
{
struct uffdio_writeprotect uffd_writeprotect;
char *map;
int fd;
pagesize = getpagesize();
fd = memfd_create("test", 0);
if (fd < 0) {
fprintf(stderr, "memfd_create() failed\n");
return -errno;
}
if (ftruncate(fd, pagesize)) {
fprintf(stderr, "ftruncate() failed\n");
return -errno;
}
/* Start out without write protection. */
map = mmap(NULL, pagesize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (map == MAP_FAILED) {
fprintf(stderr, "mmap() failed\n");
return -errno;
}
if (setup_uffd(map))
return 1;
/* Populate a page ... */
memset(map, 0, pagesize);
/* ... and write-protect it using uffd-wp. */
uffd_writeprotect.range.start = (unsigned long) map;
uffd_writeprotect.range.len = pagesize;
uffd_writeprotect.mode = UFFDIO_WRITEPROTECT_MODE_WP;
if (ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect)) {
fprintf(stderr, "UFFDIO_WRITEPROTECT failed: %d\n", errno);
return -errno;
}
/* Write-protect the whole mapping temporarily. */
mprotect(map, pagesize, PROT_READ);
mprotect(map, pagesize, PROT_READ|PROT_WRITE);
/* Test if uffd-wp fires. */
memset(map, 1, pagesize);
printf("FAIL: uffd-wp did not fire\n");
return 1;
}