On Tue, Jan 31, 2023 at 09:47:01AM +0100, David Hildenbrand wrote: > On 12.12.22 14:02, David Hildenbrand wrote: > > On sparc64, there is no HW modified bit, therefore, SW tracks via a SW > > bit if the PTE is dirty via pte_mkdirty(). However, pte_mkdirty() > > currently also unconditionally sets the HW writable bit, which is wrong. > > > > pte_mkdirty() is not supposed to make a PTE actually writable, unless the > > SW writable bit (pte_write()) indicates that the PTE is not > > write-protected. Fortunately, sparc64 also defines a SW writable bit. > > > > For example, this already turned into a problem in the context of > > THP splitting as documented in commit 624a2c94f5b7 ("Partly revert "mm/thp: > > carry over dirty bit when thp splits on pmd") and might be an issue during > > page migration in mm/migrate.c:remove_migration_pte() as well where we: > > if (folio_test_dirty(folio) && is_migration_entry_dirty(entry)) > > pte = pte_mkdirty(pte); > > > > But more general, anything like: > > maybe_mkwrite(pte_mkdirty(pte), vma) > > code is broken on sparc64, because it will unconditionally set the HW > > writable bit even if the SW writable bit is not set. > > > > Simple reproducer that will result in a writable PTE after ptrace > > access, to highlight the problem and as an easy way to verify if it has > > been fixed: > > > > -------------------------------------------------------------------------- > > #include <fcntl.h> > > #include <signal.h> > > #include <unistd.h> > > #include <string.h> > > #include <errno.h> > > #include <stdlib.h> > > #include <sys/mman.h> > > > > static void signal_handler(int sig) > > { > > if (sig == SIGSEGV) > > printf("[PASS] SIGSEGV generated\n"); > > else > > printf("[FAIL] wrong signal generated\n"); > > exit(0); > > } > > > > int main(void) > > { > > size_t pagesize = getpagesize(); > > char data = 1; > > off_t offs; > > int mem_fd; > > char *map; > > int ret; > > > > mem_fd = open("/proc/self/mem", O_RDWR); > > if (mem_fd < 0) { > > fprintf(stderr, "open(/proc/self/mem) failed: %d\n", errno); > > return 1; > > } > > > > map = mmap(NULL, pagesize, PROT_READ, MAP_PRIVATE|MAP_ANON, -1 ,0); > > if (map == MAP_FAILED) { > > fprintf(stderr, "mmap() failed: %d\n", errno); > > return 1; > > } > > > > printf("original: %x\n", *map); > > > > /* debug access */ > > offs = lseek(mem_fd, (uintptr_t) map, SEEK_SET); > > ret = write(mem_fd, &data, 1); > > if (ret != 1) { > > fprintf(stderr, "pwrite(/proc/self/mem) failed with %d: %d\n", ret, errno); > > return 1; > > } > > if (*map != data) { > > fprintf(stderr, "pwrite(/proc/self/mem) not visible\n"); > > return 1; > > } > > > > printf("ptrace: %x\n", *map); > > > > /* Install signal handler. */ > > if (signal(SIGSEGV, signal_handler) == SIG_ERR) { > > fprintf(stderr, "signal() failed\n"); > > return 1; > > } > > > > /* Ordinary access. */ > > *map = 2; > > > > printf("access: %x\n", *map); > > > > printf("[FAIL] SIGSEGV not generated\n"); > > > > return 0; > > } > > -------------------------------------------------------------------------- > > > > Without this commit (sun4u in QEMU): > > # ./reproducer > > original: 0 > > ptrace: 1 > > access: 2 > > [FAIL] SIGSEGV not generated > > > > Let's fix this by setting the HW writable bit only if both, the SW dirty > > bit and the SW writable bit are set. This matches, for example, how > > s390x handles pte_mkwrite() and pte_mkdirty() -- except, that they have > > to clear the _PAGE_PROTECT bit. > > > > We have to move pte_dirty() and pte_dirty() up. The code patching > > mechanism and handling constants > 22bit is a bit special on sparc64. > > > > With this commit (sun4u in QEMU): > > # ./reproducer > > original: 0 > > ptrace: 1 > > [PASS] SIGSEGV generated > > > > This handling seems to have been in place forever. > > > > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") > > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> > > Cc: "David S. Miller" <davem@xxxxxxxxxxxxx> > > Cc: Peter Xu <peterx@xxxxxxxxxx> > > Cc: Hev <r@xxxxxx> > > Cc: Anatoly Pugachev <matorola@xxxxxxxxx> > > Cc: Raghavendra K T <raghavendra.kt@xxxxxxx> > > Cc: Thorsten Leemhuis <regressions@xxxxxxxxxxxxx> > > Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx> > > Cc: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx> > > Cc: Juergen Gross <jgross@xxxxxxxx> > > Signed-off-by: David Hildenbrand <david@xxxxxxxxxx> > > --- > > Ping I agree with David that the current sparc64 impl of pte_mkdirty is suspecious. What David mentioned on page migration above is correct and has another report here from Nick recently: https://lore.kernel.org/all/CADyTPEzsvdRC15+Z5T3oryofwRYqHmHzwqRmJKJoHB3d7Tdayw@xxxxxxxxxxxxxx/ If this patch is hopefully correct (which I cannot tell as I know little on sparc64) and can be merged, it'll be the cleanest solution, comparing to what I provided here: https://lore.kernel.org/all/Y9bvwz4FIOQ+D8c4@x1n/ And I assume it'll also fix things like the reproducer being attached on wrongly applying write bit with FOLL_FORCE, so it fixes more than that. I plan to keep posting that fix I referenced above for the breakage because that'll still be the safest so far, but that can change if someone from sparc64 can have a look at this and ack it. Thanks, -- Peter Xu