Hi guys,

I have looked a bit more at this issue, found ways to reproduce it, and looked
further at both James Hogan's and Paul Burton's patches.

--------------------------------------oOo--------------------------------------
The Problem
-----------
On our MIPS 24KEc, the dcache is 32 KBytes and 4-way set-associative, i.e.
8 KBytes per way. Had Linux used a page size of 8 KBytes, we wouldn't have this
dcache aliasing problem.

With a 4 KBytes page size, however, it must be ensured that the color of
user-land pages is the same as the color of kernel-space pages when memory is
shared between the two. In our case, this means that NOT only bits 11:0
(page-aligned) of the addresses must be identical, but also bit 12 (making them
color-aligned).

In order to expose the problem, we must therefore attempt to have VDSO in the
kernel get a different page color than VDSO in the user-land mapping.

When a program loads, the first data that gets allocated is for glibc's loader
(ld-2.27.so in our case), and the next thing that gets allocated is the two
pages needed for VDSO ([vvar] and [vdso]). Therefore, the page color of
user-space VDSO highly depends on the size of the loader's requested data.

In my original post
(https://www.linux-mips.org/archives/linux-mips/2017-06/msg00621.html), I wrote
that it started happening when compiling glibc with
'-fasynchronous-unwind-tables'. This may have changed the loader's data size
from an even number of pages to an odd number of pages or vice versa, thereby
making the color of the subsequent VDSO user-space mapping likely to be
different from the kernel's.

A change in the Linux kernel may also produce this, because it can change the
page color of the address where 'vdso_data' (declared in
arch/mips/kernel/vdso.c) starts.

For completeness, here's a snippet from the pagemap for some random process:

Section Name    Perm Virt Start Virt End   Virt Size  Phys Size
--------------- ---- ---------- ---------- ---------- ----------
...
                rwxp 0x77d04000 0x77d0e000      40960      36864
[vvar]          r--p 0x77d0e000 0x77d0f000       4096          0
[vdso]          r-xp 0x77d0f000 0x77d10000       4096       4096
/lib/ld-2.27.so r-xp 0x77d10000 0x77d11000       4096       4096
/lib/ld-2.27.so rwxp 0x77d11000 0x77d12000       4096       4096
[stack]         rwxp 0x7ff81000 0x7ffa2000     135168      28672

--------------------------------------oOo--------------------------------------
Modify kernel to provoke the issue
----------------------------------
In order to provoke the problem, we must first figure out whether the color of
'vdso_data' in kernel-space is different from that of [vvar] in user-space.

The attached patch named 'vdso-chk-1.patch' prints the address of &vdso_data
and the corresponding user-land address, followed by "aligned" if bit 12 is
identical in the two addresses and "NOT ALIGNED" if not.

If "aligned" is printed for all started processes, I suggest trying the
attached patch named 'vdso-chk-2.patch'. This declares a dummy variable in
vdso.c that causes the linker to place vdso_data at a differently colored page.

--------------------------------------oOo--------------------------------------
Reproduce
---------
Once the kernel prints "NOT ALIGNED", you can attempt to provoke the error.

I've attached a program, 'provoke.c', that prints to stderr whenever two
consecutive timestamps are received out of order from the kernel.

To increase the chance of errors occurring, the program must be instantiated
many times in parallel. The following shell command creates 50 simultaneous
instances of it:

$ for i in $(seq 50); do provoke > /dev/null & done

An example of a snippet of the output when it goes wrong:

...
[ 46.926329] tgid = 171, pid = 171, comm = timeofday: data_addr = 0x77f60000, &vdso_data = 0x80525000, &dummy = 0x80524000 => NOT ALIGNED
[ 46.986344] tgid = 172, pid = 172, comm = timeofday: data_addr = 0x77126000, &vdso_data = 0x80525000, &dummy = 0x80524000 => NOT ALIGNED
[ 47.070821] tgid = 173, pid = 173, comm = timeofday: data_addr = 0x7701c000, &vdso_data = 0x80525000, &dummy = 0x80524000 => NOT ALIGNED
[ 47.090460] tgid = 170, pid = 170, comm = timeofday: data_addr = 0x779c2000, &vdso_data = 0x80525000, &dummy = 0x80524000 => NOT ALIGNED
[ 47.138366] tgid = 174, pid = 174, comm = timeofday: data_addr = 0x77f60000, &vdso_data = 0x80525000, &dummy = 0x80524000 => NOT ALIGNED
[ 47.166330] tgid = 175, pid = 175, comm = timeofday: data_addr = 0x77406000, &vdso_data = 0x80525000, &dummy = 0x80524000 => NOT ALIGNED
tid = 126: Ran 10000 times. error_cnt = 0, success_cnt = 10000
tid = 130: Ran 10000 times. error_cnt = 0, success_cnt = 10000
Error: tid = 174: clock_gettime(): Prev = 56043, Cur = 53056, diff = -2987
Error: tid = 161: clock_gettime(): Prev = 56247, Cur = 53060, diff = -3187
Error: tid = 168: clock_gettime(): Prev = 56251, Cur = 53064, diff = -3187
Error: tid = 137: clock_gettime(): Prev = 56255, Cur = 53068, diff = -3187
Error: tid = 175: clock_gettime(): Prev = 56259, Cur = 53072, diff = -3187
Error: tid = 129: clock_gettime(): Prev = 56263, Cur = 53076, diff = -3187
tid = 129: Ran 10000 times. error_cnt = 1, success_cnt = 9999
Error: tid = 155: clock_gettime(): Prev = 56267, Cur = 53078, diff = -3189
Error: tid = 165: clock_gettime(): Prev = 56271, Cur = 53080, diff = -3191
...

--------------------------------------oOo--------------------------------------
Trying out James Hogan's patch
------------------------------
With the error-producing version of vdso-chk-X.patch applied, apply James'
patch and run the 'provoke' program again.

This works, since the kernel- and user-space colors always end up identical.

--------------------------------------oOo--------------------------------------
Trying out Paul Burton's patch
------------------------------
With the error-producing version of vdso-chk-X.patch applied, apply Paul's
patch and run the 'provoke' program again.

This also works. Paul's patch allocates twice the amount of needed VM, but I
guess that's fine, as it's also less intrusive (no changes to mmap.c).

Regards,
René Nielsen

-----Original Message-----
From: Paul Burton [mailto:paul.burton@xxxxxxxx]
Sent: 30 August 2018 20:01
To: Alexandre Belloni <alexandre.belloni@xxxxxxxxxxx>; Rene Nielsen <rene.nielsen@xxxxxxxxxxxxx>; Hauke Mehrtens <hauke@xxxxxxxxxx>
Cc: linux-mips@xxxxxxxxxxxxxx; Paul Burton <paul.burton@xxxxxxxx>; James Hogan <jhogan@xxxxxxxxxx>; stable@xxxxxxxxxxxxxxx
Subject: [PATCH] MIPS: VDSO: Match data page cache colouring when D$ aliases

When a system suffers from dcache aliasing a user program may observe
stale VDSO data from an aliased cache line. Notably this can break the
expectation that clock_gettime(CLOCK_MONOTONIC, ...) is, as its name
suggests, monotonic.

In order to ensure that users observe updates to the VDSO data page as
intended, align the user mappings of the VDSO data page such that their
cache colouring matches that of the virtual address range which the
kernel will use to update the data page - typically its unmapped address
within kseg0.

This ensures that we don't introduce aliasing cache lines for the VDSO
data page, and therefore that userland will observe updates without
requiring cache invalidation.

Signed-off-by: Paul Burton <paul.burton@xxxxxxxx>
Reported-by: Hauke Mehrtens <hauke@xxxxxxxxxx>
Reported-by: Rene Nielsen <rene.nielsen@xxxxxxxxxxxxx>
Reported-by: Alexandre Belloni <alexandre.belloni@xxxxxxxxxxx>
Fixes: ebb5e78cc634 ("MIPS: Initial implementation of a VDSO")
Cc: James Hogan <jhogan@xxxxxxxxxx>
Cc: linux-mips@xxxxxxxxxxxxxx
Cc: stable@xxxxxxxxxxxxxxx # v4.4+
---
Hi Alexandre,

Could you try this out on your Ocelot system? Hopefully it'll solve the
problem just as well as James' patch but doesn't need the questionable
change to arch_get_unmapped_area_common().

Thanks,
    Paul
---
 arch/mips/kernel/vdso.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 019035d7225c..5fb617a42335 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -13,6 +13,7 @@
 #include <linux/err.h>
 #include <linux/init.h>
 #include <linux/ioport.h>
+#include <linux/kernel.h>
 #include <linux/mm.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
@@ -20,6 +21,7 @@
 
 #include <asm/abi.h>
 #include <asm/mips-cps.h>
+#include <asm/page.h>
 #include <asm/vdso.h>
 
 /* Kernel-provided data used by the VDSO. */
@@ -128,12 +130,30 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	vvar_size = gic_size + PAGE_SIZE;
 	size = vvar_size + image->size;
 
+	/*
+	 * Find a region that's large enough for us to perform the
+	 * colour-matching alignment below.
+	 */
+	if (cpu_has_dc_aliases)
+		size += shm_align_mask + 1;
+
 	base = get_unmapped_area(NULL, 0, size, 0, 0);
 	if (IS_ERR_VALUE(base)) {
 		ret = base;
 		goto out;
 	}
 
+	/*
+	 * If we suffer from dcache aliasing, ensure that the VDSO data page is
+	 * coloured the same as the kernel's mapping of that memory. This
+	 * ensures that when the kernel updates the VDSO data userland will see
+	 * it without requiring cache invalidations.
+	 */
+	if (cpu_has_dc_aliases) {
+		base = __ALIGN_MASK(base, shm_align_mask);
+		base += ((unsigned long)&vdso_data - gic_size) & shm_align_mask;
+	}
+
 	data_addr = base + gic_size;
 	vdso_addr = data_addr + PAGE_SIZE;
 
-- 
2.18.0

Attachment: vdso-chk-1.patch
Attachment: vdso-chk-2.patch

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <signal.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <inttypes.h>

// for i in $(seq 50); do provoke > /dev/null & done

static volatile int run = 1;

static void ctrl_c_handler(int sig)
{
    run = 0;
}

static uint64_t milliseconds(void)
{
    struct timespec time;

    if (clock_gettime(CLOCK_MONOTONIC, &time) == 0) {
        return ((uint64_t)time.tv_sec * 1000ULL) + (time.tv_nsec / 1000000);
    }

    fprintf(stderr, "clock_gettime() failed: %s\n", strerror(errno));
    exit(-1);
}

int main(void)
{
    uint32_t i, error_cnt = 0;
    uint64_t prev_time = 0;
    int tid = syscall(SYS_gettid);

    signal(SIGINT, ctrl_c_handler);

    for (i = 0; i < 10000 && run; i++) {
        uint64_t cur_time = milliseconds();

        if (cur_time < prev_time) {
            error_cnt++;
            fprintf(stderr, "Error: tid = %d: clock_gettime(): Prev = %" PRIu64 ", Cur = %" PRIu64 ", diff = -%" PRIu64 "\n", tid, prev_time, cur_time, prev_time - cur_time);
        } else {
            fprintf(stdout, "Info: tid = %d: clock_gettime(): Prev = %" PRIu64 ", Cur = %" PRIu64 ", diff = %" PRIu64 "\n", tid, prev_time, cur_time, cur_time - prev_time);
        }

        prev_time = cur_time;
    }

    fprintf(stderr, "tid = %d: Ran %u times. error_cnt = %u, success_cnt = %u\n", tid, i, error_cnt, i - error_cnt);
    return error_cnt ? -1 : 0;
}