On 19 July 2017 at 05:57, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Tue, Jul 18, 2017 at 7:34 AM, Peter Jones <pjones@xxxxxxxxxx> wrote:
>>
>> Well, that's kind of amazing, given 3c004b4f7eab239e switched us /to/
>> using ioremap_wc() for the exact same reason.  I'm not against letting
>> the user force one way or the other if it helps, though it sure would
>> be nice to know why.
>
> It's kind of amazing for another reason too: how is ioremap_wc()
> _possibly_ slower than ioremap_nocache() (which is what plain
> ioremap() is)?

In normal operation the console is faster with _wc. It's the side
effects on other cores that are the problem.

> Or maybe it really is something where there is one global write queue
> per die (not per CPU), and having that write queue "active" doing
> combining will slow down every core due to some crazy synchronization
> issue?
>
> x86 people, look at what Dave Airlie did, I'll just repeat it because
> it sounds so crazy:
>
>> A customer noticed major slowdowns while logging to the console
>> with write combining enabled, on other tasks running on the same
>> CPU (10x or greater slowdown on all other cores on the same CPU
>> as is doing the logging).
>>
>> I reproduced this on a machine with dual CPUs:
>> Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 core)
>>
>> I wrote a test that just mmaps the PCI BAR and writes to it in
>> a loop. While this was running in the background on a single
>> core (taskset -c 1), building a kernel up to init/version.o
>> (taskset -c 8) went from 13s to 133s or so. I've yet to explain
>> why this occurs or what is going wrong; I haven't managed to find
>> a perf command that gives any insight into this.
>
> So basically the UC vs WC thing seems to slow down somebody *else* (in
> this case a kernel compile) on another core entirely, by a factor of
> 10x. Maybe the WC writer itself is much faster, but _others_ are
> slowed down enormously.
>
> Whaa? That just seems incredible.

Yes, I've been staring at this for a while now trying to narrow it
down. I've been a bit slow on testing it on a wider range of Intel
CPUs; I've only really managed to play with that particular machine.
I've attached two test files. Compile both of them (I just used
"make write_resource burn-cycles"). On my test machine, cores 1 and 8
are on the same die.

  time taskset -c 1 ./burn-cycles
takes about 6 seconds.

With
  taskset -c 8 ./write_resource wc
running in the background,
  time taskset -c 1 ./burn-cycles
takes about 1 minute.

I've also noticed that passing "wc" to write_resource (i.e. mapping
resource0_wc rather than resource0) doesn't seem to make a difference,
so I think what matters is that efifb has already mapped the area with
_wc and set PAT on it for write-combining, so we always get WC on that
BAR.

From the other person seeing it:

"I did a similar test some time ago; the result was the same. I ran
some benchmarks, and it seems that when the data set fits in L1 cache
there is no significant performance degradation."

Dave.
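For reference, a minimal sketch of the kind of user-forcible UC-vs-WC
toggle Peter mentions could look like the following. This is purely
illustrative and is not the efifb code or the patch under discussion;
the "nowc" parameter name and the map_fb() helper are invented for the
example, and only ioremap_wc()/ioremap_nocache() are real kernel
interfaces:

#include <linux/module.h>
#include <linux/io.h>

/* Hypothetical module parameter to force an uncached framebuffer mapping. */
static bool nowc;
module_param(nowc, bool, 0444);

static void __iomem *map_fb(resource_size_t base, unsigned long size)
{
	/*
	 * ioremap_wc() gives a write-combining (WC) mapping;
	 * ioremap_nocache() -- what plain ioremap() is on x86 -- is
	 * uncached (UC).
	 */
	if (nowc)
		return ioremap_nocache(base, size);
	return ioremap_wc(base, size);
}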
/*
 * write_resource.c: mmap a PCI BAR via sysfs (the _wc resource if "wc"
 * is passed on the command line, otherwise the plain/UC resource) and
 * hammer it with 32-bit stores in a loop.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
	int i, j;
	char *resname;

	if (argc > 1 && !strcmp(argv[1], "wc"))
		resname = "/sys/bus/pci/devices/0000:01:00.1/resource0_wc";
	else
		resname = "/sys/bus/pci/devices/0000:01:00.1/resource0";

	int fd = open(resname, O_RDWR);
	if (fd == -1)
		return -1;

	void *ptr = mmap(NULL, 64 * 1024, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (ptr == MAP_FAILED) {
		close(fd);
		return -1;
	}

	/* volatile so the compiler actually emits every store to the BAR */
	volatile uint32_t *uptr = ptr;
	for (j = 0; j < 1024 * 1024; j++)
		for (i = 0; i < 16 * 1024; i++)
			uptr[i] = 0;

	munmap(ptr, 64 * 1024);
	close(fd);
	return 0;
}
/*
 * burn-cycles.c: repeatedly write a 4 MiB array (well beyond L1/L2, so
 * the loop is sensitive to what the rest of the memory system is doing;
 * per the quoted report, the slowdown disappears once the data set fits
 * in L1). The array lives on the stack, which fits under the usual
 * 8 MiB default limit.
 */
#include <stdio.h>
#include <stdint.h>

#define SIZE (1024 * 1024)

int main(void)
{
	volatile int i, j;
	volatile uint32_t x[SIZE];

	for (j = 0; j < 1000; j++)
		for (i = 0; i < SIZE; i++)
			x[i] = 1;

	return 0;
}