On 19 July 2017 at 05:57, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Tue, Jul 18, 2017 at 7:34 AM, Peter Jones <pjones@xxxxxxxxxx> wrote:
>>
>> Well, that's kind of amazing, given 3c004b4f7eab239e switched us /to/
>> using ioremap_wc() for the exact same reason.  I'm not against letting
>> the user force one way or the other if it helps, though it sure would
>> be nice to know why.
>
> It's kind of amazing for another reason too: how is ioremap_wc()
> _possibly_ slower than ioremap_nocache() (which is what plain
> ioremap() is)?

In normal operation the console is faster with _wc. It's the side
effects on other cores that are the problem.

> Or maybe it really is something where there is one global write queue
> per die (not per CPU), and having that write queue "active" doing
> combining will slow down every core due to some crazy synchronization
> issue?
>
> x86 people, look at what Dave Airlie did, I'll just repeat it because
> it sounds so crazy:
>
>> A customer noticed major slowdowns while logging to the console
>> with write combining enabled, on other tasks running on the same
>> CPU (10x or greater slowdown on all other cores on the same CPU
>> as is doing the logging).
>>
>> I reproduced this on a machine with dual CPUs:
>> Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 core)
>>
>> I wrote a test that just mmaps the PCI BAR and writes to it in
>> a loop. While this was running in the background on a single
>> core (taskset -c 1), building a kernel up to init/version.o
>> (taskset -c 8) went from 13s to 133s or so. I've yet to explain
>> why this occurs or what is going wrong; I haven't managed to find
>> a perf command that gives any insight into this.
>
> So basically the UC vs WC thing seems to slow down somebody *else* (in
> this case a kernel compile) on another core entirely, by a factor of
> 10x. Maybe the WC writer itself is much faster, but _others_ are
> slowed down enormously.
>
> Whaa? That just seems incredible.

Yes, I've been staring at this for a while now trying to narrow it
down. I've been a bit slow on testing it on a wider range of Intel
CPUs; I've only really managed to play with that particular machine.
I've attached two test files. Compile both of them (I just used
"make write_resource burn-cycles"). On my test machine, cores 1 and 8
are on the same die.

  time taskset -c 1 ./burn-cycles
takes about 6 seconds.

With
  taskset -c 8 ./write_resource wc
running in the background,
  time taskset -c 1 ./burn-cycles
takes about 1 minute.

I've also noticed that passing "wc" to write_resource (i.e. mapping
resource0_wc rather than resource0) doesn't seem to make a difference,
so I think what matters is that efifb has already mapped the area with
_wc and set PAT on it for write-combining, so we always get WC on that
BAR.

From the other person seeing it:

"I did a similar test some time ago; the result was the same. I ran
some benchmarks, and it seems that when the data set fits in L1 cache
there is no significant performance degradation."

Dave.
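For reference, a minimal sketch of the kind of user-forcible UC-vs-WC
toggle Peter mentions could look like the following. This is purely
illustrative and is not the efifb code or the patch under discussion;
the "nowc" parameter name and the map_fb() helper are invented for the
example, and only ioremap_wc()/ioremap_nocache() are real kernel
interfaces:

#include <linux/module.h>
#include <linux/io.h>

/* Hypothetical module parameter to force an uncached framebuffer mapping. */
static bool nowc;
module_param(nowc, bool, 0444);

static void __iomem *map_fb(resource_size_t base, unsigned long size)
{
	/*
	 * ioremap_wc() gives a write-combining (WC) mapping;
	 * ioremap_nocache() -- what plain ioremap() is on x86 -- is
	 * uncached (UC).
	 */
	if (nowc)
		return ioremap_nocache(base, size);
	return ioremap_wc(base, size);
}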
/*
 * write_resource.c: mmap a PCI BAR via sysfs (the _wc resource if "wc"
 * is passed on the command line, otherwise the plain/UC resource) and
 * hammer it with 32-bit stores in a loop.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
	int i, j;
	char *resname;

	if (argc > 1 && !strcmp(argv[1], "wc"))
		resname = "/sys/bus/pci/devices/0000:01:00.1/resource0_wc";
	else
		resname = "/sys/bus/pci/devices/0000:01:00.1/resource0";

	int fd = open(resname, O_RDWR);
	if (fd == -1)
		return -1;

	void *ptr = mmap(NULL, 64 * 1024, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (ptr == MAP_FAILED) {
		close(fd);
		return -1;
	}

	/* volatile so the compiler actually emits every store to the BAR */
	volatile uint32_t *uptr = ptr;
	for (j = 0; j < 1024 * 1024; j++)
		for (i = 0; i < 16 * 1024; i++)
			uptr[i] = 0;

	munmap(ptr, 64 * 1024);
	close(fd);
	return 0;
}
/*
 * burn-cycles.c: repeatedly write a 4 MiB array (well beyond L1/L2, so
 * the loop is sensitive to what the rest of the memory system is doing;
 * per the quoted report, the slowdown disappears once the data set fits
 * in L1). The array lives on the stack, which fits under the usual
 * 8 MiB default limit.
 */
#include <stdio.h>
#include <stdint.h>

#define SIZE (1024 * 1024)

int main(void)
{
	volatile int i, j;
	volatile uint32_t x[SIZE];

	for (j = 0; j < 1000; j++)
		for (i = 0; i < SIZE; i++)
			x[i] = 1;

	return 0;
}