On Sat, 7 Dec 2019 at 23:48, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > On Fri, Dec 6, 2019 at 7:50 PM Linus Torvalds > <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > > > The "make goes slow" problem bisects down to b667b8673443 ("pipe: > > Advance tail pointer inside of wait spinlock in pipe_read()"). > > I'm not entirely sure that ends up being 100% true. It did bisect to > that, but the behavior wasn't entirely stable. There definitely is > some nasty timing trigger. > > But I did finally figure out what seems to have been going on with at > least the biggest part of the build performance regression. It's seems > to be a nasty interaction with the scheduler and the GNU make > jobserver, and in particular the pipe wakeups really _really_ do seem > to want to be synchronous both for the readers and the writers. > > When a writer wakes up a reader, we want the reader to react quickly > and vice versa. The most obvious case was for the GNU make jobserver, > where sub-makes would do a single-byte write to the jobserver pipe, > and we want to wake up the reader *immediatly*, because the reader is > actually a lot more important than the writer. The reader is what gets > the next job going, the writer just got done with the last one. > > And when a reader empties a full pipe, it's because the writer is > generating data, and you want to just get the writer going again asap. > > Anyway, I've spent way too much time looking at this and wondering > about odd performance patterns. It seems to be mostly back up to > normal. > > I say "mostly", because I still see times of "not as many concurrent > compiles going as I'd expect". It might be a kbuild problem, it might > be an issue with GNU make (I've seen problems with the make jobserver > wanting many more tokens than expected before and the kernel makefiles > - it migth be about deep subdirectories etc), and it might be some > remaining pipe issue. But my allmodconfig builds aren't _enormously_ > slower than they used to be. > > But there's definitely some unhappy interaction with the jobserver. I > have 16 threads (8 cores with HT), and I generally use "make -j32" to > keep them busy because the jobserver isn't great. The pipe rework made > even that 2x slop not work all that well. Something held on to tokens > too long, and there was definitely some interaction with the pipe > wakeup code. Using "-j64" hid the problem, but it was a problem. > > It might be the new scheduler balancing changes that are interacting > with the pipe thing. I'm adding PeterZ, Ingo and Vincent to the cc, > because I hadn't realized just how important the sync wakeup seems to > be for pipe performance even at a big level. Which version of make should I use to reproduce the problem ? My setup is not the same and my make is a bit old but I haven't been able to reproduce the problem described above on my arm64 octa cores system and v5.5-rc1. All cores are busy with -j16. And even -j8 keeps the cores almost always busy > > I've pushed out my pipe changes. I really didn't want to do that kind > of stuff at the end of the merge window, but I spent a lot more time > than I wanted looking at this code, because I was getting to the point > where the alternative was to just revert it all. > > DavidH, give these a look: > > 85190d15f4ea pipe: don't use 'pipe_wait() for basic pipe IO > a28c8b9db8a1 pipe: remove 'waiting_writers' merging logic > f467a6a66419 pipe: fix and clarify pipe read wakeup logic > 1b6b26ae7053 pipe: fix and clarify pipe write wakeup logic > ad910e36da4c pipe: fix poll/select race introduced by the pipe rework > > the top two of which are purely "I'm fed up looking at this code, this > needs to go" kind of changes. > > In particular, that last change is because I think the GNU jobserver > problem is partly a thundering herd issue: when a job token becomes > free (ie somebody does a one-byte write to an empty jobserver pipe), > it wakes up *everybody* who is waiting for a token. One of them will > get it, and the others will go to sleep again. And then it repeats all > over. I didn't fix it, but it _could_ be fixed with exclusive waits > for readers/writers, but that means more smarts than pipe_wait() can > do. And because the jobserver isn't great at keeping everybody happy, > I'm using a much bigger "make -jX" value than the number of CPU's I > have, which makes the herd bigger. And I suspect none of this helps > the scheduler pick the _right_ process to run, which just makes > scheduling an even bigger problem. > > Linus