On Wed, Feb 20, 2002 at 11:14:02AM +0100, Kevin D. Kissell wrote:

> One cannot make design decisions based on what one
> "thinks is pretty common". Binding threads to CPUs
> (CPU affinity) is almost always more efficient when
> the behavior of the workload looks like batch FORTRAN
> processing. It's when one gets a mix of computational
> and interactive jobs that it often creates unfortunate
> artifacts, and thus must be handled with care.

Today's CPU performance is mainly dictated by exploiting caches as well
as possible. That means timeslices should be as long as possible. At
the same time we have the conflicting issue of scheduling latency. The
Linux scheduler already contains some heuristics that try to find a
sweet spot between those two.

> > Not true. For instance, on a processor with hardware FPU, setjmp()
> > will save FPU registers. That means most processes will actually end
> > up taking the FPU at least once.
>
> Almost all MIPS/Linux threads, from init() onward, have FPU state,
> due to setjmp(), printf() (which uses the FP registers even
> if one does not specify a floating point data item or format), etc.

Printf doesn't ever use floating point due to possible rounding errors.

> Has anyone ever measured the performance impact of
> lazy FPU context switching on MIPS? It's one of those
> ideas that was trendy in the 1980's, but I recall that when
> we implemented it for SVR2 on the Fairchild Clipper
> (which had only 16 FP registers), the measured improvement
> on average context switch time was tiny - a percent or so.
> We left it in, because it worked and it *was* an improvement,
> but we would never have gone through the hassle had we
> known how little it would buy us.

These days I assume the difference to be greater for cache reasons. Our
stored fp registers take 256 bytes and also tend to be located at a
constant offset from the start of the 8kB (64-bit: 16kB) aligned
task_struct.
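
To make the constant-offset point concrete, here is a minimal sketch in C.
It assumes a direct-mapped 8kB L1 data cache with 32-byte lines; the
FPR_OFFSET value and all function names are hypothetical and not taken
from the kernel sources. It only shows that the saved FP registers of
any two 8kB-aligned task_structs would land on the same cache lines.

```c
#include <assert.h>
#include <stdint.h>

/* All values below are illustrative assumptions, not kernel constants. */
#define CACHE_SIZE  8192u  /* assumed direct-mapped L1 data cache size */
#define LINE_SIZE   32u    /* assumed cache line size in bytes */
#define TASK_ALIGN  8192u  /* 8kB alignment of task_struct (32-bit kernel) */
#define FPR_OFFSET  0x300u /* hypothetical offset of the saved FPRs */
#define FPR_BYTES   256u   /* size of the saved FP register block */

/* Cache line index an address maps to in a direct-mapped cache. */
static unsigned cache_line_index(uint64_t addr)
{
	return (unsigned)((addr % CACHE_SIZE) / LINE_SIZE);
}

/* First cache line index touched by the saved FPRs of the task at `task`. */
static unsigned fpr_first_line(uint64_t task)
{
	assert(task % TASK_ALIGN == 0); /* task_struct must be aligned */
	return cache_line_index(task + FPR_OFFSET);
}

/* Number of cache lines the saved FPR block spans (offset is line-aligned). */
static unsigned fpr_line_count(void)
{
	return FPR_BYTES / LINE_SIZE;
}
```

Under these assumptions fpr_first_line() returns the same index for every
aligned task address, so each switch between two tasks with FP state would
evict the previous task's lines.
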
Combined with the usually low degree of cache associativity on MIPS,
that means we'll frequently miss L1. And many MIPS systems still don't
come with L2 caches, so fiddling with anything stored in the task_struct
may easily become quite expensive. In fact, on the worst-case CPU, the
R4000PC, context switching the fprs will result in guaranteed worst-case
performance: we'll *always* have to write back / refill the affected
cache lines from memory.

In this context I should also note that the FP context the 32-bit
kernel stores provides space for 32 double precision registers. We only
use the 16/32 register model, so we will pump twice as many cachelines
as necessary over the memory bus at postcard speed ...

Btw, is the Fairchild Clipper the same Clipper that was used by
Intergraph?

> It occurs to me that we can to some degree "split
> the difference" on FPU context management for
> SMP if we *always* save the FPU state when a
> thread switches out, but preserve the logic that
> schedules threads with CU1 inhibited so that the
> context is only *loaded* if the thread executes
> FP instructions. That would save about half of
> the context switch overhead for non-FP-intensive
> threads, while eliminating the migration problem.

As I also suggested in my other mail. Guess we got a winner.

  Ralf
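
For illustration, the save-on-switch-out / load-on-first-use scheme
discussed above can be modelled in a few lines of C. This is a toy
simulation, not kernel code: every struct and function name is
hypothetical, the CU1 bit is modelled as a plain flag, and "always save"
is interpreted as "save whenever the thread had the FPU enabled during
its timeslice", which is an assumption on my part.

```c
#include <assert.h>
#include <string.h>

#define NUM_FPRS 32

struct thread {
	double fpr_save[NUM_FPRS];  /* FP context saved in memory */
};

struct cpu {
	double fpr[NUM_FPRS];       /* live FPU registers of this CPU */
	int cu1_enabled;            /* models the CU1 "FPU usable" bit */
	struct thread *current;     /* thread currently running here */
	unsigned saves, loads;      /* counters, for illustration only */
};

/* Switch out: save eagerly if the thread used the FPU in this slice. */
static void switch_out(struct cpu *c)
{
	if (c->current && c->cu1_enabled) {
		memcpy(c->current->fpr_save, c->fpr, sizeof(c->fpr));
		c->saves++;
	}
	c->cu1_enabled = 0;
	c->current = NULL;
}

/* Switch in: run the thread with CU1 inhibited; nothing is loaded yet. */
static void switch_in(struct cpu *c, struct thread *t)
{
	c->current = t;
	c->cu1_enabled = 0;
}

/* Models the coprocessor-unusable trap: load the context on demand. */
static void fp_trap(struct cpu *c)
{
	memcpy(c->fpr, c->current->fpr_save, sizeof(c->fpr));
	c->cu1_enabled = 1;
	c->loads++;
}

/* Models an FP instruction writing value v to register r. */
static void fp_write(struct cpu *c, int r, double v)
{
	if (!c->cu1_enabled)
		fp_trap(c);
	c->fpr[r] = v;
}

/* Models an FP instruction reading register r. */
static double fp_read(struct cpu *c, int r)
{
	if (!c->cu1_enabled)
		fp_trap(c);
	return c->fpr[r];
}
```

Because the state is always back in memory once a thread has been
switched out, the thread can be switched in on a different struct cpu
and will fault its context in there. A thread that never calls
fp_read() or fp_write() costs neither a save nor a load.
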