I implemented a new FPU context saving/restoring patch, as previously
suggested by Kevin and Ralf.

The major change is that we now save the FPU context when we switch out a
process, if necessary. The goal is to guarantee that an off-line process
always has its FPU context saved in memory and is thus free to move to
another CPU in an SMP system.

The initial experimental patch can be found at the following URL. It is a
quick hack to study the performance impact. It should be further optimized.
It also needs to be extended so that it works for all CPUs (including the
ones without an FPU) and becomes truly SMP-safe (getting rid of the global
variable last_task_used_math).

http://linux.junsun.net/patches/oss.sgi.com/experiemental/020304-new-fpu-context-switch/patch

Here is the pseudo code version of the patch:

do_cpu()
{
	if (current->used_math) {		/* Using the FPU again.  */
-		lazy_fpu_switch(last_task_used_math);
+		restore_fp(current);
		/* we don't need to save for the current proc */
	} else {				/* First time FPU user.  */

r4xx0_resume()
	save non_scratch registers
+	if (current proc owns FPU) {	/* it used FPU in the current run */
+		make it turn off FPU for next run
+		save FPU context to current proc
+		(note we leave last_task_used_math alone)
	....

I ran lmbench to compare the performance on a UP system (NEC VR5500). See
the output at the following URL. "orig" is the unpatched kernel.

http://linux.junsun.net/patches/oss.sgi.com/experiemental/020304-new-fpu-context-switch/performance

There is clearly not much performance difference. And this is not a
surprise.

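To make the pseudo code above easier to reason about, here is a small
user-space model of the policy. It is only a sketch, not code from the
patch: the names owns_fpu, saves, restores, fpu_trap(), switch_out() and
run_slice() are made up for illustration, the counters stand in for the
real save/restore routines, and the "skip the restore when
last_task_used_math == current" check is assumed to behave as described in
this mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified task state; not the kernel's task_struct. */
struct task {
	bool used_math;	/* task has used the FPU at least once */
	bool owns_fpu;	/* FPU is enabled for the task in the current run */
	int saves;	/* how many times the FPU context was saved */
	int restores;	/* how many times the FPU context was restored */
};

/* Task whose FPU state currently sits in the hardware registers. */
static struct task *last_task_used_math;

/* Models do_cpu(): the task touched the FPU while it was turned off. */
static void fpu_trap(struct task *cur)
{
	assert(cur != NULL);
	if (cur->used_math) {
		/* Restore only if another task used the hardware since. */
		if (last_task_used_math != cur)
			cur->restores++;
	} else {
		cur->used_math = true;	/* first time FPU user */
	}
	cur->owns_fpu = true;
	last_task_used_math = cur;
}

/* Models the switch-out half of resume(). */
static void switch_out(struct task *prev)
{
	if (prev->owns_fpu) {
		prev->owns_fpu = false;	/* turn the FPU off for the next run */
		prev->saves++;		/* save the FPU context to the task */
		/* last_task_used_math is deliberately left alone */
	}
}

/* One scheduling run: optionally use the FPU, then get switched out. */
static void run_slice(struct task *t, bool uses_fpu)
{
	if (uses_fpu && !t->owns_fpu)
		fpu_trap(t);
	switch_out(t);
}
```

In this model, a task that never touches the FPU during a run pays only
for the owns_fpu test in switch_out(), while a task that uses the FPU in
every run is saved once per switch-out, whether or not anybody else wants
the FPU in between.
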
A couple of attributes of the patch:

1) It does not save the FPU context if the proc did not use the FPU in the
   current run.

2) When the proc uses the FPU again in its next run, we don't have to
   restore the FPU context if the hardware context has not been used by
   another proc yet (i.e., last_task_used_math == current).

So:

1) If no processes are actively using the FPU, we don't see much overhead
   other than a couple of load/branch instructions in resume().

2) If most processes are actively using the FPU, then we see the same
   overhead as before. The saving of FPU context is necessary in this
   scenario, whether it is done in resume() (as in the patch) or a little
   later in lazy_fpu_switch() as in the current kernel.

3) The only pathological case which would make the patch look bad is when
   you have a process that actively uses the FPU and frequently switches
   context with non-FPU-using processes. In this case, saving the FPU
   context each time the FPU-using proc is switched out is pure overhead,
   because no other proc needs the FPU in between. If the FPU-using
   process runs through a full time slice each time, the overhead is very
   small percentage-wise. It is frequent context switching in this case
   that would really hurt.

I am interested in testing any benchmarks that would create case 3).
Please let me know if you know of any.

So much for rambling.

Jun