On Sun, 16 Apr 2006, Nick Piggin wrote:

> Steven Rostedt wrote:
> >
> > It's not just about saving memory, but also to make it more robust. But
> > that's another story.
>
> But making it slower isn't going to be popular.

You're right and I've been thinking of modifications to fix that. These
patches were to shake up ideas.

>
> Why is your module using so much per-cpu memory, anyway?

Wasn't my module anyway. The problem appeared in the -rt patch set, when
tracing was turned on. Some module was affected, and grew its per_cpu
size by quite a bit. In fact we had to increase PERCPU_ENOUGH_ROOM by up
to something like 300K.

> >
> > Since both the offset array, and the variables are mainly read only (only
> > written on boot up), added the fact that the added variables are in their
> > own section. Couldn't something be done to help pre load this in a local
> > cache, or something similar?
>
> It would still add to the dependent loads on the critical path, so
> it now prevents the compiler/programmer/oooe engine from speculatively
> loading the __per_cpu_offset.
>
> And it does increase cache footprint of per-cpu accesses, which are
> supposed to be really light and substitute for [NR_CPUS] arrays.
>
> I don't think it would have been hard for the original author to make
> it robust... just not both fast and robust. PERCPU_ENOUGH_ROOM seems
> like an ugly hack at first glance, but I'm fairly sure it was a result
> of design choices.
>

Yeah, and I discovered the reasons for those choices as I worked on this.
I've put a little more thought into this and still think there's a
solution that doesn't slow things down.

Since the per_cpu_offset section is still smaller than
PERCPU_ENOUGH_ROOM, and robust, I could still copy it into a per cpu
memory field, and even add the __per_cpu_offset to it. This would still
save quite a bit of space.

So now I'm asking for advice on some ideas that could be a workaround to
keep both the robustness and the speed.

Is there a way (for archs that support it) to allocate memory in a per cpu
manner? Each CPU would then have its own variable table in the memory
that is best for it. Then have a field (like the pda in x86_64) point to
this section, and use the linker offsets to index and find the per_cpu
variables.

So this solution still has one more redirection than the current solution
(per_cpu_offset__##var -> __per_cpu_offset -> actual_var, whereas the
current solution is __per_cpu_offset -> actual_var), but all the loads
would be done from memory that is dedicated to a particular CPU.

The generic case would still be the same as the patches I already sent,
but the archs that can support it can have something like the above.

Would something like that be acceptable?

Thanks,

-- Steve
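
To make the lookup chain described above concrete, here is a rough
user-space sketch in C. It is not the code from the patches, only an
illustration under stated assumptions: struct cpu_area is a made-up
stand-in for something like the x86_64 pda, the VAR_* indices stand in
for linker offsets, NR_CPUS is fixed at 4, and plain calloc() stands in
for whatever CPU-local allocator an arch might provide. The helper names
setup_cpu() and cpu_var() are hypothetical too.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 4	/* assumed small fixed CPU count for the sketch */

/* Hypothetical per-CPU variables; the index stands in for a linker offset. */
enum { VAR_COUNTER, VAR_TICKS, NR_VARS };

/* Hypothetical stand-in for a per-CPU anchor such as the x86_64 pda. */
struct cpu_area {
	long **var_table;	/* this CPU's table of pointers to its variables */
};

static struct cpu_area cpu_area[NR_CPUS];

/*
 * Build the variable table for one CPU. A real implementation would use
 * memory local to that CPU; calloc() is only a placeholder here.
 * Returns 0 on success, -1 on allocation failure.
 */
static int setup_cpu(int cpu)
{
	long **table;
	int i;

	assert(cpu >= 0 && cpu < NR_CPUS);

	table = calloc(NR_VARS, sizeof(*table));
	if (!table)
		return -1;

	for (i = 0; i < NR_VARS; i++) {
		table[i] = calloc(1, sizeof(**table));
		if (!table[i]) {
			/* Undo the allocations made so far. */
			while (i--)
				free(table[i]);
			free(table);
			return -1;
		}
	}

	cpu_area[cpu].var_table = table;
	return 0;
}

/*
 * Look up one variable for one CPU:
 *   cpu_area[cpu].var_table  (load 1, from the CPU's own area)
 *   table[idx]               (load 2, from the CPU's own table)
 * The caller's dereference of the result is the access to the variable.
 */
static long *cpu_var(int cpu, int idx)
{
	long **table;

	assert(cpu >= 0 && cpu < NR_CPUS);
	assert(idx >= 0 && idx < NR_VARS);

	table = cpu_area[cpu].var_table;
	return table[idx];
}
```

After setup_cpu(0) and setup_cpu(1), a write through
cpu_var(0, VAR_COUNTER) leaves the value behind cpu_var(1, VAR_COUNTER)
untouched, since each CPU has its own table and its own copies. The two
dependent loads are still there, but in this layout both come from memory
that belongs to the CPU doing the access.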