Hello,

On Mon, Jul 11, 2016 at 01:32:11PM -0400, Waiman Long wrote:
> The percpu APIs are extensively used in the Linux kernel to reduce
> cacheline contention and improve performance. For some use cases, the
> percpu APIs may be too fine-grain for distributed resources whereas
> a per-node based allocation may be too coarse as we can have dozens
> of CPUs in a NUMA node in some high-end systems.
>
> This patch introduces a simple per-subnode APIs where each of the
> distributed resources will be shared by only a handful of CPUs within
> a NUMA node. The per-subnode APIs are built on top of the percpu APIs
> and hence requires the same amount of memory as if the percpu APIs
> are used. However, it helps to reduce the total number of separate
> resources that needed to be managed. As a result, it can speed up code
> that need to iterate all the resources compared with using the percpu
> APIs. Cacheline contention, however, will increases slightly as each
> resource is shared by more than one CPU. As long as the number of CPUs
> in each subnode is small, the performance impact won't be significant.
>
> In this patch, at most 2 sibling groups can be put into a subnode. For
> an x86-64 CPU, at most 4 CPUs will be in a subnode when HT is enabled
> and 2 when it is not.

I understand that there's a trade-off between local access and global
traversing and you're trying to find a sweet spot between the two, but
this seems pretty arbitrary. What's the use case? What are the
numbers? Why are global traversals often enough to matter so much?

Thanks.

--
tejun