We've re-evaluated the need for a patch to support some sort of finer-grained control over THP and, based on tests performed by our benchmarking team, we'd definitely still like to implement some method to support this. Here's an e-mail from John Baron (jbaron@xxxxxxx), on our benchmarking team, containing data that shows a decrease in performance for some SPEC OMP benchmarks when THP is enabled:

> Here are results for SPEC OMP benchmarks on UV2 using 512 threads / 64
> sockets.  These show the performance ratio for jobs run with THP
> disabled versus THP enabled (so > 1.0 means THP disabled is faster).
> One possible reason for lower performance with THP enabled is that the
> larger page granularity can result in more remote data accesses.
>
> SPEC OMP2012:
>
>   350.md          1.0
>   351.bwaves      1.3
>   352.nab         1.0
>   357.bt331       0.9
>   358.botsalgn    1.0
>   359.botsspar    1.1
>   360.ilbdc       1.8
>   362.fma3d       1.0
>   363.swim        1.4
>   367.imagick     0.9
>   370.mgrid331    1.1
>   371.applu331    0.9
>   372.smithwa     1.0
>   376.kdtree      1.0
>
> SPEC OMPL2001:
>
>   311.wupwise_l   1.1
>   313.swim_l      1.5
>   315.mgrid_l     1.0
>   317.applu_l     1.1
>   321.equake_l    5.8
>   325.apsi_l      1.5
>   327.gafort_l    1.0
>   329.fma3d_l     1.0
>   331.art_l       0.8
>
> One could argue that real-world applications could be modified to avoid
> these kinds of effects, but (a) it is not always possible to modify code
> (e.g. in benchmark situations) and (b) even if it is possible to do so,
> it is not necessarily easy to do so (e.g. for customers with large
> legacy Fortran codes).
>
> We have also observed on Intel Sandy Bridge processors that, as
> counter-intuitive as it may seem, local memory bandwidth is actually
> slightly lower with THP enabled (1-2%), even with unit-stride data
> accesses.  This may not seem like much of a performance hit, but it is
> important for HPC workloads.  No code modification will help here.

In light of the previous issues discussed in this thread, and a suggestion from David Rientjes:

> why not make it per-process so users don't have to configure
> cpusets to control it?

Robin and I have come up with a proposal for a way to replicate behavior similar to what this patch introduced, only at the per-process level instead of at the cpuset level.

Our idea is to add a flag somewhere in the task_struct to track whether or not THP is enabled for each task. The flag would be controlled by a new option to prctl(), allowing programmers to set/clear it via the prctl() syscall. We would also add code to the clone() syscall to ensure that this flag is copied down from parent to child tasks when necessary. The flag would be checked in the same place the per-cpuset flag was checked in my original patch, thereby allowing the same behavior to be replicated on a per-process level.

In this way, we will also be able to get static binaries to behave appropriately by setting this flag in a userland program, and then having that program exec the static binary for which we need to disable THP (a rough sketch of such a wrapper is included below my sign-off).

This solution allows us to incorporate the behavior that we're looking for into the kernel, without abusing cpusets for the purpose of containerization.

Please let me know if anyone has any objections to this approach, or if you have any suggestions for how we could improve on this idea.

Thanks!

- Alex
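For illustration, here is a minimal sketch of the userland wrapper idea described above. PR_SET_THP_DISABLE is only a placeholder name (and value) for the proposed prctl() option; the real name and number would be whatever the eventual patch defines:

/*
 * no-thp: sketch of a wrapper that sets the proposed per-process
 * "disable THP" flag via prctl() and then execs the target binary,
 * so that even a static binary runs with THP disabled.
 *
 * PR_SET_THP_DISABLE and its value are placeholders for illustration.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>

#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41	/* placeholder value */
#endif

int main(int argc, char *argv[])
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
		return 1;
	}

	/* Set the proposed per-task flag; the intent is that it survives exec. */
	if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) < 0) {
		perror("prctl(PR_SET_THP_DISABLE)");
		return 1;
	}

	/* Now exec the target with THP already disabled for this task. */
	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}

The wrapper would be invoked as, e.g., "./no-thp ./some_static_binary args", and any children forked by the target would pick up the flag via the clone() handling described above.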