On Fri, Jun 15, 2018 at 01:07:32AM +0200, Jirka Hladky wrote:
> >
> > In terms of the speed of migration, it may be worth checking how often the
> > mm_numa_migrate_ratelimit tracepoint is triggered with bonus points for
> > using the nr_pages to calculate how many pages get throttled from
> > migrating. If it's high frequency then you could test increasing
> > ratelimit_pages (which is set at compile time despite not being a macro).
> > It still may not work for tasks that are too short-lived to have enough
> > time to identify a misplacement and migration.
> >
>
> I have done testing on 2 NUMA and 4 NUMA servers, all equipped with the
> same CPUs ( Gold 6126) with 48 and 96 cores respectively.
>
> I have used ft.C.x and ft.D.x tests with 20 threads on 2 NUMA box and 32
> threads on 4 NUMA box. (This is where I see the biggest perf. drop between
> 4.16 and 4.17 kernels). While ft.C is a short-lived test (it takes few
> seconds to finish), ft.D is a long test with runtime over 3 minutes with 20
> threads and 4.5 minutes with 20 threads.
>

Understood.

> I have used this command to run the test:
>
> OMP_NUM_THREADS=${THREADS} trace-cmd record -e
> migrate:mm_numa_migrate_ratelimit -o
> ${DIR}/${BIN}_${THREADS}_threads_with_trace.trace.dat ./${BIN}
>

Ok, the fact you're using OpenMP instead of MPI is an important detail.
OpenMP threads inherit the numa_preferred_nid from their parent, while MPI
ranks are usually processes and do not inherit the preferred nid. Threads
also share the page tables, so even though there is a preferred nid, they
also potentially handle NUMA hinting faults. This has an important impact
on what the hints look like if there is a window before a thread gets
migrated to another socket.

> I can see that 2c83362734dad8e48ccc0710b5cd2436a0323893 has caused big
> increase in number of mm_numa_migrate_ratelimit events.
>

That implies the threads are getting throttled and, for NAS at least,
indicates why migration is slow. It doesn't apply to stream.
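As a rough way to put numbers on that throttling, the nr_pages field of the
recorded events can be summed. The sketch below is only an illustration and
rests on an assumption: that each event line in the text output of
`trace-cmd report` carries `dst_nid=<n> nr_pages=<n>` fields after the event
name. The function name `summarize_ratelimit` is made up for this example;
check the actual event format on the kernel under test before relying on it.

```python
import re
from collections import defaultdict

# Assumed shape of one event line in `trace-cmd report` text output, e.g.
#   ft.D.x-1234 [003] 100.000001: mm_numa_migrate_ratelimit: comm=ft.D.x pid=1234 dst_nid=1 nr_pages=512
# The field names are an assumption; adjust the pattern if your kernel differs.
EVENT = re.compile(
    r"mm_numa_migrate_ratelimit:.*?\bdst_nid=(?P<nid>\d+)\s+nr_pages=(?P<pages>\d+)"
)


def summarize_ratelimit(lines):
    """Return (event_count, total_pages, pages_per_dst_nid) for report lines.

    Lines that do not look like mm_numa_migrate_ratelimit events are skipped.
    """
    events = 0
    total_pages = 0
    per_nid = defaultdict(int)
    for line in lines:
        match = EVENT.search(line)
        if match is None:
            continue
        pages = int(match.group("pages"))
        events += 1
        total_pages += pages
        per_nid[int(match.group("nid"))] += pages
    return events, total_pages, dict(per_nid)
```

One possible use is to dump the recorded trace to text with
`trace-cmd report -i <file>`, pass the resulting lines to
`summarize_ratelimit()`, and compare the totals between kernels for the same
benchmark and thread count.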
> I have tested following 3 kernels: 4.16, 4.16_p1
> (2c83362734dad8e48ccc0710b5cd2436a0323893) and 4.16_p2 (4.16_p1 + 2 patched
> from Srikar Dronamra).
>
> There is clear performance drop going from 4.16 to 4.16_p1. 4.16_p2 shows a
> small improvement over 4.16_p1 for ft.C but additional perf. drop for ft.D
> on 4 NUMA node server.
>

Ok, so as expected a higher scan rate is not necessarily a good thing.
I've observed before that it often simply increases system CPU usage
without any improvement in locality.

> I think you have mentioned that you are using NAS benchmark but you don't
> see the regression.

Correct.

> I do wonder if you run NAS with the number of
> threads being roughly 1/3 of the available cores - this is the scenario
> where I consistently see big perf. drop caused by
> 2c83362734dad8e48ccc0710b5cd2436a0323893.
>

It's possible. Until relatively recently, the NAS configurations used as
many CPUs as possible, rounded down to a power-of-two or square number
where required if MPI was in use. Because saturating the machine alters
how MPI behaves (and is not great for OpenMP either), I added
configurations that used half of the CPUs. However, that would mean the
workload fits too nicely within sockets. I've added another set for one
third of the CPUs and scheduled the tests. Unfortunately, they will not
complete quickly as my test grid has a massive backlog of work.

> Results are bellow:
>

Nice one, thanks. It's fairly clear that rate limiting may be a major
component and it's worth testing with the ratelimit increased. Given that
there have been a lot of improvements on locality and corner cases since
the rate limit was first introduced, it may also be worth considering
eliminating the rate limiting entirely and seeing what falls out.

-- 
Mel Gorman
SUSE Labs