Re: rt-tests: cyclictest: Add option to specify main pid affinity

Hi Ahmed,


On 3/24/21 10:32 AM, Ahmed S. Darwish wrote:
> Hi Jonathan,
>
> Since I'm doing some CAT-related stuff on RT tasks vs. GPU workloads,
> I'm curious, how much was the benefit of CAT ON/OFF?

Since you mention CAT, I'm assuming you're testing iGPU workloads and not a dedicated GPU. Or is there a benefit to using CAT with a dedicated GPU as well?


> In your benchmarks you show that the combination of --mainaffinity, CPU
> isolation, and CAT improves worst-case latency by 2 microseconds. If
> you keep everything as-is, but disable only CAT, how much change happens
> in the results?

First, I'd like to mention that my test system had an inclusive cache architecture. I'd guess that the difference between CAT and no CAT is smaller for exclusive or non-inclusive caches (assuming cyclictest runs on an isolated CPU).

So the results will depend on the number of isolated CPUs and on how much of the shared L3 cache the load on the housekeeping CPUs uses.
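For context, the kind of invocation under discussion might look roughly like the sketch below. The exact `--mainaffinity` syntax follows the rt-tests patch being discussed, and the CPU numbers are placeholders matching my isolation setup, not a recommendation:

```shell
# Sketch: pin the measurement threads to the isolated CPUs and the main
# (bookkeeping) thread to a housekeeping CPU, so the main thread's
# periodic work doesn't disturb the isolated CPUs.
cyclictest --mlockall --priority=99 --interval=200 \
    --affinity=1,3,5,7,9,11 --mainaffinity=0 \
    --duration=24h --histogram=200
```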

Rendered Markdown: https://gist.github.com/jschwe/3502dbf1e56c85e9bf1a340041885b33

# Isolation capabilities without CAT

## Test 2021-01-31 - Isolate all CPUs on NUMA node 1

The figure below shows a worst-case latency of 4 microseconds
measured by cyclictest on the isolated CPUs on NUMA node 1.

cmdline: `nosmt isolcpus=domain,managed_irq,wq,rcu,misc,kthread,1,3,5,7,9,11 rcu_nocbs=1,3,5,7,9,11 irqaffinity=0,2,4 maxcpus=12 rcu_nocb_poll nowatchdog tsc=nowatchdog processor.max_cstate=1 intel_idle.max_cstate=0`

Test parameters: `sudo taskset -c 0-11 rteval --duration=24h --loads-cpulist=0,2,4,6,8,10 --measurement-cpulist=0-11`

![Figure: Latency of completely isolated node vs housekeeping node](https://gist.githubusercontent.com/jschwe/3502dbf1e56c85e9bf1a340041885b33/raw/962244e4e5309507feb0b4ec0627efbabe064c85/2021-01-31.png)


## Test 2021-02-01 - Isolate only CPU 11

The figure below shows a worst-case latency of 11 microseconds for the isolated CPU 11. Interestingly, the worst-case latencies of the housekeeping CPUs also increased compared to the previous test.
This is consistent with my other tests, though: the worst-case latency of the housekeeping CPUs drops
if I isolate all (or all but one) CPUs on node 1.

cmdline: `nosmt isolcpus=domain,managed_irq,wq,rcu,misc,kthread,11 rcu_nocbs=11 irqaffinity=0,2,4 maxcpus=12 rcu_nocb_poll nowatchdog tsc=nowatchdog processor.max_cstate=1 intel_idle.max_cstate=0`

Test parameters: `sudo taskset -c 0-11 rteval --duration=24h --loads-cpulist=0-10 --measurement-cpulist=0-11`

![Figure: CPU 11 latency with load on neighboring CPUs](https://gist.githubusercontent.com/jschwe/3502dbf1e56c85e9bf1a340041885b33/raw/962244e4e5309507feb0b4ec0627efbabe064c85/2021-02-01.png)

Note: The error bars show the unbiased standard error of the mean.

> Also, how many classes of service (CLOS) does your CPU have? How was the
> cache bitmask divided vis-a-vis the available CLOSes? And did you assign
> isolated CPUs to one CLOS, and non-isolated CPUs to a different CLOS? Or
> was the division more granular?

I don't have access to the system anymore, but I think it had 8 CLOS available (according to resctrl).

I always used exclusive bitmasks. I mostly used one CLOS for the isolated CPUs, the default CLOS, and sometimes an additional CLOS for tid-based CAT. Due to the "exclusive" setting in resctrl I had to take away one way of the node 0 cache, even for CLOSes that were only intended for node 1, which is a bit unfortunate.
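As a sketch of the resctrl setup described above (the group name and bitmasks are illustrative; a real setup needs root and CAT-capable hardware):

```shell
# Mount resctrl and create a CLOS for the isolated CPUs.
mount -t resctrl resctrl /sys/fs/resctrl
mkdir /sys/fs/resctrl/rt

# Reserve cache ways for the RT group. With "exclusive" mode the mask
# must not overlap any other group's mask on ANY cache domain, which is
# why one way of the node 0 cache (domain 0) had to be given up even
# for a CLOS only intended for node 1.
echo "L3:0=0f00;1=00ff" > /sys/fs/resctrl/rt/schemata
echo exclusive > /sys/fs/resctrl/rt/mode

# CPU-based CAT: every task running on these CPUs uses the group's mask.
echo 1,3,5,7,9,11 > /sys/fs/resctrl/rt/cpus_list
```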

I also tested tid-based vs. CPU-based CAT on isolated CPUs, and the take-away was that it doesn't matter too much:

tid-based CAT visibly (negatively) impacts the best-case latencies (the 1-microsecond bin). However, the differences in the worst-case latencies were minor.
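The two assignment styles differ only in which resctrl file the assignment is written to. A sketch, with paths and process names illustrative:

```shell
# CPU-based CAT: any task scheduled on CPU 11 gets the group's allocation.
echo 11 > /sys/fs/resctrl/rt/cpus_list

# tid-based CAT: only the listed threads get the allocation, wherever
# they run. Each thread id is written individually to the tasks file.
for tid in $(ls /proc/$(pgrep -x cyclictest)/task); do
    echo "$tid" > /sys/fs/resctrl/rt/tasks
done
```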

In one test, I used CDP to reserve 4 ways (4 MiB) each for code and data (8 ways total) for one cyclictest instance (with 3 measurement threads). With CPU-based CAT the utilization oscillated between 0.98 MB and 1.11 MB; with tid-based CAT it oscillated between 98 kB and 163 kB.
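The CDP setup described above (4 ways each for code and data) would look roughly like this in resctrl; the bitmasks are illustrative:

```shell
# Remount resctrl with code/data prioritization (CDP) enabled.
umount /sys/fs/resctrl
mount -t resctrl -o cdp resctrl /sys/fs/resctrl

mkdir /sys/fs/resctrl/rt
# With CDP, the L3 resource splits into separate code and data schemata
# lines: here 4 ways for code and 4 ways for data (8 ways total).
printf 'L3CODE:0=000f;1=000f\nL3DATA:0=00f0;1=00f0\n' \
    > /sys/fs/resctrl/rt/schemata
```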

In the next test I only used CAT to reserve 2 ways (2 MiB) shared between code and data, again for one cyclictest instance with 3 measurement threads. In this case the CPU-based approach utilized between 0.45 MB and 0.85 MB of the reserved L3 cache, but the latencies measured by cyclictest were basically unchanged. The tid-based approach actually had a utilization of 0. I assume that's because more L3 was available to the default CLOS, and the relevant cache lines were never evicted from that part of the L3 cache, so the reservation never even came into play.
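The utilization numbers above can be read via resctrl's cache occupancy monitoring, assuming the hardware supports it; a sketch, with the group name `rt` and L3 domain 0 as assumptions:

```shell
# Requires a CPU with LLC occupancy monitoring (CQM/CMT).
# Reports the number of bytes of L3 (domain 0) currently occupied by
# tasks/CPUs assigned to the "rt" group.
cat /sys/fs/resctrl/rt/mon_data/mon_L3_00/llc_occupancy
```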


> Kind regards,
>
> --
> Ahmed S. Darwish
> Linutronix GmbH

Best regards,


Jonathan Schwender



