Hi Ahmed,
On 3/24/21 10:32 AM, Ahmed S. Darwish wrote:
> Hi Jonathan,
>
> Since I'm doing some CAT-related stuff on RT tasks vs. GPU workloads,
> I'm curious, how much was the benefit of CAT ON/OFF?
>
> I'm assuming you're testing iGPU workloads and not on a dedicated GPU
> since you are mentioning CAT. Or is there any benefit of using CAT with
> a dedicated GPU?
>
> In your benchmarks you show that the combination of --mainaffinity, CPU
> isolation, and CAT, improves worst case latency by 2 micro seconds. If
> you keep everything as-is, but disable only CAT, how much change happens
> in the results?
First I'd like to mention that my test system had an inclusive
cache architecture. I'd guess that the difference between CAT and no CAT
is smaller for exclusive or non-inclusive caches (assuming cyclictest is
running on an isolated CPU).
So the results will depend on the number of isolated CPUs and on how
much of the shared L3 cache the load on the housekeeping CPUs uses.
Rendered Markdown:
https://gist.github.com/jschwe/3502dbf1e56c85e9bf1a340041885b33
# Isolation capabilities without CAT
## Test 2021-01-31 - Isolate all CPUs on NUMA node 1
The figure below shows a worst-case latency of 4 microseconds
measured by cyclictest on the isolated CPUs on NUMA node 1.
cmdline: `nosmt
isolcpus=domain,managed_irq,wq,rcu,misc,kthread,1,3,5,7,9,11
rcu_nocbs=1,3,5,7,9,11 irqaffinity=0,2,4 maxcpus=12 rcu_nocb_poll
nowatchdog tsc=nowatchdog processor.max_cstate=1 intel_idle.max_cstate=0`
Test parameters: `sudo taskset -c 0-11 rteval --duration=24h
--loads-cpulist=0,2,4,6,8,10 --measurement-cpulist=0-11`
![Figure: Latency of completely isolated node vs housekeeping
node](https://gist.githubusercontent.com/jschwe/3502dbf1e56c85e9bf1a340041885b33/raw/962244e4e5309507feb0b4ec0627efbabe064c85/2021-01-31.png)
## Test 2021-02-01 - Isolate only CPU 11
The figure below shows a worst-case latency of 11 microseconds for the
isolated CPU 11.
Interestingly, the worst-case latencies of the housekeeping CPUs also
increased compared to the previous test.
This is consistent with other tests I made, though, and the worst-case
latency of the housekeeping CPUs is reduced if I isolate all or
all-but-one of the CPUs on node 1.
cmdline: `nosmt isolcpus=domain,managed_irq,wq,rcu,misc,kthread,11
rcu_nocbs=11 irqaffinity=0,2,4 maxcpus=12 rcu_nocb_poll nowatchdog
tsc=nowatchdog processor.max_cstate=1 intel_idle.max_cstate=0`
Test parameters: `sudo taskset -c 0-11 rteval --duration=24h
--loads-cpulist=0-10 --measurement-cpulist=0-11`
![Figure: CPU 11 latency with load on neighboring
CPUs](https://gist.githubusercontent.com/jschwe/3502dbf1e56c85e9bf1a340041885b33/raw/962244e4e5309507feb0b4ec0627efbabe064c85/2021-02-01.png)
Note: The error bars show the unbiased standard error of the mean.
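
For reference, a minimal Python sketch of how such error bars could be computed. It assumes that "unbiased" refers to the Bessel-corrected (n - 1) sample variance; the function name and the sample values are made up for illustration and are not taken from the measurements above.

```python
import math


def standard_error_of_mean(samples):
    """Standard error of the mean, based on the Bessel-corrected sample variance."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = sum(samples) / n
    # Dividing by (n - 1) instead of n gives the unbiased variance estimate.
    variance = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return math.sqrt(variance / n)


# Hypothetical per-run worst-case latencies in microseconds.
runs = [2, 4, 4, 4, 5, 5, 7, 9]
print(f"SEM: {standard_error_of_mean(runs):.3f} us")
```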
> Also, how many classes of service (CLOS) your CPU has? How was the cache
> bitmask divided vis-a-vis the available CLOSes? And did you assign
> isolated CPUs to one CLOS, and non-isolated CPUs to a different CLOS? Or
> was the division more granular?
I don't have access to the system anymore, but I think it had 8 CLOS
available (according to resctrl).
I always used exclusive bitmasks. I mostly used one CLOS for the
isolated CPUs, the default CLOS, and sometimes an additional CLOS for
tid-based CAT. Due to the "exclusive" setting in resctrl I had to take
away one way of the node 0 cache, even for CLOS that were only intended
for node 1, which is a bit unfortunate.
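
To illustrate the non-overlapping bitmask split described above, here is a minimal Python sketch that builds contiguous capacity bitmasks and formats them as resctrl `schemata` lines. The total of 11 ways per L3 cache, the cache IDs 0 and 1, and all helper names are assumptions for illustration; the real values depend on the CPU and were not recorded here.

```python
def contiguous_mask(first_way, num_ways):
    """Capacity bitmask with num_ways consecutive bits set, starting at first_way."""
    if first_way < 0 or num_ways < 1:
        raise ValueError("need first_way >= 0 and num_ways >= 1")
    return ((1 << num_ways) - 1) << first_way


def split_ways(total_ways, reserved_ways):
    """Return (default_mask, reserved_mask) that do not overlap."""
    if not 0 < reserved_ways < total_ways:
        raise ValueError("reserved_ways must be between 1 and total_ways - 1")
    reserved = contiguous_mask(total_ways - reserved_ways, reserved_ways)
    default = contiguous_mask(0, total_ways - reserved_ways)
    return default, reserved


def schemata_line(masks_by_cache_id):
    """Format an L3 line in the resctrl schemata syntax, e.g. 'L3:0=3ff;1=1ff'."""
    parts = [f"{cid}={mask:x}" for cid, mask in sorted(masks_by_cache_id.items())]
    return "L3:" + ";".join(parts)


TOTAL_WAYS = 11  # assumed value, not taken from the test system

# Node 0: an exclusive group still needs a non-empty mask there, so one way is lost.
default0, isolated0 = split_ways(TOTAL_WAYS, 1)
# Node 1: reserve two ways for the isolated CPUs.
default1, isolated1 = split_ways(TOTAL_WAYS, 2)

print("default :", schemata_line({0: default0, 1: default1}))
print("isolated:", schemata_line({0: isolated0, 1: isolated1}))
```

The resulting strings would be written to the `schemata` file of the respective resctrl group; that step needs root and a mounted resctrl filesystem, so it is left out of the sketch.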
I also tested tid-based vs. CPU-based CAT on isolated CPUs, and the
takeaway was that it doesn't matter too much:
tid-based CAT visibly (negatively) impacts the best-case latencies (1
microsecond bin). However, the differences in the worst-case
latencies were minor.
In one test, I used CDP to reserve 4 ways (4 MiB) each for code and data
(so 8 ways in total) for 1 cyclictest instance (with 3 measurement threads).
For CPU-based CAT the utilization oscillated between 0.98 MB and 1.11 MB.
For tid-based CAT, the utilization oscillated between 98 kB and 163 kB.
In the next test I only used CAT to reserve 2 ways (2 MiB) shared
between code and data, also for 1 cyclictest instance with 3
measurement threads. In this case the CPU-based approach utilized
between 0.45 MB and 0.85 MB of the reserved L3 cache, but the latencies
measured by cyclictest were basically unchanged. The tid-based approach
actually had a utilization of 0. I'm assuming that's because more L3 was
available to the default CLOS, and the relevant cache lines were never
evicted from that part of the L3 cache, so the reservation didn't even
come into play there.
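
One way to obtain utilization numbers like these is the `llc_occupancy` counter that resctrl exposes per monitoring domain. The following Python sketch reads it for a group and relates it to the reserved capacity. The group path in the usage comment is hypothetical, the 1 MiB way size follows from the "4 ways (4 MiB)" figure above, and whether the numbers above were collected exactly this way is an assumption.

```python
from pathlib import Path

WAY_SIZE_BYTES = 1024 * 1024  # 1 MiB per way, as implied by "4 ways (4 MiB)"


def read_llc_occupancy(group_dir):
    """Return {domain_name: occupied_bytes} for every L3 domain of a resctrl group."""
    occupancy = {}
    pattern = "mon_L3_*/llc_occupancy"
    for counter in sorted(Path(group_dir, "mon_data").glob(pattern)):
        occupancy[counter.parent.name] = int(counter.read_text().strip())
    return occupancy


def utilization(occupied_bytes, reserved_ways, way_size=WAY_SIZE_BYTES):
    """Fraction of the reserved cache capacity that is currently occupied."""
    if reserved_ways < 1:
        raise ValueError("reserved_ways must be at least 1")
    return occupied_bytes / (reserved_ways * way_size)


# Usage on a live system (hypothetical group name, needs resctrl mounted):
#   for domain, used in read_llc_occupancy("/sys/fs/resctrl/isolated").items():
#       print(domain, f"{utilization(used, reserved_ways=2):.1%}")
```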
> Kind regards,
>
> --
> Ahmed S. Darwish
> Linutronix GmbH
Best regards
Jonathan Schwender