On 18/04/2024 23.00, Yosry Ahmed wrote:
On Thu, Apr 18, 2024 at 4:00 AM Jesper Dangaard Brouer <hawk@xxxxxxxxxx> wrote:
On 18/04/2024 04.21, Yosry Ahmed wrote:
On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <hawk@xxxxxxxxxx> wrote:
This patch aims to reduce userspace-triggered pressure on the global
cgroup_rstat_lock by introducing a mechanism to limit how often reading
stat files causes cgroup rstat flushing.
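
As a rough illustration of the idea only (this is not the code from the
patch), limiting how often a read triggers a flush can be sketched as a
time-based check. The names `flush_limiter` and `flush_allowed` and the
interval field are hypothetical, and the sketch is single-threaded
userspace C that leaves out the locking a kernel version would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limiter state; not a kernel structure. */
struct flush_limiter {
    uint64_t last_flush_ns;   /* timestamp of the most recent flush */
    uint64_t min_interval_ns; /* minimum spacing between two flushes */
    bool flushed_once;        /* false until the first flush happens */
};

/*
 * Return true if the caller should flush now, false if the previous
 * flush is recent enough that the reader can reuse the existing stats.
 */
static bool flush_allowed(struct flush_limiter *fl, uint64_t now_ns)
{
    assert(fl != NULL);

    if (fl->flushed_once &&
        now_ns - fl->last_flush_ns < fl->min_interval_ns)
        return false; /* too soon: skip the flush */

    fl->last_flush_ns = now_ns;
    fl->flushed_once = true;
    return true;
}
```

The trade-off in a scheme like this is staleness: a reader that is told
to skip the flush sees stats that may be up to `min_interval_ns` old.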
[...]
Taking a step back, I think this series is trying to address two
issues in one go: interrupt handling latency and lock contention.
Yes, patches 2 and 3 are essentially independent and address two different
aspects.
While both are related because reducing flushing reduces irq
disablement, I think it would be better if we can fix that issue
separately with a more fundamental solution (e.g. using a mutex or
dropping the lock at each CPU boundary).
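
To make the second option concrete, here is a minimal userspace sketch of
dropping the lock at each CPU boundary. It is an analogy under stated
assumptions, not the rstat code: `stat_lock`, `global_total`, and
`flush_per_cpu` are hypothetical names, and a pthread mutex stands in for
the kernel lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical global lock and aggregate; stand-ins for kernel state. */
static pthread_mutex_t stat_lock = PTHREAD_MUTEX_INITIALIZER;
static long global_total;

/*
 * Fold each per-CPU delta into the global total. The lock is taken and
 * released once per CPU, so the longest hold time is one CPU's worth of
 * work and waiters can get in between iterations.
 * Returns the amount folded in by this call.
 */
static long flush_per_cpu(long *percpu_delta, int ncpus)
{
    long folded = 0;

    assert(percpu_delta != NULL && ncpus > 0);

    for (int cpu = 0; cpu < ncpus; cpu++) {
        pthread_mutex_lock(&stat_lock);
        global_total += percpu_delta[cpu];
        folded += percpu_delta[cpu];
        percpu_delta[cpu] = 0;
        pthread_mutex_unlock(&stat_lock); /* lock dropped at the CPU boundary */
    }
    return folded;
}
```

The cost of this shape is that the flush is no longer atomic as a whole:
another thread can observe `global_total` with only some CPUs folded in.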
After that, we can more clearly evaluate the lock contention problem
with data purely about flushing latency, without taking into
consideration the irq handling problem.
Does this make sense to you?
Yes, makes sense.
So, you are suggesting we start with the mutex change? (patch 2)
(which still needs some adjustments/tuning)
This makes sense to me, as there are likely many solutions to reducing
the pressure on the lock.
With the tracepoint patch in place, I/we can measure the pressure on the
lock, and I plan to do this across our CF fleet. Then we can slowly
work on improving lock contention and evaluate this on our fleets.
--Jesper
p.s.
Setting expectations:
- Going on vacation today, so will resume work after 29th April.