Re: [PATCH v3] mm: memcg: use rstat for non-hierarchical stats

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Wed, 2 Aug 2023 15:02:55 -0700

On Wed, Aug 2, 2023 at 1:11 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Wed, Aug 2, 2023 at 12:40 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> >
> > On Tue 01-08-23 10:29:39, Yosry Ahmed wrote:
> > > On Tue, Aug 1, 2023 at 9:39 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > [...]
> > > > > Have you measured any potential regression for cgroup v2 which collects
> > > > > all this data without ever using it (AFAICS)?
> > > >
> > > > I did not. I did not expect noticeable regressions given that all the
> > > > extra work is done during flushing, which should mostly be done by the
> > > > asynchronous worker, but can also happen in the stats reading context.
> > > > Let me run the same script on cgroup v2 just in case and report back.
> > >
> > > A few runs on mm-unstable with this patch:
> > >
> > > # time cat /sys/fs/cgroup/cg*/memory.stat > /dev/null
> >
> > Is this really representative test to make? I would have expected the
> > overhead would be mostly in mem_cgroup_css_rstat_flush (if it is visible
> > at all of course). This would be more likely visible in all cpus busy
> > situation (you can try heavy parallel kernel build from tmpfs for
> > example).
>
>
> I see. You are more worried about asynchronous flushing eating cpu
> time rather than the synchronous flushing being slower. In fact, my
> test is actually not representative at all because probably most of
> the cgroups either do not have updates or the asynchronous flusher got
> to them first.
>
> Let me try a workload that is more parallel & cpu intensive and report
> back. I am thinking of parallel reclaim/refault loops since both
> reclaim and refault paths invoke stat updates and stat flushing.
>

I am back with more data.

So I wrote a small reclaim/refault stress test that creates (NR_CPUS *
2) cgroups, assigns them limits, runs a worker process in each cgroup
that allocates tmpfs memory equal to quadruple the limit (to invoke
reclaim) continuously, and then reads back the entire file (to invoke
refaults). All workers are run in parallel, and zram is used as a
swapping backend. Both reclaim and refault have conditional stats
flushing. I ran this on a machine with 112 cpus, once on mm-unstable,
and once on mm-unstable with this patch reverted. The script is
attached.

(1) A few runs without this patch:

# time ./stress_reclaim_refault.sh
real 0m9.949s
user 0m0.496s
sys 14m44.974s

# time ./stress_reclaim_refault.sh
real 0m10.049s
user 0m0.486s
sys 14m55.791s

# time ./stress_reclaim_refault.sh
real 0m9.984s
user 0m0.481s
sys 14m53.841s

(2) A few runs with this patch:

# time ./stress_reclaim_refault.sh
real 0m9.885s
user 0m0.486s
sys 14m48.753s

# time ./stress_reclaim_refault.sh
real 0m9.903s
user 0m0.495s
sys 14m48.339s

# time ./stress_reclaim_refault.sh
real 0m9.861s
user 0m0.507s
sys 14m49.317s

I do not see any regressions from this patch. There is actually a very
slight improvement. If I have to guess, maybe it's because we avoid
the percpu loop in count_shadow_nodes() when calling
lruvec_page_state_local(), but I could not prove this using perf, it's
probably in the noise.

Let me know if the testing is satisfactory for you. I can send an
updated commit log accordingly with a summary of this conversation.

> > --
> > Michal Hocko
> > SUSE Labs
#!/bin/bash

NR_CPUS=$(getconf _NPROCESSORS_ONLN)
NR_CGROUPS=$(( NR_CPUS * 2 ))
TEST_MB=50
TOTAL_MB=$((TEST_MB * NR_CGROUPS))
TMPFS=$(mktemp -d)
ROOT="/sys/fs/cgroup/"
ZRAM_DEV="/mnt/devtmpfs/zram0"

cleanup() {
  umount $TMPFS
  rm -rf $TMPFS
  for i in $(seq $NR_CGROUPS); do
    cgroup="$ROOT/cg$i"
    rmdir $cgroup
  done
  swapoff $ZRAM_DEV
  echo 1 > "/sys/block/zram0/reset"
}
trap cleanup INT QUIT EXIT

# Setup zram
echo $((TOTAL_MB << 20)) > "/sys/block/zram0/disksize"
mkswap $ZRAM_DEV
swapon $ZRAM_DEV
echo "Setup zram done"

# Create cgroups, set limits
echo "+memory" > "$ROOT/cgroup.subtree_control"
for i in $(seq $NR_CGROUPS); do
  cgroup="$ROOT/cg$i"
  mkdir $cgroup
  echo $(( (TEST_MB << 20) / 4)) > "$cgroup/memory.max"
done
echo "Setup cgroups done"

# Start workers to allocate tmpfs memory
mount -t tmpfs none $TMPFS
for i in $(seq $NR_CGROUPS); do
  cgroup="$ROOT/cg$i"
  f="$TMPFS/tmp$i"
  (echo 0 > "$cgroup/cgroup.procs" &&
    dd if=/dev/zero of=$f bs=1M count=$TEST_MB status=none &&
    cat $f > /dev/null)&
done

# Wait for workers
wait