On Wed, Aug 2, 2023 at 1:11 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Wed, Aug 2, 2023 at 12:40 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> >
> > On Tue 01-08-23 10:29:39, Yosry Ahmed wrote:
> > > On Tue, Aug 1, 2023 at 9:39 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > [...]
> > > > > Have you measured any potential regression for cgroup v2 which collects
> > > > > all this data without ever using it (AFAICS)?
> > > >
> > > > I did not. I did not expect noticeable regressions given that all the
> > > > extra work is done during flushing, which should mostly be done by the
> > > > asynchronous worker, but can also happen in the stats reading context.
> > > > Let me run the same script on cgroup v2 just in case and report back.
> > >
> > > A few runs on mm-unstable with this patch:
> > >
> > > # time cat /sys/fs/cgroup/cg*/memory.stat > /dev/null
> >
> > Is this really representative test to make? I would have expected the
> > overhead would be mostly in mem_cgroup_css_rstat_flush (if it is visible
> > at all of course). This would be more likely visible in all cpus busy
> > situation (you can try heavy parallel kernel build from tmpfs for
> > example).
>
>
> I see. You are more worried about asynchronous flushing eating cpu
> time rather than the synchronous flushing being slower. In fact, my
> test is actually not representative at all because probably most of
> the cgroups either do not have updates or the asynchronous flusher got
> to them first.
>
> Let me try a workload that is more parallel & cpu intensive and report
> back. I am thinking of parallel reclaim/refault loops since both
> reclaim and refault paths invoke stat updates and stat flushing.
>

I am back with more data.

So I wrote a small reclaim/refault stress test that creates (NR_CPUS *
2) cgroups, assigns them limits, runs a worker process in each cgroup
that allocates tmpfs memory equal to quadruple the limit (to invoke
reclaim) continuously, and then reads back the entire file (to invoke
refaults). All workers are run in parallel, and zram is used as a
swapping backend. Both reclaim and refault have conditional stats
flushing. I ran this on a machine with 112 cpus, once on mm-unstable,
and once on mm-unstable with this patch reverted. The script is
attached.

(1) A few runs without this patch:

# time ./stress_reclaim_refault.sh
real    0m9.949s
user    0m0.496s
sys     14m44.974s

# time ./stress_reclaim_refault.sh
real    0m10.049s
user    0m0.486s
sys     14m55.791s

# time ./stress_reclaim_refault.sh
real    0m9.984s
user    0m0.481s
sys     14m53.841s

(2) A few runs with this patch:

# time ./stress_reclaim_refault.sh
real    0m9.885s
user    0m0.486s
sys     14m48.753s

# time ./stress_reclaim_refault.sh
real    0m9.903s
user    0m0.495s
sys     14m48.339s

# time ./stress_reclaim_refault.sh
real    0m9.861s
user    0m0.507s
sys     14m49.317s

I do not see any regressions from this patch. There is actually a very
slight improvement. If I have to guess, maybe it's because we avoid
the percpu loop in count_shadow_nodes() when calling
lruvec_page_state_local(), but I could not prove this using perf, it's
probably in the noise.

Let me know if the testing is satisfactory for you. I can send an
updated commit log accordingly with a summary of this conversation.

> > --
> > Michal Hocko
> > SUSE Labs
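
For a rough summary of the sys times reported above, the runs can be
averaged with a few lines of shell. This is only a sketch: to_seconds and
mean_seconds are hypothetical helper names, not part of the attached
script, and three runs per configuration is a small sample, so the
means say nothing about significance.

```shell
#!/bin/bash

# Convert a `time` value such as "14m44.974s" into seconds.
# (Hypothetical helper, not part of the attached script.)
to_seconds() {
    awk -v t="$1" 'BEGIN {
        split(t, parts, "m")        # parts[1] = minutes, parts[2] = "SS.SSSs"
        sub(/s$/, "", parts[2])     # drop the trailing "s"
        printf "%.3f\n", parts[1] * 60 + parts[2]
    }'
}

# Print the arithmetic mean, in seconds, of any number of `time` values.
mean_seconds() {
    local t
    for t in "$@"; do
        to_seconds "$t"
    done | awk '{ sum += $1 } END { if (NR) printf "%.3f\n", sum / NR }'
}

# sys times copied from the runs above
echo "sys without patch: $(mean_seconds 14m44.974s 14m55.791s 14m53.841s)"
echo "sys with patch:    $(mean_seconds 14m48.753s 14m48.339s 14m49.317s)"
```

The two means come out within about half a percent of each other, which
is consistent with the difference being in the noise.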
#!/bin/bash

NR_CPUS=$(getconf _NPROCESSORS_ONLN)
NR_CGROUPS=$(( NR_CPUS * 2 ))
TEST_MB=50
TOTAL_MB=$((TEST_MB * NR_CGROUPS))
TMPFS=$(mktemp -d)
ROOT="/sys/fs/cgroup/"
ZRAM_DEV="/mnt/devtmpfs/zram0"

cleanup() {
  umount $TMPFS
  rm -rf $TMPFS
  for i in $(seq $NR_CGROUPS); do
    cgroup="$ROOT/cg$i"
    rmdir $cgroup
  done
  swapoff $ZRAM_DEV
  echo 1 > "/sys/block/zram0/reset"
}
trap cleanup INT QUIT EXIT

# Setup zram
echo $((TOTAL_MB << 20)) > "/sys/block/zram0/disksize"
mkswap $ZRAM_DEV
swapon $ZRAM_DEV
echo "Setup zram done"

# Create cgroups, set limits
echo "+memory" > "$ROOT/cgroup.subtree_control"
for i in $(seq $NR_CGROUPS); do
  cgroup="$ROOT/cg$i"
  mkdir $cgroup
  echo $(( (TEST_MB << 20) / 4)) > "$cgroup/memory.max"
done
echo "Setup cgroups done"

# Start workers to allocate tmpfs memory
mount -t tmpfs none $TMPFS
for i in $(seq $NR_CGROUPS); do
  cgroup="$ROOT/cg$i"
  f="$TMPFS/tmp$i"
  (echo 0 > "$cgroup/cgroup.procs" &&
    dd if=/dev/zero of=$f bs=1M count=$TEST_MB status=none &&
    cat $f > /dev/null)&
done

# Wait for workers
wait
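
The sizing used by the script can be checked in isolation, without
touching cgroups or zram. The sketch below repeats the same shell
arithmetic; NR_CPUS is hardcoded to 112 (the machine from the report)
as an assumption instead of being probed with getconf, and the variable
names limit_bytes, alloc_bytes and zram_bytes are illustrative only.

```shell
#!/bin/bash

# Same constants as the attached script, except that NR_CPUS is fixed
# to the 112-cpu test machine (assumption for illustration).
NR_CPUS=112
NR_CGROUPS=$(( NR_CPUS * 2 ))
TEST_MB=50

limit_bytes=$(( (TEST_MB << 20) / 4 ))          # value written to memory.max
alloc_bytes=$(( TEST_MB << 20 ))                # bytes each dd worker writes
zram_bytes=$(( (TEST_MB * NR_CGROUPS) << 20 ))  # value written to zram disksize

echo "cgroups:             $NR_CGROUPS"
echo "memory.max (bytes):  $limit_bytes"
echo "per-worker write:    $alloc_bytes"
echo "write / limit ratio: $(( alloc_bytes / limit_bytes ))"
echo "zram disksize:       $zram_bytes"
```

Each worker writes four times its cgroup's memory.max, which is what
forces reclaim, and the zram device is sized to hold the full amount
written by all workers.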