on 1/30/2024 5:00 AM, Tejun Heo wrote:
> Hello,
>
> On Wed, Jan 24, 2024 at 10:01:47AM +0800, Kemeng Shi wrote:
>> Hi Tejun, thanks for the reply. A cgroup wb belongs to both the global wb
>> domain and a cgroup domain. I agree with the way we calculate the wb's
>> threshold in the global domain as you described above. This patch tries to
>> fix the calculation of the wb's threshold in the cgroup domain, which is
>> currently wb_calc_thresh(mdtc->wb, mdtc->bg_thresh), i.e.:
>>   (wb bandwidth) / (system bandwidth) * (cgroup domain threshold)
>> The cgroup domain threshold is
>>   (memory of cgroup domain) / (memory of system) * (system threshold).
>> As a result, the wb's threshold in the cgroup domain is smaller than expected.
>>
>> Consider the following domain hierarchy:
>>                 global domain (100G)
>>                 /                  \
>>         cgroup domain1(50G)   cgroup domain2(50G)
>>                 |                   |
>> bdi            wb1                 wb2
>> Assume wb1 and wb2 have the same bandwidth.
>> We have global domain bg_thresh 10G and cgroup domain bg_thresh 5G.
>> Then we have:
>> wb's thresh in global domain = 10G * (wb bandwidth) / (system bandwidth)
>>                              = 10G * 1/2 = 5G
>> wb's thresh in cgroup domain = 5G * (wb bandwidth) / (system bandwidth)
>>                              = 5G * 1/2 = 2.5G
>> In the end, wb1 and wb2 are each limited at 2.5G, so the system is limited
>> at 5G, which is less than the global domain bg_thresh of 10G.
>>
>> After the fix, the threshold in the cgroup domain will be:
>>   (wb bandwidth) / (cgroup bandwidth) * (cgroup domain threshold)
>> so wb1 and wb2 will each be limited at 5G and the system will be limited at
>> 10G, which equals the global domain bg_thresh of 10G.
>>
>> As I haven't taken a deep look into the memory cgroup code, please correct
>> me if anything is wrong. Thanks!
>>> Also, how is this tested? Was there a case where the existing code
>>> misbehaved that's improved by this patch? Or is this just from reading code?
>>
>> This is just from reading the code. Does the case shown above convince you?
>> Looking forward to your reply, thanks!
>
> So, the explanation makes some sense to me but can you please construct a
> case to actually demonstrate the problem and fix? I don't think it'd be wise
> to apply the change without actually observing the code change does what it
> says it does.

Hi Tejun, sorry for the delay. I found an issue that keeps triggering
writeback even when dirty pages are below the dirty background threshold,
which made it difficult to observe the expected improvement from this patch.
I tried to fix it in [1] and tested this patch on top of those fix patches.
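Before getting to the test, here is a tiny userspace sketch that just
restates the arithmetic from the hierarchy example quoted above (plain C,
illustrative only, not kernel code; the 10G/5G thresholds and the 1/2
bandwidth share are the hypothetical numbers from that example):

/* Illustrative userspace sketch only -- not kernel code. */
#include <stdio.h>

int main(void)
{
	double global_thresh = 10.0;   /* global domain bg_thresh, in GiB */
	double cgroup_thresh = 5.0;    /* cgroup domain bg_thresh, in GiB */
	double wb_share_of_sys = 0.5;  /* wb bandwidth / system bandwidth */
	double wb_share_of_cg = 1.0;   /* wb bandwidth / cgroup bandwidth */

	printf("wb thresh in global domain = %.1fG\n",
	       global_thresh * wb_share_of_sys);        /* 5G per wb */

	/* Before the patch: cgroup wb thresh scales by system bandwidth */
	printf("before: wb thresh in cgroup domain = %.1fG\n",
	       cgroup_thresh * wb_share_of_sys);        /* 2.5G per wb */

	/* After the patch: cgroup wb thresh scales by cgroup bandwidth */
	printf("after:  wb thresh in cgroup domain = %.1fG\n",
	       cgroup_thresh * wb_share_of_cg);         /* 5G per wb */

	return 0;
}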
Run the test as follows:

/* make background writeback easier to observe */
echo 300000 > /proc/sys/vm/dirty_expire_centisecs
echo 100 > /proc/sys/vm/dirty_writeback_centisecs

/* enable memory and io cgroup */
echo "+memory +io" > /sys/fs/cgroup/cgroup.subtree_control

/* run fio in group1 with shell */
cd /sys/fs/cgroup
mkdir group1
cd group1
echo 10G > memory.high
echo 10G > memory.max
echo $$ > cgroup.procs
mkfs.ext4 -F /dev/vdb
mount /dev/vdb /bdi1/
fio -name test -filename=/bdi1/file -size=800M -ioengine=libaio -bs=4K -iodepth=1 -rw=write -direct=0 --time_based -runtime=60 -invalidate=0

/* run another fio in group2 with another shell */
cd /sys/fs/cgroup
mkdir group2
cd group2
echo 10G > memory.high
echo 10G > memory.max
echo $$ > cgroup.procs
mkfs.ext4 -F /dev/vdc
mount /dev/vdc /bdi2/
fio -name test -filename=/bdi2/file -size=800M -ioengine=libaio -bs=4K -iodepth=1 -rw=write -direct=0 --time_based -runtime=60 -invalidate=0

Before the fix we got (results of three runs):
fio1
  WRITE: bw=1304MiB/s (1367MB/s), 1304MiB/s-1304MiB/s (1367MB/s-1367MB/s), io=76.4GiB (82.0GB), run=60001-60001msec
  WRITE: bw=1351MiB/s (1417MB/s), 1351MiB/s-1351MiB/s (1417MB/s-1417MB/s), io=79.2GiB (85.0GB), run=60001-60001msec
  WRITE: bw=1373MiB/s (1440MB/s), 1373MiB/s-1373MiB/s (1440MB/s-1440MB/s), io=80.5GiB (86.4GB), run=60001-60001msec
fio2
  WRITE: bw=1134MiB/s (1190MB/s), 1134MiB/s-1134MiB/s (1190MB/s-1190MB/s), io=66.5GiB (71.4GB), run=60001-60001msec
  WRITE: bw=1414MiB/s (1483MB/s), 1414MiB/s-1414MiB/s (1483MB/s-1483MB/s), io=82.8GiB (88.0GB), run=60001-60001msec
  WRITE: bw=1469MiB/s (1540MB/s), 1469MiB/s-1469MiB/s (1540MB/s-1540MB/s), io=86.0GiB (92.4GB), run=60001-60001msec

After the fix we got (results of three runs):
fio1
  WRITE: bw=1719MiB/s (1802MB/s), 1719MiB/s-1719MiB/s (1802MB/s-1802MB/s), io=101GiB (108GB), run=60001-60001msec
  WRITE: bw=1723MiB/s (1806MB/s), 1723MiB/s-1723MiB/s (1806MB/s-1806MB/s), io=101GiB (108GB), run=60001-60001msec
  WRITE: bw=1691MiB/s (1774MB/s), 1691MiB/s-1691MiB/s (1774MB/s-1774MB/s), io=99.2GiB (106GB), run=60036-60036msec
fio2
  WRITE: bw=1692MiB/s (1774MB/s), 1692MiB/s-1692MiB/s (1774MB/s-1774MB/s), io=99.1GiB (106GB), run=60001-60001msec
  WRITE: bw=1681MiB/s (1763MB/s), 1681MiB/s-1681MiB/s (1763MB/s-1763MB/s), io=98.5GiB (106GB), run=60001-60001msec
  WRITE: bw=1671MiB/s (1752MB/s), 1671MiB/s-1671MiB/s (1752MB/s-1752MB/s), io=97.9GiB (105GB), run=60001-60001msec

I also added code to print the pages written by writeback; after this fix the
pages written by writeback are reduced a lot and are rare.

[1] https://lore.kernel.org/linux-fsdevel/20240208172024.23625-2-shikemeng@xxxxxxxxxxxxxxx/T/#u

>
> Thanks.
>