On Fri, Mar 3, 2023 at 2:57 PM Alexander Larsson <alexl@xxxxxxxxxx> wrote:
>
> On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@xxxxxxxxxx> wrote:
> >
> > Hello,
> >
> > Recently Giuseppe Scrivano and I have worked on[1] and proposed[2]
> > the Composefs filesystem. It is an opportunistically sharing,
> > validating image-based filesystem, targeting usecases like
> > validated ostree rootfs:es, validated container images that share
> > common files, as well as other image-based usecases.
> >
> > During the discussions of the composefs proposal (as seen on
> > LWN[3]) it has been proposed that (with some changes to overlayfs)
> > similar behaviour can be achieved by combining the overlayfs
> > "overlay.redirect" xattr with a read-only filesystem such as erofs.
> >
> > There are pros and cons to both these approaches, and the
> > discussion about their respective value has sometimes been heated.
> > We would like to have an in-person discussion at the summit,
> > ideally also involving more of the filesystem development
> > community, so that we can reach some consensus on what is the best
> > approach.
>
> In order to better understand the behaviour and requirements of the
> overlayfs+erofs approach, I spent some time implementing direct
> support for erofs in libcomposefs. So, with the current HEAD of
> github.com/containers/composefs you can now do:
>
> $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs
>
> This will produce an object store with the backing files, and an
> erofs file with the required overlayfs xattrs, including a made-up
> one called "overlay.fs-verity" containing the expected fs-verity
> digest for the lower dir. It also adds the required whiteouts to
> cover the 00-ff dirs from the lower dir.
>
> These erofs files are ordered similarly to the composefs files, and
> we give similar guarantees about their reproducibility, etc. So,
> they should be apples-to-apples comparable with the composefs
> images.
>
> Given this, I ran another set of performance tests on the original
> cs9 rootfs dataset, again measuring the time of `ls -lR`. I also
> tried to measure the memory use like this:
>
> # echo 3 > /proc/sys/vm/drop_caches
> # systemd-run --scope sh -c 'ls -lR mountpoint > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'
>
> These are the alternatives I tried:
>
> xfs:       the source of the image, a regular dir on xfs
> erofs:     the image.erofs above, on loopback
> erofs dio: the image.erofs above, on loopback with --direct-io=on
> ovl:       erofs above combined with overlayfs
> ovl dio:   erofs dio above combined with overlayfs
> cfs:       composefs mount of image.cfs
>
> All tests use the same objects dir, stored on xfs. The erofs and
> overlayfs implementations are from a stock 6.1.13 kernel, and the
> composefs module is from github HEAD.
>
> I tried loopback both with and without the direct-io option, because
> without direct-io enabled the kernel will double-cache the
> loopbacked data, as per[1].
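For context, the "ovl dio" setup quoted above corresponds roughly to
something like the commands below. The paths and the exact overlayfs
option combination are illustrative assumptions, not commands taken
from the mail:

# losetup --direct-io=on -f --show image.erofs
/dev/loop0
# mkdir -p /mnt/erofs /mnt/ovl
# mount -t erofs -o ro /dev/loop0 /mnt/erofs
# mount -t overlay overlay -o ro,metacopy=on,redirect_dir=follow,lowerdir=/mnt/erofs:objects /mnt/ovl

The erofs mount supplies the metadata tree whose overlay redirect
xattrs point into the objects dir stacked below it, and
--direct-io=on avoids the loopback double caching mentioned above;
the plain "ovl" case would be the same setup without --direct-io=on.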
> The produced images are:
>
>  8.9M image.cfs
> 11.3M image.erofs
>
> And they give these results:
>
>            | Cold cache | Warm cache | Mem use
>            |   (msec)   |   (msec)   |  (mb)
> -----------+------------+------------+---------
> xfs        |       1449 |        442 |      54
> erofs      |        700 |        391 |      45
> erofs dio  |        939 |        400 |      45
> ovl        |       1827 |        530 |     130
> ovl dio    |       2156 |        531 |     130
> cfs        |        689 |        389 |      51

It has been noted that the readahead done by kernel_read() may pull
unrelated data into memory, which skews the results in favour of
workloads that consume all the filesystem metadata (such as the
`ls -lR` usecase of the above test). In the table above this favours
composefs (which uses kernel_read() in some codepaths) as well as
non-dio erofs (the non-dio loopback device uses readahead too).

I updated composefs to not use kernel_read() here:
https://github.com/containers/composefs/pull/105

And a new kernel patch-set based on this is available at:
https://github.com/alexlarsson/linux/tree/composefs

The resulting tables are now (dropping the non-dio erofs):

           | Cold cache | Warm cache | Mem use
           |   (msec)   |   (msec)   |  (mb)
-----------+------------+------------+---------
xfs        |       1449 |        442 |      54
erofs dio  |        939 |        400 |      45
ovl dio    |       2156 |        531 |     130
cfs        |        833 |        398 |      51

           | Cold cache | Warm cache | Mem use
           |   (msec)   |   (msec)   |  (mb)
-----------+------------+------------+---------
ext4       |       1135 |        394 |      54
erofs dio  |        922 |        401 |      45
ovl dio    |       1810 |        532 |     149
ovl lazy   |       1063 |        523 |      87
cfs        |        768 |        459 |      51

So, while cfs is somewhat worse now for this particular usecase, my
overall analysis still stands.

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                Red Hat, Inc
       alexl@xxxxxxxxxx            alexander.larsson@xxxxxxxxx