On Wed, Jul 28, 2021 at 03:14:31PM -0400, J. Bruce Fields wrote:
> On Wed, Jul 28, 2021 at 08:26:12AM -0400, Neal Gompa wrote:
> > I think this is behavior people generally expect, but I wonder what
> > the consequences of this would be with huge numbers of subvolumes. If
> > there are hundreds or thousands of them (which is quite possible on
> > SUSE systems, for example, with its auto-snapshotting regime), this
> > would be a mess, wouldn't it?
>
> I'm surprised that btrfs is special here.  Doesn't anyone have thousands
> of lvm snapshots?  Or is it that they do but they're not normally
> mounted?

Unprivileged users can't create lvm snapshots as easily or quickly as
they can run mkdir (well, ok, mkdir and fsync).  lvm doesn't scale well
past more than a few dozen snapshots of the same original volume:
performance degrades linearly in the number of snapshots whenever the
original LV is modified.

btrfs is the opposite: users can create and delete as many snapshots as
they like, at a cost more expensive than mkdir but less expensive than
'cp -a', and they only pay IO costs for writes to the subvols they
modify.  So some btrfs users use snapshots in places where more
traditional tools like 'cp -a' or 'git checkout' are used on other
filesystems.

e.g. a build system might snapshot a git working tree containing a
checked-out and built baseline revision, then loop: make a snapshot,
apply one patch from an integration branch in the snapshot directory,
and build there incrementally.  The next revision makes a snapshot of
its parent revision's subvol and builds the next patch.  If there are
merges in the integration branch, the builder can go back to the parent
revisions, create a new snapshot, apply the patch, and build in a
snapshot on both sides of the merge.
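Concretely, that loop might look something like the sketch below.  All
paths, revision names, and the build command are made up for
illustration; it defaults to a dry run (RUN=echo) that only prints the
btrfs/git/make commands it would issue, and you'd set RUN= (empty) to
execute them for real:

```shell
# Sketch of the snapshot-per-patch build loop described above.
# Paths and the build step are hypothetical; RUN=echo by default,
# so this only prints the commands instead of running them.
RUN=${RUN:-echo}

build_chain() {
    base=$1; shift                  # subvol holding the built baseline
    for rev in "$@"; do
        # Snapshot the parent revision's subvol (cheap, copy-on-write).
        $RUN btrfs subvolume snapshot "$base" "/build/$rev"
        # Apply one patch from the integration branch in the snapshot...
        $RUN git -C "/build/$rev" cherry-pick "$rev"
        # ...and build incrementally on top of the parent's artifacts.
        $RUN make -C "/build/$rev"
        base=/build/$rev            # next revision snapshots this one
    done
}

# Dry run: prints one snapshot/cherry-pick/make triple per revision.
build_chain /build/baseline rev1 rev2
```

Building on both sides of a merge is just calling build_chain twice with
the same starting subvol.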
After testing picks a winner, the builder can simply delete all the
snapshots except the one for the version that won testing.  (There is
no requirement to commit the snapshot back to the origin LV as in lvm;
either subvol can be destroyed without requiring any action to preserve
the other.)

You can do a similar thing with overlayfs, but it runs into problems
with all the mount points.  In btrfs, the mount points are persistent
because they're built into the filesystem.  With overlayfs, you have to
save and restore them yourself so they persist across reboots (unless
that feature has been added since I last looked).

I'm looking at a few machines here, and if all the subvols were visible
to 'df', its output would be somewhere around 3-5 MB.  That's too much,
so we'd have to hack up df to not show the same btrfs twice...as well
as every monitoring tool that reports free space...which sounds similar
to the problems we're trying to avoid.

Ideally there would be a way to turn this on or off.  It is creating a
set of new problems that is the complement of the set we're trying to
fix in this change.

> --b.
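P.S. For a sense of scale, the "hack up df" workaround might amount to
a filter like the one below (a sketch only, not a proposal): it keeps
the first df line per source device, which collapses all subvols of one
btrfs but would also hide legitimate repeated mounts of other
filesystems.  Having to repeat this in every monitoring tool is exactly
the problem.

```shell
# Sketch: suppress duplicate df lines that share a source device.
# This is the kind of special-casing every space-reporting tool
# would need to grow if all subvols became visible.
dedup_df() {
    # Pass the header (line 1), then the first line per device ($1).
    awk 'NR == 1 || !seen[$1]++'
}

df -P | dedup_df
```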