On Wed, Jul 28, 2021 at 03:14:31PM -0400, J. Bruce Fields wrote:
> On Wed, Jul 28, 2021 at 08:26:12AM -0400, Neal Gompa wrote:
> > I think this is behavior people generally expect, but I wonder what
> > the consequences of this would be with huge numbers of subvolumes. If
> > there are hundreds or thousands of them (which is quite possible on
> > SUSE systems, for example, with its auto-snapshotting regime), this
> > would be a mess, wouldn't it?
>
> I'm surprised that btrfs is special here.  Doesn't anyone have thousands
> of lvm snapshots?  Or is it that they do but they're not normally
> mounted?

Unprivileged users can't create lvm snapshots as easily or quickly as
they can run mkdir (well, ok, mkdir and fsync).  lvm doesn't scale well
past more than a few dozen snapshots of the same original volume:
performance degrades linearly in the number of snapshots whenever the
original LV is modified.

btrfs is the opposite: users can create and delete as many snapshots as
they like, at a cost more expensive than mkdir but less expensive than
'cp -a', and they only pay IO costs for writes to the subvols they
modify.  So some btrfs users use snapshots in places where more
traditional tools like 'cp -a' or 'git checkout' are used on other
filesystems.

e.g. a build system might snapshot a git working tree containing a
checked-out and built baseline revision, then loop: make a snapshot,
apply one patch from an integration branch in the snapshot directory,
and build there incrementally.  The next revision makes a snapshot of
its parent revision's subvol and builds the next patch.  If there are
merges in the integration branch, the builder can go back to the parent
revisions, create a new snapshot, apply the patch, and build in a
snapshot on both sides of the merge.
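Concretely, that loop might look something like the sketch below.  All
paths, revision names, and the build command are made up for
illustration; it defaults to a dry run (RUN=echo) that only prints the
btrfs/git/make commands it would issue, and you'd set RUN= (empty) to
execute them for real:

```shell
# Sketch of the snapshot-per-patch build loop described above.
# Paths and the build step are hypothetical; RUN=echo by default,
# so this only prints the commands instead of running them.
RUN=${RUN:-echo}

build_chain() {
    base=$1; shift                  # subvol holding the built baseline
    for rev in "$@"; do
        # Snapshot the parent revision's subvol (cheap, copy-on-write).
        $RUN btrfs subvolume snapshot "$base" "/build/$rev"
        # Apply one patch from the integration branch in the snapshot...
        $RUN git -C "/build/$rev" cherry-pick "$rev"
        # ...and build incrementally on top of the parent's artifacts.
        $RUN make -C "/build/$rev"
        base=/build/$rev            # next revision snapshots this one
    done
}

# Dry run: prints one snapshot/cherry-pick/make triple per revision.
build_chain /build/baseline rev1 rev2
```

Building on both sides of a merge is just calling build_chain twice with
the same starting subvol.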
After testing picks a winner, the builder can simply delete all the
snapshots except the one for the version that won testing.  (There is
no requirement to commit the snapshot back to the origin LV as in lvm;
either subvol can be destroyed without requiring any action to preserve
the other.)

You can do a similar thing with overlayfs, but it runs into problems
with all the mount points.  In btrfs, the mount points are persistent
because they're built into the filesystem.  With overlayfs, you have to
save and restore them yourself so they persist across reboots (unless
that feature has been added since I last looked).

I'm looking at a few machines here, and if all the subvols were visible
to 'df', its output would be somewhere around 3-5 MB.  That's too much,
so we'd have to hack up df to not show the same btrfs twice...as well
as every monitoring tool that reports free space...which sounds similar
to the problems we're trying to avoid.

Ideally there would be a way to turn this on or off.  It is creating a
set of new problems that is the complement of the set we're trying to
fix in this change.

> --b.
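P.S. For a sense of scale, the "hack up df" workaround might amount to
a filter like the one below (a sketch only, not a proposal): it keeps
the first df line per source device, which collapses all subvols of one
btrfs but would also hide legitimate repeated mounts of other
filesystems.  Having to repeat this in every monitoring tool is exactly
the problem.

```shell
# Sketch: suppress duplicate df lines that share a source device.
# This is the kind of special-casing every space-reporting tool
# would need to grow if all subvols became visible.
dedup_df() {
    # Pass the header (line 1), then the first line per device ($1).
    awk 'NR == 1 || !seen[$1]++'
}

df -P | dedup_df
```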