On 9/10/13 11:56 AM, Josef Bacik wrote:
> On Tue, Sep 10, 2013 at 08:36:55AM -0700, Mark Fasheh wrote:
>> On Mon, Aug 12, 2013 at 04:47:52AM -0700, Christoph Hellwig wrote:
>>> On Thu, Aug 08, 2013 at 11:44:54AM -0400, Josef Bacik wrote:
>>>> On Thu, Aug 08, 2013 at 06:48:05AM -0700, Christoph Hellwig wrote:
>>>>> On Thu, Aug 08, 2013 at 09:02:07AM -0400, Josef Bacik wrote:
>>>>>> This won't work, try having 10000 subvolumes with dirty inodes and do sync then
>>>>>> go skiing, you'll have time :). Thanks,
>>>>>
>>>>> Why would the dirty inodes make any difference? If you share the bdi
>>>>> between the subvolumes the sync workflow should be exactly the same
>>>>> still.
>>>>>
>>>>
>>>> If we could dis-entangle vfsmounts from sb's and have it so you could have
>>>> multiple vfsmounts with just one sb that would solve at least the in-kernel
>>>> confusion, but I think we still have the userspace confusion. Thanks,
>>>
>>> I think it would mostly solve userspace confusion, as userspace only
>>> sees mounts and the device names.
>>>
>>> But please fix this up properly instead of propagating the effects of
>>> the nasty btrfs hack that should never have been merged in that form
>>> further up the stack.
>>
>> Can one of you explain how this solves the problem that userspace is getting
>> different devices for the same inode?
>>
>> Seriously, I've been looking into it and I'm a bit lost. I followed the
>> conversation until here but I don't see how any of the proposed changes
>> actually *fix* anything? Also, what is the relationship between vfsmounts
>> and sb today? Wouldn't a bind mount produce the situation of more than 1
>> vfsmount per sb that is described above?
>>
>> Sincerely, someone who would like to fix this ABI breakage that has been
>> going on for years.
>
> And let me restate the problem so we're all on the same page.
>
> Btrfs has subvolumes, completely separate trees within the file system. These
> trees get their own object numbering, which in turn is how we do our inode
> numbers. So if you have multiple subvolumes, they will likely have the same
> inode numbers within the same file system. This screws up things like rsync
> which say "hey look, these two inodes are the same, let's skip them." So we have
> an anonymous dev so we can make them look different.
>
> Now if we were to make each subvol its own vfsmount (essentially a bind mount)
> and remove the anonymous device that wouldn't fix the problem _at all_. The
> file system would appear to be the same to rsync and it wouldn't back stuff up.
> So we still need some way of telling userspace that this object is different.
>
> I'm not convinced vfsmounts is the way to do this, it doesn't do anything other
> than add a whole lot of complexity to our mounting/subvolume mechanism that is
> already relatively complex. Thanks,

Agreed. It's hugely wasteful as well. We can have thousands of
subvolumes even on modest systems like workstations when automated
snapshots are involved. Using a vfsmount for each subvolume would make
/proc/mounts pretty useless. Having a separate superblock for each one,
at 1k a pop, would waste a ton of memory considering that they'll be
identical except for the dev_t.

The only way vfsmounts would work is if we added a dev_t there, which
would usually be set to ->mnt_sb->s_dev except for the btrfs case. That
still doesn't solve the polluted /proc/mounts, though.

-Jeff

--
Jeff Mahoney
SUSE Labs
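
For readers following the thread, here is a rough sketch of the userspace check that the restated problem is about. It assumes a tool decides "same file" by comparing the (st_dev, st_ino) pair from stat(2); it is not rsync's actual code, and the helper names file_identity, looks_like_same_file and demo are made up for illustration.

```python
import os
import tempfile


def file_identity(path):
    """Return the (st_dev, st_ino) pair for a path.

    Assumption: this pair is what a backup tool uses to decide whether
    two paths refer to the same underlying file.
    """
    st = os.lstat(path)
    return (st.st_dev, st.st_ino)


def looks_like_same_file(path_a, path_b):
    """True if both paths report the same device and inode number."""
    return file_identity(path_a) == file_identity(path_b)


def demo():
    """Compare a file with its hard link and with an unrelated file."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "original")
        hardlink = os.path.join(d, "hardlink")
        other = os.path.join(d, "other")
        with open(original, "w") as f:
            f.write("data")
        os.link(original, hardlink)  # second name for the same inode
        with open(other, "w") as f:
            f.write("data")
        return (
            looks_like_same_file(original, hardlink),
            looks_like_same_file(original, other),
        )


if __name__ == "__main__":
    print(demo())
```

With a check like this, two unrelated files in different subvolumes that share an inode number are told apart only by st_dev. That is the job the anonymous device does today, and why dropping it without a replacement would make such files look identical.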