Re: [PATCH][RESEND] vfs: allow /proc/PID/maps to get device from stat

Jeff Mahoney <jeffm@xxxxxxxx> · Tue, 10 Sep 2013 17:21:54 -0400

On 9/10/13 11:56 AM, Josef Bacik wrote:
> On Tue, Sep 10, 2013 at 08:36:55AM -0700, Mark Fasheh wrote:
>> On Mon, Aug 12, 2013 at 04:47:52AM -0700, Christoph Hellwig wrote:
>>> On Thu, Aug 08, 2013 at 11:44:54AM -0400, Josef Bacik wrote:
>>>> On Thu, Aug 08, 2013 at 06:48:05AM -0700, Christoph Hellwig wrote:
>>>>> On Thu, Aug 08, 2013 at 09:02:07AM -0400, Josef Bacik wrote:
>>>>>> This won't work, try having 10000 subvolumes with dirty inodes and do sync then
>>>>>> go skiing, you'll have time :).  Thanks,
>>>>>
>>>>> Why would the dirty inodes make any difference?  If you share the bdi
>>>>> between the subvolumes the sync workflow should be exactly the same
>>>>> still.
>>>>>
>>>>
>>>> If we could disentangle vfsmounts from sb's and have it so you could have
>>>> multiple vfsmounts with just one sb that would solve at least the in-kernel
>>>> confusion, but I think we still have the userspace confusion.  Thanks,
>>>
>>> I think it would mostly solve userspace confusion, as userspace only
>>> sees mounts and the device names.
>>>
>>> But please fix this up properly instead of propagating the effects of
>>> the nasty btrfs hack that should never have been merged in that form
>>> further up the stack.
>>
>> Can one of you explain how this solves the problem that userspace is getting
>> different devices for the same inode?
>>
>> Seriously, I've been looking into it and I'm a bit lost. I followed the
>> conversation until here but I don't see how any of the proposed changes
>> actually *fix* anything? Also, what is the relationship between vfsmounts
>> and sb today? Wouldn't a bind mount produce the situation of more than 1
>> vfsmount per sb that is described above?
>>
>> Sincerely, someone who would like to fix this ABI breakage that has been
>> going on for years.
> 
> And let me restate the problem so we're all on the same page.
> 
> Btrfs has subvolumes, completely separate trees within the file system.  These
> trees get their own object numbering, which in turn is how we do our inode
> numbers.  So if you have multiple subvolumes, they will likely have the same
> inode numbers within the same file system.  This screws up things like rsync
> which say "hey look, these two inodes are the same, let's skip them."  So we have
> an anonymous dev so we can make them look different.
> 
> Now if we were to make each subvol its own vfsmount (essentially a bind mount)
> and remove the anonymous device that wouldn't fix the problem _at all_.  The
> file system would appear to be the same to rsync and it wouldn't back stuff up.
> So we still need some way of telling userspace that this object is different.
> 
> I'm not convinced vfsmounts is the way to do this, it doesn't do anything other
> than add a whole lot of complexity to our mounting/subvolume mechanism that is
> already relatively complex.  Thanks,
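
For readers following along, the userspace check described above can be
sketched roughly as follows. This is a minimal illustration, not rsync's
actual code: the helper names file_key, same_file and demo are made up
for this example, and the only assumption is that a tool treats the
(st_dev, st_ino) pair from stat() as a file's identity.

```python
import os
import tempfile


def file_key(path):
    """Return the (st_dev, st_ino) pair commonly used as a file's identity."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def same_file(a, b):
    """True when both paths report the same device and inode number."""
    return file_key(a) == file_key(b)


def demo():
    """Compare a file with itself and with a second, distinct file."""
    with tempfile.TemporaryDirectory() as d:
        first = os.path.join(d, "first")
        second = os.path.join(d, "second")
        for p in (first, second):
            with open(p, "w") as f:
                f.write("data")
        return same_file(first, first), same_file(first, second)


if __name__ == "__main__":
    print(demo())
```

If two subvolumes reported the same st_dev, files with equal inode
numbers in each of them would produce equal keys here, which is the
confusion the anonymous dev is there to avoid.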

Agreed. It's hugely wasteful as well. We can have thousands of
subvolumes even on modest systems like workstations when automated
snapshots are involved. Using a vfsmount for each subvolume would make
/proc/mounts pretty useless. Having a separate superblock for each one,
at 1k a pop, would waste a ton of memory considering that they'll be
identical except for the dev_t.

The only way vfsmounts would work is if we added a dev_t there, which
would usually be set to ->mnt_sb->s_dev except for the btrfs case. That
still doesn't solve the polluted /proc/mounts, though.
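
A toy model of that idea, purely as a sketch: this is Python, not kernel
code, and mnt_dev is a hypothetical field name that does not exist in
struct vfsmount. SuperBlock, Mount, reported_dev and identity are
likewise invented for the illustration. The mount reports its own dev
when one is set and falls back to the superblock's s_dev otherwise.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SuperBlock:
    s_dev: int  # device number shared by the whole file system


@dataclass(frozen=True)
class Mount:
    mnt_sb: SuperBlock
    # Hypothetical per-mount override; None means "use the superblock's dev".
    mnt_dev: Optional[int] = None

    def reported_dev(self) -> int:
        """Device that a stat()-like call would report under this mount."""
        return self.mnt_sb.s_dev if self.mnt_dev is None else self.mnt_dev


def identity(mount: Mount, ino: int) -> Tuple[int, int]:
    """The (dev, ino) pair userspace would see for an inode under a mount."""
    return (mount.reported_dev(), ino)


if __name__ == "__main__":
    sb = SuperBlock(s_dev=10)
    ino = 257  # arbitrary inode number present in both subvolumes

    # No override: both subvolume mounts fall back to s_dev and collide.
    plain_a, plain_b = Mount(sb), Mount(sb)
    print(identity(plain_a, ino) == identity(plain_b, ino))

    # Per-mount dev set (the btrfs case): the pairs become distinct.
    sub_a, sub_b = Mount(sb, mnt_dev=40), Mount(sb, mnt_dev=41)
    print(identity(sub_a, ino) == identity(sub_b, ino))
```

The model says nothing about /proc/mounts: every subvolume would still
need its own mount entry for the override to apply.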

-Jeff

-- 
Jeff Mahoney
SUSE Labs
