Re: [LSF/MM TOPIC] [ATTEND] Container disk quota and lseek(2) upon shared extents

Jeff Liu <jeff.liu@xxxxxxxxxx> · Wed, 30 Jan 2013 11:49:11 +0800



On 01/30/2013 03:19 AM, Jan Kara wrote:
>   Hi Jeff,
> 
> On Wed 30-01-13 00:37:08, Jeff Liu wrote:
>> On 01/29/2013 11:14 PM, Jan Kara wrote:
>>>   Hello,
>>>
>>> On Tue 29-01-13 22:44:24, Jeff Liu wrote:
>>>> I'd like to discuss the following problems on LSF:
>>>>
>>>> - Container UID/GID quota support
>>>> More than half a year ago, I posted a patch set to support UID/GID
>>>> quota inside containers:
>>>> http://www.spinics.net/lists/linux-containers/msg25393.html
>>>>
>>>> However, I had to put it on ice at that time since this feature depends on
>>>> user namespaces.  Now I think it's time to bring it up again because user_ns was
>>>> basically done in 3.8-rcX.
>>>>
>>>> Combined with user_ns, there are a couple of issues that need to be solved first:
>>>> 1) UID/GID mapping between the global and container quota files.
>>>> In my previous implementation the quotas were cached in memory, which is clearly not
>>>> acceptable.  I'll try to make it work as usual, with journalled quota support.
>>>>  
>>>> 2) To avoid modifying the quota tools, maybe we have to keep quotas enabled all the
>>>> time inside containers, so that the end user only decides whether or not to set quota limits.
>>>>
>>>> 3) Embed the container quota accounting logic into the corresponding VFS quota
>>>> routines and make it transparent to the file systems outside the VFS.
>>>   So now looking into your old submission, your main aim was to make
>>> quota-tools work properly when run from inside a container, right?
>> Right. 
>>> Because quota enforcement works properly once user namespaces are in place. In fact
>>> quota calls such as Q_GETQUOTA or Q_SETQUOTA work correctly as well with
>>> user namespaces. UID/GID translation from namespace id space to the
>>> global space and back is already happening. So what functionality are you
>>> missing?
>> So it looks like there is no need to revisit it. :(
>> Previously I found that we cannot turn quota off inside containers without modifying
>> the quota tools. I am not sure whether this makes sense, or whether it is a fair user
>> requirement.  Anyway, I'll play with user namespaces and the quota tools for further
>> investigation.
>   So turning quotas on/off is a filesystem global action. As such it's hard
> to make it work from containers when you don't have fs-per-container
> setup... Implementing something like per-namespace quota enforcement (i.e.
> only processes from a particular namespace will not be allowed to exceed
> quota) might be reasonably possible though - you would just need to tweak
> sb_has_quota_limits_enabled() function to take also current namespace into
> account.
Yep, let me give it a try.
> 
>>>> - Introduce a new whence to lseek(2) to fetch the reflinked/sharing extents
>>>>
>>>> We have had some user requests to show the real disk footprint of OCFS2 reflinked
>>>> or Btrfs cloned files.  I wrote a shared-du utility based on du(1) for OCFS2, as
>>>> it was the only file system with reflink support at that time:
>>>> https://oss.oracle.com/pipermail/ocfs2-devel/2010-September/007293.html
>>>   But this is a tough problem, isn't it? You have to minimally cache some
>>> info about *every* file du(1) was called on so that you can check whether
>>> two files share some extents or not. I'm not saying it isn't a useful
>>> functionality, just I'd like to verify we are on the same page.
>> Yes, from user land I have to cache the shared extent info and
>> iterate over the cached items to check whether the next one to be cached
>> already exists.  If it exists, increase its count and check the
>> next one; otherwise, cache it. This step repeats
>> until all the files residing on the target partition/directories have been
>> checked.
>   Yes, that's what I'd imagine.
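For illustration, the user-space bookkeeping described above could be sketched as follows. This is a minimal sketch in Python rather than the C that du(1) would use; the function name disk_footprint() and the (physical, length, shared) tuple format are hypothetical, and it assumes shared extents line up exactly (same physical start and length), which a real tool could not rely on.

```python
def disk_footprint(files):
    """Sum apparent and de-duplicated sizes over a set of files.

    `files` maps a file name to a list of (physical, length, shared)
    tuples, a hypothetical format standing in for FIEMAP output.
    """
    seen = {}      # (physical, length) -> number of references found so far
    apparent = 0   # what a naive du(1) would report
    real = 0       # shared extents counted once
    for extents in files.values():
        for physical, length, shared in extents:
            apparent += length
            if not shared:
                real += length
                continue
            key = (physical, length)
            if key in seen:
                # Already cached: only bump the reference count.
                seen[key] += 1
            else:
                # First sighting: cache it and charge its space once.
                seen[key] = 1
                real += length
    return apparent, real, seen
```

With file A holding one shared 1 MiB extent and file B cloned from it, apparent comes out as twice real, which is the double counting discussed further down in this thread. Partially overlapping shared ranges would need interval merging instead of the exact-key lookup used here.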
> 
>>>> It is based on the FIEMAP ioctl(2) in user space, and OCFS2 uses the FIEMAP_EXTENT_SHARED
>>>> flag to indicate that an extent is reflinked/COW when the internal OCFS2_EXT_REFCOUNTED
>>>> flag is detected.
>>>>
>>>> Recently, I have started to implement this feature on Btrfs with a similar approach.
>>>> Once it is complete, the next step is to teach upstream du(1) to work with both file
>>>> systems via a new command option.
>>>>
>>>> That still sounds unnecessary because we have FIEMAP... :( But considering how awkward
>>>> and error-prone the interface was when I improved cp(1) with it for sparse files, this would extend
>>>> the ugly tentacles of FIEMAP into du(1) as well, which the coreutils maintainer (Jim, CC-ed)
>>>> doesn't like at all, and which I also want to avoid if possible...
>>>>
>>>> How about adding a new whence type to lseek(2) for this?  lseek has a very clear
>>>> interface and works very well for SEEK_DATA/SEEK_HOLE, so it could most likely work fine for
>>>> shared extents too, IMHO.
>>>   Well, I can hardly imagine how such lseek(2) interface would look to be
>>> useful for identifying shared extents among different files. Do you have
>>> something particular in mind?
>> lseek(2) would not be used for identifying shared extents among files.  It
>> would be extended so that, called with a particular whence, it finds and returns an
>> extent which is reflinked or cloned; the underlying
>> file systems would have to be extended accordingly.
>>
>> Take Btrfs: if we perform btrfs_ioctl_clone from source file A to
>> target B and run du(1) against both files, it shows double the space,
>> although only half of that is really used/reserved thanks to COW.
>>
>> If we can mark the cloned extents of a file with a special flag (say
>> EXTENT_MAP_CLONED), then calling lseek(fd, offset, SEEK_CLONE or ?)
>> would return the offset of the first cloned extent at or beyond the
>> given offset, so we can find all the cloned extents of a file, which
>> can then be used for disk space accounting in user space tools.
>   OK, but then you have to call FIEMAP anyway to find which blocks are
> underlying the extent so that you can match that with cloned extents from
> different files.
Yes, I have to call FIEMAP, as user space ends up checking the
physical offset of the start of an extent. :(
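To make the FIEMAP step concrete, here is a rough sketch of fetching the physical offsets and picking out the extents flagged FIEMAP_EXTENT_SHARED. It is written in Python for brevity; the constants and struct layouts follow <linux/fiemap.h> and <linux/fs.h>, while the helper names parse_fiemap() and shared_extents() are made up for this example. It assumes the file has at most max_extents extents; a real tool would loop until it sees FIEMAP_EXTENT_LAST.

```python
import fcntl
import struct

FS_IOC_FIEMAP = 0xC020660B       # _IOWR('f', 11, struct fiemap)
FIEMAP_FLAG_SYNC = 0x0001        # sync the file before mapping
FIEMAP_EXTENT_LAST = 0x0001      # last extent in the file
FIEMAP_EXTENT_SHARED = 0x2000    # extent is shared with other files

# struct fiemap header: fm_start, fm_length, fm_flags,
# fm_mapped_extents, fm_extent_count, fm_reserved (32 bytes).
HEADER = struct.Struct("=QQIIII")
# struct fiemap_extent: fe_logical, fe_physical, fe_length,
# fe_reserved64[2], fe_flags, fe_reserved[3] (56 bytes).
EXTENT = struct.Struct("=QQQQQIIII")


def parse_fiemap(buf):
    """Decode a filled-in fiemap buffer into (logical, physical, length, flags)."""
    mapped = HEADER.unpack_from(buf, 0)[3]
    extents = []
    for i in range(mapped):
        fields = EXTENT.unpack_from(buf, HEADER.size + i * EXTENT.size)
        logical, physical, length = fields[0], fields[1], fields[2]
        flags = fields[5]
        extents.append((logical, physical, length, flags))
    return extents


def shared_extents(fd, max_extents=256):
    """Return the extents of fd that the file system reports as shared."""
    buf = bytearray(HEADER.size + max_extents * EXTENT.size)
    # Map the whole file: start at 0, length = maximum offset.
    HEADER.pack_into(buf, 0, 0, 0xFFFFFFFFFFFFFFFF,
                     FIEMAP_FLAG_SYNC, 0, max_extents, 0)
    fcntl.ioctl(fd, FS_IOC_FIEMAP, buf)   # the kernel fills buf in place
    return [e for e in parse_fiemap(buf) if e[3] & FIEMAP_EXTENT_SHARED]
```

The physical offset and length of each returned extent are what user space would compare across files; the ioctl fails with an error on file systems that do not implement FIEMAP.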
> Ah, and the advantage would be that you don't have to
> cache *all* the extents but only those that are reported as reflinked.
Yess!
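
For comparison, this is how the existing SEEK_DATA/SEEK_HOLE pair is walked today; a new whence for cloned extents (the SEEK_CLONE name above is only a placeholder and does not exist in any kernel) would presumably be driven by the same kind of loop. A minimal Python sketch, with data_segments() being a name made up for the example:

```python
import errno
import os


def data_segments(fd):
    """Return the (start, end) byte ranges of fd that contain data."""
    size = os.fstat(fd).st_size
    segments = []
    offset = 0
    while offset < size:
        try:
            # First offset >= `offset` that holds data.
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:
                break          # only a hole remains up to end of file
            raise
        # First hole after that data; end of file counts as a hole.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        segments.append((start, end))
        offset = end
    return segments
```

On file systems without real hole support the whole file is reported as a single data segment, so callers should treat the result as a hint about layout, not an exact block map.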

Thanks,
-Jeff
> OK, now I see.
> 
> 								Honza
> 
