On 01/30/2013 03:19 AM, Jan Kara wrote:
> Hi Jeff,
>
> On Wed 30-01-13 00:37:08, Jeff Liu wrote:
>> On 01/29/2013 11:14 PM, Jan Kara wrote:
>>> Hello,
>>>
>>> On Tue 29-01-13 22:44:24, Jeff Liu wrote:
>>>> I'd like to discuss the following problems at LSF:
>>>>
>>>> - Container UID/GID quota support
>>>> More than half a year ago, I posted a patch set to support UID/GID
>>>> quota inside containers:
>>>> http://www.spinics.net/lists/linux-containers/msg25393.html
>>>>
>>>> However, I had to put it on ice at that time since this feature depends on the
>>>> user namespace. Now I think it's time to bring it up because user_ns was
>>>> basically done in 3.8-rcX.
>>>>
>>>> Combined with user_ns, a couple of issues need to be solved first:
>>>> 1) UID/GID mapping between global and container quota files.
>>>> In my previous implementation, the quotas were cached in memory, which truly cannot
>>>> be accepted at all; I'll try to make it work as usual with journalling quota support.
>>>>
>>>> 2) To avoid modifying the quota tools, maybe we have to keep quotas enabled all the
>>>> time inside containers so that the end user just sets up quota limits or doesn't.
>>>>
>>>> 3) Embed the container quota accounting logic into the corresponding VFS quota
>>>> routines and make it transparent to the outside file systems.
>>> So now looking into your old submission, your main aim was to make
>>> quota-tools work properly when run from inside a container, right?
>> Right.
>>> Because quota enforcement works properly once user namespaces are in place. In fact
>>> quota calls such as Q_GETQUOTA or Q_SETQUOTA work correctly as well with
>>> user namespaces. UID/GID translation from namespace id space to the
>>> global space and back is already happening. So what functionality are you
>>> missing?
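[Editor's sketch] The namespace-to-global translation Jan mentions can be illustrated with the mapping format exposed in /proc/<pid>/uid_map, where each row reads "first ID inside the namespace, first ID outside, count". The Python below is only a sketch of that arithmetic under stated assumptions: the function names and the example mapping are hypothetical, this is not kernel code, and returning None for an unmapped ID is a simplification of what the kernel does.

```python
from typing import List, Optional, Tuple

# One row of a uid_map-style table:
# (first_id_inside_namespace, first_id_outside_namespace, count)
IdMap = List[Tuple[int, int, int]]


def parse_id_map(text: str) -> IdMap:
    """Parse text in the /proc/<pid>/uid_map layout into a list of rows."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(field) for field in line.split())
            rows.append((inside, outside, count))
    return rows


def ns_to_global(id_map: IdMap, ns_id: int) -> Optional[int]:
    """Translate an ID seen inside the namespace to the global ID space."""
    for inside, outside, count in id_map:
        if inside <= ns_id < inside + count:
            return outside + (ns_id - inside)
    return None  # simplification: the ID has no mapping


def global_to_ns(id_map: IdMap, global_id: int) -> Optional[int]:
    """Translate a global ID back to the ID seen inside the namespace."""
    for inside, outside, count in id_map:
        if outside <= global_id < outside + count:
            return inside + (global_id - outside)
    return None  # simplification: the ID has no mapping


if __name__ == "__main__":
    # Hypothetical container mapping: IDs 0..65535 inside map to 100000..165535.
    example = parse_id_map("0 100000 65536\n")
    print(ns_to_global(example, 1000))
    print(global_to_ns(example, 101000))
```

With a mapping like this, a quota lookup for UID 1000 issued inside the container would refer to the quota record of global UID 101000.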
>> So it looks like there is no need to revisit it. :(
>> Previously I found that we cannot turn quota off inside containers without modifying
>> the quota tools. I am not sure whether this makes sense, or whether it is a fair user
>> requirement. Anyway, I'll play with the user namespace and the quota tools for further
>> investigation.
> So turning quotas on/off is a filesystem global action. As such it's hard
> to make it work from containers when you don't have fs-per-container
> setup... Implementing something like per-namespace quota enforcement (i.e.
> only processes from a particular namespace will not be allowed to exceed
> quota) might be reasonably possible though - you would just need to tweak
> sb_has_quota_limits_enabled() function to take also current namespace into
> account.
Yep, let me give it a try.
>
>>>> - Introduce a new whence to lseek(2) to fetch the reflinked/shared extents
>>>>
>>>> We have some user requests about showing the real disk footprint of OCFS2 reflinked
>>>> or Btrfs cloned files. I wrote a shared-du utility based on du(1) for OCFS2, as
>>>> it was the only file system with reflink support at that time:
>>>> https://oss.oracle.com/pipermail/ocfs2-devel/2010-September/007293.html
>>> But this is a tough problem, isn't it? You have to minimally cache some
>>> info about *every* file du(1) was called on so that you can check whether
>>> two files share some extents or not. I'm not saying it isn't a useful
>>> functionality, just I'd like to verify we are on the same page.
>> Yes, from user land, I have to cache the shared extents info, and
>> iterate over the cached items to check whether the next one to be cached
>> already exists or not. If it exists, increase the count and check the
>> next one... otherwise, cache it, and repeat this step again and again
>> until all the files residing on the target partition/directories have been
>> checked.
> Yes, that's what I'd imagine.
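[Editor's sketch] The caching loop Jeff describes can be written down in a few lines of Python. This is an illustration, not the shared-du code: the Extent type and the disk_footprint function are hypothetical names, the extent lists are assumed to come from something like FIEMAP, and shared extents are assumed to match exactly on physical start and length (real reflinked files may share only part of an extent).

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Extent(NamedTuple):
    physical: int  # physical start offset on disk, in bytes
    length: int    # extent length, in bytes
    shared: bool   # True if the file system reported the extent as shared


def disk_footprint(
    files: Iterable[List[Extent]],
) -> Tuple[int, Dict[Tuple[int, int], int]]:
    """Sum the space used by all files, counting each shared extent once.

    Returns the total in bytes and the cache of shared extents, which maps
    (physical, length) to the number of times that extent was seen.
    """
    total = 0
    seen: Dict[Tuple[int, int], int] = {}
    for extents in files:
        for ext in extents:
            if not ext.shared:
                total += ext.length  # private extent: always counted
                continue
            key = (ext.physical, ext.length)
            if key in seen:
                seen[key] += 1       # already cached: bump the count only
            else:
                seen[key] = 1        # first sighting: cache it and count it
                total += ext.length
    return total, seen


if __name__ == "__main__":
    # Hypothetical layout: file B is a clone of file A's first extent.
    file_a = [Extent(4096, 8192, True), Extent(65536, 4096, False)]
    file_b = [Extent(4096, 8192, True)]
    print(disk_footprint([file_a, file_b]))
```

A plain per-file sum would report 20480 bytes for the two example files; counting the shared extent once gives 12288.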
>
>>>> It is based on the FIEMAP ioctl(2) in user space, and OCFS2 uses the FIEMAP_EXTENT_SHARED
>>>> flag to indicate that an extent is reflinked/COW when the internal OCFS2_EXT_REFCOUNTED
>>>> flag is detected.
>>>>
>>>> Recently, I have started to implement this feature on Btrfs with a similar approach.
>>>> Once it is completed, the next thing is to teach upstream du(1) to work for both file
>>>> systems with a new command option.
>>>>
>>>> Still sounds like nothing because we have FIEMAP... :( But considering the bad interface
>>>> and how error prone it was when I improved cp(1) for sparse files through it, this would
>>>> extend the ugly tentacles of FIEMAP into du(1) again, which the maintainer of coreutils
>>>> (Jim, CC-ed) doesn't like at all, and which I also want to avoid if possible...
>>>>
>>>> How about adding a new whence type to lseek(2) for this function? lseek has a very clear
>>>> interface and works very well for SEEK_DATA/SEEK_HOLE; most likely it could work fine for
>>>> shared extents too, IMHO.
>>> Well, I can hardly imagine how such lseek(2) interface would look to be
>>> useful for identifying shared extents among different files. Do you have
>>> something particular in mind?
>> lseek(2) would not be used for identifying shared extents among files. It
>> would be extended so that, called with a particular whence, it finds and
>> returns a desired extent which is reflinked or cloned; the underlying
>> file system should be improved accordingly.
>>
>> Take Btrfs: if we perform btrfs_ioctl_clone from source file A to
>> target B and run du(1) against both files, it would show double the space
>> although only 1/2 of that space is really used/reserved upon COW.
>>
>> If we can mark the cloned extents of a file with a special flag (say
>> EXTENT_MAP_CLONED), then a call to lseek(fd, offset, SEEK_CLONE or ?)
>> would return the offset of a cloned extent which is equal to or beyond the
>> given offset, so we can find all the cloned extents of a file, which
>> would be used for the disk space accounting in user space tools.
> OK, but then you have to call FIEMAP anyway to find which blocks are
> underlying the extent so that you can match that with cloned extents from
> different files.
Yes, I have to call FIEMAP, as user space ends up checking the physical offset of
the start of an extent. :(
> Ah, and the advantage would be that you don't have to
> cache *all* the extents but only those that are reported as reflinked.
Yess!

Thanks,
-Jeff
> OK, now I see.
>
> 								Honza
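[Editor's sketch] SEEK_CLONE is only a proposal in this thread and is not an existing lseek(2) whence, so it cannot be demonstrated directly. The loop shape Jeff has in mind can, however, be shown with the existing SEEK_DATA/SEEK_HOLE pair. The Python below walks the data segments of a file; a cloned-extent walk would presumably look the same with the hypothetical whence substituted, followed by FIEMAP to obtain physical offsets. The function name data_segments is an assumption made for illustration, and the code assumes a platform where os.SEEK_DATA and os.SEEK_HOLE are available (for example Linux).

```python
import errno
import os
import tempfile
from typing import List, Tuple


def data_segments(fd: int) -> List[Tuple[int, int]]:
    """Return the [start, end) byte ranges reported as data for fd.

    Note that lseek moves the file offset of fd as a side effect.
    """
    size = os.fstat(fd).st_size
    segments = []
    offset = 0
    while offset < size:
        try:
            # First offset >= `offset` that contains data.
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:  # no data at or after `offset`
                break
            raise
        # First hole after `start`; end of file counts as an implicit hole.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        segments.append((start, end))
        offset = end
    return segments


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        f.truncate(1 << 20)       # 1 MiB hole, if the file system supports holes
        f.seek(1 << 20)
        f.write(b"x" * 4096)      # 4 KiB of data after the hole
        f.flush()
        print(data_segments(f.fileno()))
```

On a file system without hole support the whole file is reported as a single data segment, so the exact output depends on where the temporary file lives.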