Re: Inode limitation for overlayfs

Amir Goldstein <amir73il@xxxxxxxxx> · Fri, 27 Mar 2020 12:45:37 +0300

On Fri, Mar 27, 2020 at 8:18 AM Chengguang Xu <cgxu519@xxxxxxxxxxxx> wrote:
>
>  ---- On Thu, 2020-03-26 15:34:13 Amir Goldstein <amir73il@xxxxxxxxx> wrote ----
>  > On Thu, Mar 26, 2020 at 7:45 AM Chengguang Xu <cgxu519@xxxxxxxxxxxx> wrote:
>  > >
>  > > Hello,
>  > >
>  > > In the container use case, to prevent particular containers from exhausting inodes on the host file system, we would like to add an inode limit for containers.
>  > > However, the current solution for inode limits is based on project quota in the specific underlying filesystem, so it will also count deleted files(char type files) in overlay's upper layer.
>  > > Even worse, users may delete some lower layer files to get more usable free inodes, but the result will be the opposite (consuming more inodes).
>  > >
>  > > This is somewhat different from disk size limits for overlayfs, so I think maybe we can add a limit option just for new files in overlayfs. What do you think?

You are saying above that the goal is to prevent inode exhaustion on the
host file system, but you want to allow containers to modify and delete an
unlimited number of lower files, thus allowing inode exhaustion.
I don't see the logic in that.

Even if we only count new files and present this information in df -i,
how would users be able to free up inodes when they hit the limit?
How would they know which inodes to delete?

>  >
>  > The questions are where do we store the accounting and how do we maintain them.
>  > An answer to those questions could be - in the inode index:
>  >
>  > Currently, with nfs_export=on, there is already an index dir containing:
>  > - 1 hardlink per copied-up non-dir inode
>  > - 1 directory per copied-up directory
>  > - 1 whiteout per whiteout in upperdir (not a hardlink)
>  >
>
> Hi Amir,
>
> Thanks for the quick response and detailed information.
>
> I think the simplest way is to just store the accounting info in memory (maybe in s_fs_info).
> At first, I was only thinking of the container use case. For containers, that will be
> enough because the upper layer is always empty at start time and is destroyed
> at end time.

That is not a concept that overlayfs is currently aware of.
*If* the concept is acceptable and you do implement a feature intended for this
special use case, you should verify at mount time that upperdir is empty.
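
To illustrate what such an emptiness check amounts to, here is a minimal
userspace sketch. It is not overlayfs code; an in-kernel check at mount
time would look different. The function name dir_is_empty() is made up
for this example.

```c
#include <dirent.h>
#include <errno.h>
#include <string.h>

/*
 * Userspace illustration only (hypothetical helper, not overlayfs code).
 * Returns 1 if the directory at `path` has no entries other than "." and
 * "..", 0 if it has at least one entry, and -1 (errno set) on error.
 */
int dir_is_empty(const char *path)
{
        DIR *dir = opendir(path);
        struct dirent *de;
        int empty = 1;
        int saved_errno;

        if (!dir)
                return -1;

        errno = 0;
        while ((de = readdir(dir)) != NULL) {
                if (strcmp(de->d_name, ".") == 0 ||
                    strcmp(de->d_name, "..") == 0)
                        continue;
                empty = 0;      /* found a real entry, stop scanning */
                break;
        }
        if (de == NULL && errno != 0)
                empty = -1;     /* readdir() itself failed */

        saved_errno = errno;
        closedir(dir);
        errno = saved_errno;
        return empty;
}
```

A container manager could run the same kind of check on upperdir before
mounting and refuse to apply the in-memory accounting if it returns 0.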

>
> Adding meta info to the index dir is a better solution for the general use case, but it seems
> more complicated, and I'm not sure whether other use cases are concerned with this problem.
> Any suggestions?

Docker already supports container storage quota using project quotas
on upperdir (I implemented it).
It seems like a very natural extension to also limit the number of inodes.
The problem, as you wrote above, is that project quotas
"will also count deleted files(char type files) in overlay's upper layer."
My suggestion to you was a way to account for the whiteouts separately,
so that you may deduct them from the total inode count.
If you are saying my suggestion is complicated, perhaps you did not
understand it.
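
As a rough sketch of that deduction, using the formula from my previous
mail (quoted further down): the struct and function names here are
hypothetical, and overlayfs does not export these counters today, so the
index counts are assumed to come from somewhere else (for example, a scan
of the index dir).

```c
#include <stdint.h>

/* Hypothetical counters; overlayfs does not export these today. */
struct ovl_index_counts {
        uint64_t nondir;    /* index hardlinks of copied-up non-dir inodes */
        uint64_t dir;       /* index directories of copied-up directories */
        uint64_t whiteout;  /* index whiteouts */
};

/*
 * Rearranges the formula from the thread:
 *   total = pure_upper + nondir + 2*dir + 2*whiteout
 * `quota_total` is the inode usage reported by project quota for
 * upper/work dir. Returns the pure upper inode count, clamped to 0
 * if the counters are inconsistent with the quota total.
 */
uint64_t ovl_pure_upper_inodes(uint64_t quota_total,
                               const struct ovl_index_counts *c)
{
        uint64_t overhead = c->nondir + 2 * c->dir + 2 * c->whiteout;

        return quota_total > overhead ? quota_total - overhead : 0;
}
```

For example, with a quota total of 100 inodes and index counts of 10
non-dir, 3 dir and 5 whiteout entries, the overhead is 10 + 6 + 10 = 26,
leaving 74 pure upper inodes.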

>
>
>  > We can also make this behavior independent of the nfs_export feature.
>  > In the past, I proposed the option index=all for this behavior.
>  >
>  > On mount, in ovl_indexdir_cleanup(), the index entries for file/dir/whiteout
>  > can be counted and then maintained on index add/remove.
>  >
>  > Now if you combine that with project quotas on upper/work dir, you get:
>  > <Total upper/work inodes> = <pure upper inodes> + <non-dir index count> +
>  >                             2*<dir index count> + 2*<whiteout index count>
>
> I'm not clear on the exact relationship between those indexes and nfs_export

The nfs_export feature requires index_all, but we did not have a reason (yet)
to add an option to enable index_all without enabling nfs_export:

/* Index all files on copy up. For now only enabled for NFS export */
bool ovl_index_all(struct super_block *sb)
{
        struct ovl_fs *ofs = sb->s_fs_info;

        return ofs->config.nfs_export && ofs->config.index;
}

> but if possible I hope to have separate switches for every index function and a total
> switch (index=all) to enable all index functions at the same time.
>

FYI, index_all stands for "index all modified/deleted lower files (and dirs)".
At the moment, the only limitation of nfs_export=on that could be relaxed with
index=all is that nfs_export=on is mutually exclusive with metacopy=on.
index=all will not have this limitation.

>  >
>  > Assuming that you know the total from project quotas and the index counts
>  > from overlayfs, you can calculate total pure upper.
>  >
>  > Now you *can* implement upper inodes quota within overlayfs, but you
>  > can also do that without changing overlayfs at all assuming you can
>  > allow some slack in quota enforcement -
>  > periodically scan the index dir and adjust project quota limits.
>
> Dynamically changing the inode limit looks too complicated to implement in the management system,
> and having different quota limits during the lifetime of the same container may confuse sysadmins.
> So I still hope to solve this problem at the overlayfs layer.
>

To me that sounds like shifting complexity from the system to the kernel
without a good enough reason and while losing flexibility.
You are proposing a heuristic solution anyway, because it is inherently
not immune to DoS by a malicious container that does rm -rf *.
So to me it makes more sense to deal with that logic at the container
management level, where more heuristics can be applied, for example:
allow adding up to X new files, modifying Y% of the files from lower and
deleting Z% of the files from lower.
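
A minimal sketch of how a management tool might turn such a policy into a
project quota inode limit. Everything here is hypothetical: the names, and
in particular the per-operation inode costs, which assume the index
accounting from the formula quoted earlier (1 upper inode per copied-up
file, 2 inodes per whiteout). Copied-up directories are ignored for
simplicity.

```c
#include <stdint.h>

/* Assumed inode costs, taken from the formula quoted in this thread. */
#define COST_PER_COPY_UP   1   /* index entry is a hardlink, no extra inode */
#define COST_PER_WHITEOUT  2   /* upper whiteout + index whiteout */

/* Hypothetical policy: X new files, Y% modified, Z% deleted from lower. */
struct inode_policy {
        uint64_t max_new;      /* X: new (pure upper) files allowed */
        unsigned pct_modify;   /* Y: percent of lower files that may be modified */
        unsigned pct_delete;   /* Z: percent of lower files that may be deleted */
};

/* Inode limit to set on the project quota for a given number of lower files. */
uint64_t policy_inode_limit(const struct inode_policy *p, uint64_t lower_files)
{
        uint64_t modified = lower_files * p->pct_modify / 100;
        uint64_t deleted = lower_files * p->pct_delete / 100;

        return p->max_new +
               COST_PER_COPY_UP * modified +
               COST_PER_WHITEOUT * deleted;
}
```

With X = 1000, Y = 20, Z = 10 and 5000 lower files, this gives
1000 + 1000 + 2 * 500 = 3000 inodes.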

Note that container management does not have to adjust project quota
limits periodically; it only needs to recalculate and adjust project quota
limits when the user gets an out-of-quota warning.
I believe there are already mechanisms in Linux quota management to
notify management software of quota limit expiry in order to take action,
but I am not that familiar with those mechanisms.

Thanks,
Amir.