On 1/8/20 9:06 AM, Vivek Goyal wrote:
> On Wed, Jan 08, 2020 at 09:27:12AM +0200, Amir Goldstein wrote:
>> [-fsdevel,+containers]
>>
>>> On Thu, Apr 18, 2019 at 1:58 PM StuartIanNaylor <rolyantrauts@xxxxxxxxx> wrote:
>>>> Apols to ask here but are there any tools for overlayFS?
>>>>
>>>> https://github.com/kmxz/overlayfs-tools is just about the only thing I
>>>> can find.
>>> There is also https://github.com/hisilicon/overlayfs-progs which
>>> can check and fix overlay layers, but it hasn't been updated in a while.
>>>
>> Hi Vivek (and containers folks),
>>
>> Stuart has pinged me on https://github.com/StuartIanNaylor/zram-config/issues/4
>> to ask about the status of overlayfs offline tools.
>>
>> Quoting my answer here for visibility to more container developers:
>>
>> I have been involved with implementing many overlayfs features in the
>> kernel in the past couple of years
>> (redirect_dir,index,nfs_export,xino,metacopy).
>> All of these features bring benefits to end users, but AFAIK, they are
>> all still disabled by default in containers runtimes (?) because lack
>> of tools support (e.g. migration/import/export). I cannot force anyone
>> to use the new overlayfs features nor to write offline tools support
>> for them.
>>
>> So how can we improve this situation?
>>
>> If the problem is development resources then I've had great experience
>> in the past with OSS internship programs like Google summer of code
>> (GSoC):
>> Organizations, such as Redhat or mobyproject.org, can participate in the program
>> by posting proposals for open source projects.
>> Developers, such as myself, volunteer to mentors projects and students apply
>> to work on them.
>>
>> IIRC, the timeline for GSoC for project proposals in around April. Applying as
>> an organization could be before that.
>>
>> Vivek, since you are the only developer I know involved in containers runtime
>> projects I am asking you, but really its a question for all container developers
>> out there.
>>
>> Are you aware of missing features in containers that could be met by filling the
>> gaps with overlayfs offline tools?
> CCing Dan Walsh as he is taking care of podman and often I hear some of
> the the complaints from him w.r.t what he thinks is missing. This is
> not necessarily related to overlayfs offline tools.
>
> - Unpriviliged mounting of overlayfs.
>
>   He wants to launch containers unpriviliged and hence wants to be able
>   to mount overlayfs without being root in init_user_ns. I think Miklos
>   posted some patches for that but not much progress after that.
>
>   https://patchwork.kernel.org/cover/11212091/
>
> - shiftfs
>
>   As of now they are relying on doing chown of the image but will really
>   like to see the ability to shift uid/gids using shiftfs or using
>   VFS layer solution.
>
> - Overlayfs redirect_dir is not compatible with image building
>
>   redirect_dir is not compatible with image building and I think that's
>   one reason that its not used by default. And as metacopy is dependent
>   on redirect_dir, its not used by default as well. It can be used for
>   running containers though, but one needs to know that in advacnce.
>
>   So it will be good if that's fixed with redirect_dir and metacopy
>   features and then there is higher chance that these features are
>   enabled by default.
>
>   Miklos had some ides on how to tackle the issue of getting diff
>   correctly with redirect_dir enabled.
>
>   https://www.spinics.net/lists/linux-unionfs/msg06969.html
>
>   Having said that, I think Dan Walsh has enabled metacopy by default
>   in podman in certain configurations (for running containers and not
>   for building images).
>
> Thanks
> Vivek

Amir,

Vivek did an excellent job of describing what we are attempting to do
with OverlayFS in container tools. My work centers around
github.com/containers, specifically podman (libpod), Buildah, CRI-O,
Skopeo, containers/storage and containers/image.
The Podman tool is our most popular tool and runs containers with the
metacopy (metadata-only copy-up) feature turned on by default, in at
least Fedora and soon in RHEL8. I am not sure whether it is turned on by
default in Debian, Ubuntu, openSUSE or other distros.

One of the biggest features of these container engines (runtimes) is
that Podman and Buildah can run rootless, using the user namespace. But
sadly we cannot use overlayfs for this, since mounting overlayfs
requires CAP_SYS_ADMIN. As Vivek points out, Miklos is working to fix
this. For now we use a FUSE version of overlay called fuse-overlayfs,
which can run rootless, but might not give us as good performance as
kernel overlayfs.

The biggest feature I want to push for in container technologies is
better support for user namespaces. I want to use them for container
separation, i.e. each container would run in a different user namespace.
This means that root in one container would have a different UID than
root in another container. Currently almost no one uses user namespaces
for this kind of separation.

The difficulty is that the kernel does not support a shifting file
system, so if I want to share the same base image (the lower directory)
between multiple containers in different user namespaces, the UIDs end
up wrong. We have hoped for a shifting file system for many years, but
overlayfs has never developed one (fuse-overlayfs has some support for
it). There is an effort in the kernel now to add a shifting file system,
but I would bet this will take a long time to get implemented.

The other option, which we have built into our container engines, is a
"chowning" image. Basically, when a new container is started in a new
user namespace, the container engine chowns the lower level to match the
new user namespace and then sets up an overlay mount. If the same image
is used a second time, the container engine is smart enough to reuse the
"chowned" image. This chowning causes two problems on traditional
overlay systems.
First, it is slow, since it copies up all of the lower files to a new
upper. Second, the kernel now sees each executable/shared library as
being different, so process/memory sharing is broken in the kernel. This
means I can run fewer containers on a system due to memory.

The metacopy feature of overlay solves both of these issues. This is why
we turn it on by default in Podman. If I run podman in a new user
namespace, instead of taking 30 seconds to chown the file system, it now
takes < 2 seconds.

Sadly, still almost no one is using user-namespace-separated containers,
because they are not on by default. The issue is that users need to pick
unique ranges of UIDs for each container they create/launch, and almost
no one does.

I would propose that we fix this by making Podman do it by default. The
idea would be to allocate 2 billion UIDs on a system and then have
podman pick a range of 65K UIDs for each container it creates that runs
as root. containers/storage would keep track of the selection. This
would cause the chowning to happen every time a container is launched,
so I would like to continue to focus on the speed of chowning.

https://github.com/rhatdan/tools/chown.go is an effort to create a
better tool for chowning that takes advantage of multithreading. I would
like to get this functionality into containers/storage to get container
start times < 1 second, if possible.

These features are currently on the back burner and could be a good
project for a GSoC student.

>
>> Are you a part of an organization that could consider posting this sort of
>> project proposals to GSoC or other internship programs?
>>
>> Thanks,
>> Amir.
>>
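
For illustration only, here is a minimal Go sketch of the range
allocation proposed above: a pool of 2 billion UIDs carved into 65K
ranges, with the selection tracked per container. Every name in it
(Allocator, IDRange, Allocate, Release) is hypothetical and is not the
containers/storage API. The starting UID of 100000 is an assumption,
and "65K" is taken to mean 65536. A real implementation would also need
to persist the table and lock it across processes.

```go
package main

import (
	"errors"
	"fmt"
)

// IDRange is a contiguous block of host UIDs assigned to one container.
type IDRange struct {
	Start uint32
	Size  uint32
}

// Allocator hands out non-overlapping, fixed-size ranges from a pool
// and remembers which container holds which range (in memory only).
type Allocator struct {
	base      uint32
	rangeSize uint32
	slots     uint32            // number of whole ranges that fit in the pool
	used      map[uint32]string // slot index -> container ID
	byID      map[string]uint32 // container ID -> slot index
}

// ErrPoolExhausted is returned when every range in the pool is taken.
var ErrPoolExhausted = errors.New("no free UID range left in pool")

// NewAllocator splits poolSize UIDs starting at base into rangeSize blocks.
func NewAllocator(base, poolSize, rangeSize uint32) *Allocator {
	return &Allocator{
		base:      base,
		rangeSize: rangeSize,
		slots:     poolSize / rangeSize,
		used:      make(map[uint32]string),
		byID:      make(map[string]uint32),
	}
}

// Allocate returns the range already held by id, or the lowest free one.
func (a *Allocator) Allocate(id string) (IDRange, error) {
	if slot, ok := a.byID[id]; ok {
		return a.rangeFor(slot), nil
	}
	for slot := uint32(0); slot < a.slots; slot++ {
		if _, taken := a.used[slot]; !taken {
			a.used[slot] = id
			a.byID[id] = slot
			return a.rangeFor(slot), nil
		}
	}
	return IDRange{}, ErrPoolExhausted
}

// Release frees the range held by id so a later container can reuse it.
func (a *Allocator) Release(id string) {
	if slot, ok := a.byID[id]; ok {
		delete(a.used, slot)
		delete(a.byID, id)
	}
}

func (a *Allocator) rangeFor(slot uint32) IDRange {
	return IDRange{Start: a.base + slot*a.rangeSize, Size: a.rangeSize}
}

func main() {
	// Assumed values: pool starts at UID 100000 and spans 2 billion UIDs.
	a := NewAllocator(100000, 2000000000, 65536)
	for _, id := range []string{"ctr-a", "ctr-b"} {
		r, err := a.Allocate(id)
		if err != nil {
			fmt.Println("allocation failed:", err)
			return
		}
		fmt.Printf("%s: host UIDs %d-%d\n", id, r.Start, r.Start+r.Size-1)
	}
}
```

Handing out the lowest free slot keeps the sketch short. It also means
a released range is reused by the next container, which would only
avoid a re-chown if the engine paired the range with the same image.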