On Fri, Jun 5, 2020 at 11:41 AM Pavel Tikhomirov <ptikhomirov@xxxxxxxxxxxxx> wrote: > > > > On 6/5/20 5:35 AM, Amir Goldstein wrote: > > On Fri, Jun 5, 2020 at 12:34 AM Alexander Mikhalitsyn > > <alexander.mikhalitsyn@xxxxxxxxxxxxx> wrote: > >> > >> Hello, > >> > >>> But overlayfs won't accept these "output only" options as input args, > >> which is a problem. > >> > >> Will it be problematic if we simply ignore "lowerdir_mnt_id" and "upperdir_mnt_id" options in ovl_parse_opt()? > >> > > > > That would solve this small problem. > > This is not a big problem actually as these options shown in mountinfo > for overlay had been "output only" forever, > please see these two examples below: > > a) Imagine you've mounted overlay with relative paths and forgot (or you > never known as you are another user) where your cwd was at the moment of > mount syscall. - How would you use those options as "input" to create > the same overlay mount somethere else (bind-mounting not involved)? > > b) Imagine you've mounted overlay with absolute paths and someone (other > user) overmounted lower (upper/workdir) paths for you, all directory > structure would be the same on overmount but yet files are different. - > How would you use those options from mountinfo as "input" ones? > > We try to make them much closer to "input" ones. That is not what I meant by "output only" I meant invalid input options as in EINVAL not ENOENT > > Agreed, we should ignore *_mnt_id on mount because paths identify mounts > at the time of mount call. > > > > >>> Wouldn't it be better for C/R to implement mount options > >> that overlayfs can parse and pass it mntid and fhandle instead > >> of paths? >> > >> Problem is that we need to know on C/R "dump stage" which mounts are used on lower layers and upper layer. Most likely I don't understand something but I can't catch how "mount-time" options will help us. > > > > As you already know from inotify/fanotify C/R fhandle is timeless, so > > there would be no distinction between mount time and dump time. > > Pair of fhandle+mnt_id looks an equivalent to path+mnt_id pair, CRIU > will just need to open fhandle+mnt_id with open_by_handle_at and > readlink to get path on dump and continue to use path+mnt_id as before. > (not too common with fhandles but it's my current understanding) > > But if you take a look on (a) and (b) again, the regular user does not > see full information about overlay mount in /proc/pid/mountinfo, they > can't just take a look on it and understand from there it comes from. > Resolving fhandle looks like a too hard task for a user. > Right, and we need to provide regular user needs in parallel to providing C/R needs. understood. > > About mnt_id, your patches will cause the original mount-time mounts to be busy. > > That is a problem as well. > > Children mounts lock parent, open files lock parent. Another analogy is > a loop device which locks the backing file mount (AFAICS). Anyway one > can lazy umount, can't they? But I'm not too sure for this one, maybe > you can share more implications of this problem? > Overlayfs mounts are internal not children mounts in the namespace, so no open files hold reference the mounts in the namespace (AFAICS). This use case will break: mount /dev/vdf /vdf mkdir /vdf/{l,u,w} /tmp/m mount -t overlay overlay /tmp/m -o lowerdir=/vdf/l,upperdir=/vdf/u,workdir=/vdf/w umount /vdf Yes users can lazy unmount, the filesystem itself on /dev/vdf is not actually unmounted, only the /vdf mount goes away from the namespace, but the use case without lazy unmount will still break. Maybe its fine since distro/admin needs to opt-in for this change of behavior. I have to wonder though, why did you add two different config/module options for this feature? Yes, its two different sub-functionalities, but which real user (not imaginary one) will really turn on just half the feature? While at it, you copy pasted the text: For more information, see Documentation/filesystems/overlayfs.txt but there is no more information to be found. > > > > I think you should describe the use case is more details. > > Is your goal to C/R any overlayfs mount that the process has open > > files on? visible to process > We wan't to dump a container, not a simple process, if the container > process has access to some resource CRIU needs to restore this resource. > > Imagine the process in container mounts it's own overlay inside > container, for instance to imulate write access to readonly mount or > just to implement some snapshots, don't know exact use case. And we want > to checkpoint/restore this container. (Currently CRIU only supports > overlay as external mount, e.g. for docker continers docker engine > pre-creates overlay for us and we just bind from it - it's a different > case.) If the in-container process creates the in-container mount we > need to recreate it on restore so that the in-container view of the > filesystem persists. > Understood. but how do you *know* which mounts the container created and need to be migrated? Which loop devices the user has created? As opposed to the ones that docker engine re-created? It is the found from diff between mountinfo of process and host? > > For NFS export, we use the persistent descriptor {uuid;fhandle} > > (a.k.a. struct ovl_fh) to encode > > an underlying layer object. > > > > CRIU can look for an existing mount to a filesystem with uuid as restore stage > > (or even mount this filesystem) and use open_by_handle_at() to open a > > path to layer. > > On restore we can be on another physical node, so I doubt we have same > uuid's, sorry I don't fully understand here already. > I see, so what about inotify/fanotify? fhandle and uuid can be looked up/resolved to mnt/path at "dump" time. The difference between mnt_id/uuid is who keeps the reference on the mount. If overlayfs provides uuid, then you rely on docker to keep the reference on the mount and use the reference-less uuid to find the mount that docker is holding for you. > > After mounting overlay, that mount to underlying fs can even be discarded. > > > > And if this works for you, you don't have to export the layers ovl_fh in > > /proc/mounts, you can export them in numerous other ways. > > One way from the top of my head, getxattr on overlay root dir. > > "trusted.overlay" xattr is anyway a reserved prefix, so "trusted.overlay.layers" > > for example could work. > > Thanks xattr might be a good option, but still don't forget about (a) > and (b), users like to know all information about mount from > /proc/pid/mountinfo. > Let's stick to your use cases requirements. If you have other use cases for this functionality lay them out explicitly. I went to see what losetup does and I see that LOOP_SET_STATUS ioctl stores a path string that LOOP_GET_STATUS gets back in return. Does not seem C/R friendly either. Are you not handling loop devices? It's strange because loop driver keeps an open backing file so LOOP_GET_STATUS could have returned the uptodate path. Thanks, Amir.