On Sun, Oct 4, 2020 at 10:25 PM Alexander Mikhalitsyn
<alexander.mikhalitsyn@xxxxxxxxxxxxx> wrote:
>
> Some time ago we discussed the problem of checkpoint-restoring
> overlayfs mounts [1]. Big thanks to Amir for review and suggestions.
>
> A brief recap of the previous discussion.
> Problem statement: to checkpoint-restore overlayfs mounts we need
> to save the overlayfs mount state into the image. Basically, this
> state is just the mount options of the overlayfs mount. But here we
> have two problems:
>
> I. during mounting, the user may specify relative paths in the
> upperdir, workdir and lowerdir options
>
> II. the user may also unmount the mounts from which these paths were
> resolved at mount time
>
> These are real problems for us. My first patch was an attempt to
> address both:
> 1. I added a refcount get on the mounts from which overlayfs was
> mounted.
> 2. I changed the overlayfs mountinfo show algorithm, so overlayfs
> started to *always* show full paths for upperdir, workdir and
> lowerdirs.
> 3. I added a show-time-only mnt_id option which allows determining
> from which mnt_id the option paths were opened.
>
> Pros:
> - we can determine full information about the overlayfs mount
> - we hold a refcount on the mount, so the user may unmount source
>   mounts only with the lazy flag
>
> Cons:
> - by taking a refcount on the mount, I changed possible overlayfs
>   use cases
> - by showing *full* paths we more easily reach the PAGE_SIZE limit
>   on mount options in procfs
> - by adding a show-only mnt_id option I introduced an inconsistency
>   between mount-time options and show-time mount options
>
> After a very productive discussion with Amir and Pavel I decided to
> write a new implementation. In the new approach we decided *not* to
> take extra refcounts on mounts. We also decided to use exportfs
> fhandles instead of full paths. To determine the full path we plan
> to use the following algorithm:
> 1. Export {s_dev; fhandle} from overlayfs for *all* sources
> 2. Use the open_by_handle_at syscall to open all these fhandles (we
>    need to determine the mount for each fhandle; it looks like we
>    can do this from s_dev by a linear search in
>    /proc/<pid>/mountinfo)
> 3. Then readlink /proc/<pid>/fd/<opened fd>
> 4. Dump this full path + mnt_id
>

Hi Alex,

The general concept looks good to me.
I will not provide specific comments on the implementation (it looks
fine) until the concept API is accepted by the maintainer.

The main thing I want to make sure of is that if we add this
interface, it can serve other use cases as well.
During my talk at LPC, I got a similar request from two developers
for two different use cases. They wanted a safe method to iterate
"changes since baseline" from either within the container or from the
host. Your proposed API is a step in the direction of meeting their
requirement. The major change is that the ioctl (or whatever method)
should expose the layers topology of a specific object, not only of
the overlay instance. For C/R you would query the layers topology of
the overlay root dir.

My comments on the specific methods below are not meant to object to
the choice of ioctl; they are meant to give the alternatives a fair
chance. I am kind of leaning towards ioctl myself.

> But there is a question: how do we export this {s_dev; fhandle}
> from the kernel to userspace?
> - We decided not to use procfs.

Why not? C/R already uses procfs to export fhandles for
fanotify/inotify.

I kind of like the idea of having /sys/fs/overlay/instances etc.
It could be useful for many things.
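BTW, regardless of which export method we end up choosing, here is
roughly what the userspace consumer side of your steps 2-4 could look
like. This is an untested sketch; fh_to_path() and mount_fd are my
own names, the caller is assumed to have already resolved s_dev to an
fd on the correct mount by scanning /proc/<pid>/mountinfo, and note
that open_by_handle_at() requires CAP_DAC_READ_SEARCH:

#define _GNU_SOURCE
#include <fcntl.h>      /* open_by_handle_at(), struct file_handle */
#include <stdio.h>
#include <unistd.h>

static int fh_to_path(int mount_fd, struct file_handle *fh,
                      char *buf, size_t len)
{
        char link[64];
        ssize_t n;
        /* step 2: open the fhandle relative to the resolved mount */
        int fd = open_by_handle_at(mount_fd, fh, O_PATH);

        if (fd < 0)
                return -1;

        /* step 3: readlink /proc/self/fd/<fd> to recover the path */
        snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
        n = readlink(link, buf, len - 1);
        close(fd);
        if (n < 0)
                return -1;
        buf[n] = '\0';  /* step 4: dump buf along with the mnt_id */
        return 0;
}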
> - Amir proposed a solution - use xattrs.
> But after diving into it I've met a problem: where can I set these
> xattrs?
> If I set these xattrs on overlayfs dentries, then during rsync or
> cp --preserve=xattr we will copy this temporary information.

No you won't. rsync and cp will only copy xattrs listed with
listxattr. Several filesystems, such as cifs and nfs, export "object
properties" via private xattrs that are not listed in listxattr (e.g.
CIFS_XATTR_CIFS_ACL).

You are not limited in what you can do in the "trusted.overlay"
namespace, for example "trusted.overlay.layers.0.fh".
The advantage is that it is very easy to implement and requires less
discussion about ABI, but I agree it does feel a bit like a hack.

> - ioctls? (this patchset implements this approach)
> - the fsinfo subsystem (not merged yet) [2]
>
> Problems with ioctls:
> 1. We are limited in output data size (16 KB AFAIK),
>    but MAX_HANDLE_SZ=128 (bytes) and OVL_MAX_STACK=500 (number of
>    lowerdirs), so MAX_HANDLE_SZ*OVL_MAX_STACK = 64KB, which is
>    bigger than the limit.
>    So I decided to give the user one fhandle per call. This is also
>    bad from the performance point of view.
> 2. When using ioctls we need a *fixed* size of input and output.
>    So if MAX_HANDLE_SZ changes in the future, our
>    _IOR('o', 2, struct ovl_mnt_opt_fh) will also change along with
>    struct ovl_mnt_opt_fh.

The choice of an API with a fixed output size for variable-length
info seems weird.

I am tempted to suggest extending name_to_handle_at(), for example:
name_to_handle_at(ovl_root_fd, path, &fhandle, &layer_id, AT_LAYER)
where layer_id can be an input/output arg.

But I acknowledge this is going to be a much harder sell...

Thanks,
Amir.
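P.S. To make the name_to_handle_at() idea a bit more concrete, here
is how I imagine userspace would iterate the layers of one object.
This is purely hypothetical: AT_LAYER does not exist today, and
layer_id takes the place of the current mount_id out argument:

	char hbuf[sizeof(struct file_handle) + MAX_HANDLE_SZ];
	struct file_handle *fh = (struct file_handle *)hbuf;
	int layer_id;

	/* one call per layer of the object at ovl_root_fd */
	for (layer_id = 0; ; layer_id++) {
		int id = layer_id;	/* in: requested layer */

		fh->handle_bytes = MAX_HANDLE_SZ;
		if (name_to_handle_at(ovl_root_fd, "", fh, &id,
				      AT_LAYER | AT_EMPTY_PATH))
			break;		/* no more layers */
		/* dump {layer_id; fh} for this layer */
	}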