Re: [RFC PATCH 0/9] fuse: API for Checkpoint/Restore

Aleksandr Mikhalitsyn <aleksandr.mikhalitsyn@xxxxxxxxxxxxx> · Mon, 6 Mar 2023 17:44:25 +0100

On Mon, Mar 6, 2023 at 5:15 PM Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
>
> On Mon, 20 Feb 2023 at 20:38, Alexander Mikhalitsyn
> <aleksandr.mikhalitsyn@xxxxxxxxxxxxx> wrote:
> >
> > Hello everyone,
> >
> > It would be great to hear your comments regarding this proof-of-concept Checkpoint/Restore API for FUSE.
> >
> > Support of FUSE C/R is a challenging task for CRIU [1]. Last year I've given a brief talk on LPC 2022
> > about how we handle files C/R in CRIU and which blockers we have for FUSE filesystems. [2]
> >
> > The main problem for CRIU is that we have to restore mount namespaces and memory mappings before the process tree.
> > It means that when CRIU is performing mount of fuse filesystem it can't use the original FUSE daemon from the
> > restorable process tree, but instead use a "fake daemon".
> >
> > This leads to many other technical problems:
> > * "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
> > This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
> > like no_open, no_flush, readahead, and so on).
> > * each fuse request has a unique ID. It could confuse userspace if this unique ID sequence was reset.
> >
> > We can workaround some issues and implement fragile and limited support of FUSE in CRIU but it doesn't make any sense, IMHO.
> > Btw, I've enumerated only CRIU restore-stage problems there. The dump stage is another story...
> >
> > My proposal is not only about CRIU. The same interface can be useful for FUSE mounts recovery after daemon crashes.
> > LXC project uses LXCFS [3] as a procfs/cgroupfs/sysfs emulation layer for containers. We are using a scheme when
> > one LXCFS daemon handles all the work for all the containers and we use bindmounts to overmount particular
> > files/directories in procfs/cgroupfs/sysfs. If this single daemon crashes for some reason we are in trouble,
> > because we have to restart all the containers (fuse bindmounts become invalid after the crash).
> > The solution is fairly easy:
> > allow somehow to reinitialize the existing fuse connection and replace the daemon on the fly
> > This case is a little bit simpler than CRIU cause we don't need to care about the previously opened files
> > and other stuff, we are only interested in mounts.
> >
> > Current PoC implementation was developed and tested with this "recovery case".
> > Right now I only have LXCFS patched and have nothing for CRIU. But I wanted to discuss this idea before going forward with CRIU.
>

Hi Miklos,

> Apparently all of the added mechanisms (REINIT, BM_REVAL, conn_gen)
> are crash recovery related, and not useful for C/R.  Why is this being
> advertised as a precursor for CRIU support?

It's because I'm doing this with CRIU in mind too, I think it's a good
way to make a universal interface
which can address not only the recovery case but also the C/R, cause
in some sense it's a close problem.
But of course, Checkpoint/Restore is a way more trickier. But before
doing all the work with CRIU PoC,
I wanted to consult with you and folks if there are any serious
objections to this interface/feature or, conversely,
if there is someone else who is interested in it.

Now about interfaces REINIT, BM_REVAL.

I think it will be useful for CRIU case, but probably I need to extend
it a little bit, as I mentioned earlier in the cover letter:
> >* "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
> > This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
> > like no_open, no_flush, readahead, and so on).

So, after the "fake" demon has done its job during CRIU restore, we
need to replace it with the actual demon from
the dumpee tree and performing REINIT looks like a sanner way.

The next point is that if we use REINIT during CRIU restore, then we
automatically need to have BM_REINIT too,
otherwise all restored bind mounts become invalid.

Conn generation is not a problem for CRIU if we are not exposing it to
the userspace. It's just a technical thing to distinguish
old and new inodes/struct file's.

>
> BTW here's some earlier attempt at partial recovery, which might be interesting:
>
>   https://lore.kernel.org/all/CAPm50a+j8UL9g3UwpRsye5e+a=M0Hy7Tf1FdfwOrUUBWMyosNg@xxxxxxxxxxxxxx/

Oh, that's interesting. Thanks for mentioning this! And the most
interesting thing is that Peng Hao mentioned LXCFS as a use case :-)

Added Peng Hao to CC

Kind regards,
Alex

>
> Thanks,
> Miklos