Re: [PATCH RFC] fuse: add generic file store

"Enrico Weigelt, metux IT consult" <info@xxxxxxxxx> · Thu, 24 Jun 2021 16:19:27 +0200

On 22.06.21 08:46, Peng Tao wrote:

Or application recovery after panic ;)

Okay, that's a different scenario. But, of course, the application needs
to have the fds already registered before it crashes. This gives two
options:

a) always store them right after opening and do its own garbage
   collection (e.g. on each close() call) - expensive and complex
b) construct it in a way that even on critical signals (e.g. SIGSEGV,
   SIGBUS) a signal handler can still store the fds somewhere (e.g. using
   statically allocated memory and an extra stack) - tricky

OTOH, I'd try to construct it in a way that some master process
(that can hold the fds even across restarts) is very unlikely to crash,
since it doesn't do much more than that (plus spawning workers).
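
A rough sketch of that master/worker split, assuming plain fork()
inheritance is enough to get the fd into the workers. The pipe is just
a stand-in for whatever fd is expensive to lose; master_demo() and
worker() are names invented here, and the first worker is killed on
purpose to simulate a crash.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Worker: uses the inherited fd, then exits or simulates a crash. */
static void worker(int fd, int crash)
{
    if (write(fd, "x", 1) != 1)
        _exit(2);
    if (crash)
        raise(SIGKILL);              /* stand-in for a segfault or similar */
    _exit(0);
}

/*
 * Master: opens the fd once, then only forks and waits.
 * Returns 0 on success; *workers is the number of workers spawned and
 * *bytes the number of bytes that arrived through the master's fd.
 */
int master_demo(int *workers, int *bytes)
{
    int pipefd[2];
    char buf[16];
    int spawned = 0;
    ssize_t n;

    assert(workers && bytes);
    if (pipe(pipefd) < 0)
        return -1;

    while (spawned < 8) {            /* cap the respawns in this sketch */
        int status;
        pid_t pid = fork();

        if (pid < 0)
            return -1;
        if (pid == 0)
            worker(pipefd[1], spawned == 0);   /* never returns */
        spawned++;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            break;                   /* clean exit: stop respawning */
    }

    /* The fd survived the worker crash because the master still holds it. */
    n = read(pipefd[0], buf, sizeof(buf));
    close(pipefd[0]);
    close(pipefd[1]);
    if (n < 0)
        return -1;
    *workers = spawned;
    *bytes = (int)n;
    return 0;
}
```

Here the first worker dies, the second one writes through the very same
fd, and the master sees both bytes.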

By the way, if you just want to store fds: I'm working on a more generic
solution, a Plan9-style srvfs. It's a file system that stores already
opened fds and on open() gives you that fd (instead of a new one).

My previous experimental implementation did that indirectly by bridging
all operations under the hood (tedious to synchronize the whole file
state), but I'm now taking a fresh start w/ adding some "file boxing"
mechanism to the kernel (patches not ready for submission):

* a file system's open operation can put a pointer to another struct
  file into the struct file it's operating on -- that's what I called
  a "boxed file".
* the places (actually two) that actually create new struct files and
  call into the open chain (e.g. through vfs_open() etc.) then do the
  unboxing -- if there's a boxed file, they fetch it out and drop the
  newly created one.
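
To make the two steps a bit more concrete, here is a toy userspace
model. It is not kernel code and not taken from the unfinished patches:
struct toy_file and all toy_* names are invented for this sketch, the
refcount is a plain int, and there is no locking.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_file {
    int refcount;
    struct toy_file *boxed;   /* set by an open op that hands out another file */
};

/* Open operation of some file system, called on a freshly created file. */
typedef int (*toy_open_fn)(struct toy_file *newf, void *fs_private);

struct toy_file *toy_file_alloc(void)
{
    struct toy_file *f = calloc(1, sizeof(*f));

    if (f)
        f->refcount = 1;
    return f;
}

struct toy_file *toy_file_get(struct toy_file *f)
{
    f->refcount++;
    return f;
}

void toy_file_put(struct toy_file *f)
{
    assert(f->refcount > 0);
    if (--f->refcount == 0) {
        if (f->boxed)
            toy_file_put(f->boxed);
        free(f);
    }
}

/* Step 1: a srvfs-like open op boxes the already opened, stored file. */
int toy_srv_open(struct toy_file *newf, void *fs_private)
{
    newf->boxed = toy_file_get(fs_private);
    return 0;
}

/* Step 2: the place that creates the new file does the unboxing. */
struct toy_file *toy_do_open(toy_open_fn open_op, void *fs_private)
{
    struct toy_file *f = toy_file_alloc();

    if (!f)
        return NULL;
    if (open_op(f, fs_private) != 0) {
        toy_file_put(f);
        return NULL;
    }
    if (f->boxed) {
        struct toy_file *inner = f->boxed;

        f->boxed = NULL;      /* move the reference out of the box */
        toy_file_put(f);      /* drop the just-created file */
        return inner;
    }
    return f;
}
```

With a stored file `s`, toy_do_open(toy_srv_open, s) returns `s` itself
with one extra reference, instead of a new file.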

Alessio already has a similar implementation in his patchset. The RPC
patch tries to make it generic and thus usable for other use cases
like fuse daemon upgrade and panic-recovery.

I believe this shouldn't be some fuse specific thing. And we certainly
have to make sure that it can't be abused for dos'ing the machine.
Not sure whether that should be accounted to a namespace or cgroup.

I'd hate to run into situations where even killing all processes holding
some file open leads to a situation where it remains open inside the
kernel, thus blocking e.g. unmounting. I already see operators getting
very angry ... :o

This is really a different design approach. The idea is to keep an FD
active beyond the lifetime of a running process so that we can do
panic recovery. Alessio's patchset has a similar side effect in some
corner cases and this RFC patch makes it a semantic promise. Whether
ops like it would really depend on what they want.

The problem is: (most) fd's are bound to some processes - when they're
killed, the fds are closed. Usually you can force files to be closed
(and thus allow unmount) by checking via lsof which processes still hold
open fds to them and killing them. If we can't do that anymore, we can run
into big trouble. There needs to be some clear lifetime control for
that.

Let's look at containers: usually the runtime/orchestration sets up a
bunch of mounts before starting the actual workload inside the 
container. On container shutdown, the processes are killed and then
everything's unmounted again (temporary storage, eg. lvm volumes or
btrfs subvols are removed afterwards).

Now, with the persistent fds, an unprivileged user can block that
(either accidentally or on purpose). We cannot allow that to happen.

Apropos containers: this really should be bound, somehow, to some
namespace (not sure whether mntns or userns is the right place),
so containers cannot interfere with each other.

I agree but I understand the rationale as well. A normal FUSE
read/write uses FUSE daemon creds so the semantics are the same.
Otherwise as you outline below, we'd have to go through all the
read/write callbacks to make sure none of them is checking process
creds.

I've actually looked deeper into this. There indeed are certain places
with checks for CAP_SYS_ADMIN, but these are really root-only things
where we should think very carefully whether they should work with
fds passed to processes under separate users at all. And they also
never worked with fd passing via unix socket.
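
For reference, the fd passing via unix socket mentioned here is the
usual SCM_RIGHTS mechanism. A minimal sketch, assuming a connected
AF_UNIX socket; send_fd() and recv_fd() are helper names made up for
this mail:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one fd over a connected AF_UNIX socket. Returns 0 on success. */
int send_fd(int sock, int fd)
{
    char dummy = 'F';                /* at least one byte of real data */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;        /* forces correct alignment */
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd. Returns the new fd number, or -1 on error. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new fd number that refers to the same open file
description as the sender's fd.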

--mtx

--
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@xxxxxxxxx -- +49-151-27565287