Oren Laadan wrote:
Mike Waychison wrote:
Linus Torvalds wrote:
On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
Ying Han [yinghan@xxxxxxxxxx] wrote:
| Hi Serge:
| I made a patch based on Oren's tree recently which implements a new
| syscall, clone_with_pid. I tested it with a checkpoint/restart process
| tree and it works as expected.
Yes, I think we had a version of clone() with pid a while ago.
Are people _at_all_ thinking about security?
Obviously not.
There's no way we can do anything like this. Sure, it's trivial to do
inside the kernel. But it also sounds like a _wonderful_ attack vector
against badly written user-land software that sends signals and has small
races.
I'm not really sure how this is different than a malicious app going off
and spawning thousands of threads in an attempt to hit a target pid from
a security pov. Sure, it makes it easier, but it's not like there is
anything in place to close the attack vector.
Quite frankly, from having followed the discussion(s) over the last few
weeks about checkpoint/restart in various forms, my reaction to just about
_all_ of this is that people pushing this are pretty damn borderline.
I think you guys are working on all the wrong problems.
Let's face it, we're not going to _ever_ checkpoint any kind of general
case process. Just TCP makes that fundamentally impossible in the general
case, and there are lots and lots of other cases too (just something as
totally _trivial_ as all the files in the filesystem that don't get rolled
back).
In some instances such as ours, TCP is probably the easiest thing to
migrate. In an rpc-based cluster application, TCP is nothing more than
an RPC channel and applications already have to handle RPC channel
failure and re-establishment.
I agree that this is not the 'general case' as you mention above
however. This is the bit that sorta bothers me with the way the
implementation has been going so far on this list. The implementation
that folks are building on top of Oren's patchset tries to be everything
to everybody. For our purposes, we need to have the flexibility of
choosing *how* we checkpoint. The line seems to be arbitrarily drawn at
the kernel being responsible for checkpointing and restoring all
resources associated with a task, and leaving userland with nothing more
than transporting filesystem bits. This approach isn't flexible enough:
Consider the case where we want to stub out most of the TCP file
descriptors with ECONNRESETed sockets because we know that they are RPC
sockets and can re-establish themselves, but we want to use some other
mechanism for TCP sockets we don't know much about. The current
monolithic approach has zero flexibility for doing anything like this,
and I can't figure out how we could even fit anything like this in.
The flexibility exists, but wasn't spelled out, so here it is:
1) Similar to madvise(), I envision a cradvice() that could tell the c/r
something about specific resources, e.g.:
* cradvice(CR_ADV_MEM, ptr, len) -> don't save that memory, it's scratch
* cradvice(CR_ADV_SOCK, fd, CR_ADV_SOCK_RESET) -> reset connection on restart
etc. (never mind the exact interface right now)
2) Tasks can ask to be notified (e.g. register a signal) when a checkpoint
or a restart completes successfully. At that time they can do their private
house-keeping if they know better.
3) If restoring some resource is significantly easier in user space (e.g. a
file-descriptor of some special device which user space knows how to
re-initialize), then the restarting task can prepare it ahead of time,
and, call:
* cradvice(CR_ADV_USERFD, fd, 0) -> use the fd in place instead of trying
to restore it yourself.
This would be called by the embryo process (mktree.c?) before calling
sys_restart?
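To make hints 1) and 3) a bit more concrete, here is a rough userspace mock of what a caller might see. Nothing here exists in any kernel: cradvice() is only the proposal above, the numeric values of the CR_ADV_* codes are invented, and the hint table and cr_fd_advice() lookup are hypothetical helpers added just so the sketch compiles and can be exercised.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical advice codes, named after the ones in the message above.
 * The numeric values are invented for this sketch. */
enum {
    CR_ADV_MEM    = 1,  /* (ptr, len): scratch memory, do not save it */
    CR_ADV_SOCK   = 2,  /* (fd, CR_ADV_SOCK_RESET): reset on restart */
    CR_ADV_USERFD = 3   /* (fd, 0): user space already set up this fd */
};
#define CR_ADV_SOCK_RESET 1UL

struct cr_hint {
    int advice;
    unsigned long arg1;  /* pointer for CR_ADV_MEM, fd otherwise */
    unsigned long arg2;  /* length, sub-option, or 0 */
};

#define CR_MAX_HINTS 16
static struct cr_hint cr_hints[CR_MAX_HINTS];
static size_t cr_nhints;

/* Userspace stand-in for the proposed call: it only records the hint.
 * Returns 0 on success, -1 on an unknown code or a full table. */
int cradvice(int advice, unsigned long arg1, unsigned long arg2)
{
    if (advice < CR_ADV_MEM || advice > CR_ADV_USERFD)
        return -1;
    if (cr_nhints == CR_MAX_HINTS)
        return -1;
    cr_hints[cr_nhints].advice = advice;
    cr_hints[cr_nhints].arg1 = arg1;
    cr_hints[cr_nhints].arg2 = arg2;
    cr_nhints++;
    assert(cr_nhints <= CR_MAX_HINTS);
    return 0;
}

/* What a checkpoint/restart implementation might ask before touching a
 * descriptor: the most recent advice recorded for fd, or 0 if none. */
int cr_fd_advice(int fd)
{
    size_t i = cr_nhints;

    while (i-- > 0) {
        const struct cr_hint *h = &cr_hints[i];

        /* CR_ADV_MEM entries hold a pointer in arg1, not an fd. */
        if (h->advice != CR_ADV_MEM && h->arg1 == (unsigned long)fd)
            return h->advice;
    }
    return 0;
}
```

An RPC-style application would call cradvice(CR_ADV_SOCK, fd, CR_ADV_SOCK_RESET) on each socket it knows how to re-establish and say nothing about the rest, which leaves the default in-kernel handling for the sockets it knows little about.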
Method #3 is what I used in Zap to implement distributed checkpoints, where
it is so much easier to recreate all network connections in user space than
to put that logic into the kernel.
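The user-space half of method #3 can be sketched with calls that exist today. This is not the Zap code, only an illustration under assumptions: prepare_userfd() is a hypothetical helper, and a socketpair() stands in for the real network connection that would be re-established. The point is just that the restarting task can install a freshly created descriptor at the fd number the checkpointed task had been using.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a new connected socket and install one end at wanted_fd, the
 * descriptor number the checkpointed task used. The other end is returned
 * through *peer. Returns 0 on success, -1 on error. */
int prepare_userfd(int wanted_fd, int *peer)
{
    int sv[2];

    assert(peer != NULL);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    /* If the peer end happened to land on wanted_fd, move it out of the
     * way first so that dup2() below does not destroy it. */
    if (sv[1] == wanted_fd) {
        int moved = dup(sv[1]);

        if (moved < 0) {
            close(sv[0]);
            close(sv[1]);
            return -1;
        }
        sv[1] = moved;  /* the old number is replaced by dup2() below */
    }

    if (sv[0] != wanted_fd) {
        if (dup2(sv[0], wanted_fd) < 0) {
            close(sv[0]);
            close(sv[1]);
            return -1;
        }
        close(sv[0]);
    }

    *peer = sv[1];
    /* Under the proposal, the task would now announce the descriptor with
     * something like cradvice(CR_ADV_USERFD, wanted_fd, 0), which is a
     * hypothetical call, before asking the kernel to restart. */
    return 0;
}
```

dup2() atomically closes whatever was open at wanted_fd, so the helper is best run early in the restarting task, before anything else depends on that descriptor number.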
Now, on the other hand, doing the c/r from userland is much less flexible
than in the kernel (e.g. epollfd, futex state and much more) and requires
exposing a tremendous amount of in-kernel data to user space. And we all know
that exposing internals is always a one-way ticket :(
[...]
Oren.