Re: [PATCH 9/9][cr][v2]: Restore file-locks

Sukadev Bhattiprolu <sukadev@xxxxxxxxxxxxxxxxxx> · Wed, 26 May 2010 16:57:13 -0700

steve@xxxxxxxxxxx [steve@xxxxxxxxxxx] wrote:
| Hi,
| 
| On Tue, May 18, 2010 at 08:07:32PM -0700, Sukadev Bhattiprolu wrote:
| > Restore POSIX file-locks of an application from its checkpoint image.
| > 
| > Read the saved file-locks from the checkpoint image and for each POSIX
| > lock, call flock_set() to set the lock on the file.
| > 
| > As pointed out by Matt Helsley, no special handling is necessary for a
| > process P2 in the checkpointed container that is blocked on a lock, L1,
| > held by another process P1, since processes in the restarted container
| > begin execution only after all processes have been restored. If the
| > blocked process P2 is restored first, it will prepare to return
| > -ERESTARTSYS from the fcntl() system call, but wait for P1 to be
| > restored. When P1 is restored, it will re-acquire the lock L1 before P1
| > and P2 begin actual execution. This ensures that even if P2 is scheduled
| > to run before P1, P2 will go back to waiting for the lock L1.
| >
| Does that imply certain conditions wrt checkpointed processes and
| NFS exports? I'm not sure I exactly understand the use case which
| this is intended to address.
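
To make the P1/P2 situation from the patch description concrete, here is a
rough user-space sketch (not the patch code, and nothing here touches
checkpoint/restart itself). It only shows that once one process holds a POSIX
lock, a second process attempting the same lock cannot get it. The function
name is made up for illustration; fcntl.lockf() is Python's wrapper around
fcntl()-style locks.

```python
import fcntl
import os
import tempfile


def second_process_is_blocked():
    """Return True if a child cannot take a POSIX lock held by its parent."""
    with tempfile.NamedTemporaryFile() as holder:   # parent plays P1
        fcntl.lockf(holder, fcntl.LOCK_EX)          # P1 holds write lock "L1"
        pid = os.fork()
        if pid == 0:                                # child plays P2
            status = 0
            try:
                with open(holder.name, "r+b") as waiter:
                    # Non-blocking attempt; a blocking request would sleep
                    # here until P1 releases the lock.
                    fcntl.lockf(waiter, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError:
                status = 1                          # lock is held by P1
            os._exit(status)                        # skip parent's cleanup
        _, wait_status = os.waitpid(pid, 0)
        return os.WEXITSTATUS(wait_status) == 1
```

With a blocking request instead of LOCK_NB, P2 would simply wait, which is the
state it returns to after restart once P1 has re-acquired L1.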

Well, yes, this assumes some prerequisites are met.

First, let's look at a single system.  We expect that the application
process tree is run inside a container. This means that the file
system(s) (and other resources like pipes, IPC) that the application
is working with are not modified by a process outside the container.

We also require that the application process tree be frozen before
checkpointing the application. So even if the checkpoint process takes
a few minutes, the state of the resources (files, pipes, signals, etc.)
does not change, since (a) the application is containerized and (b) the
container is frozen.

We already have the ability to run applications inside containers, using
the clone() system call (see lxc.sf.net for example) and the ability to
freeze the application using the freezer cgroup in the Linux kernel.
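
As a minimal sketch of the freezing step, assuming the cgroup-v1 freezer
interface where a cgroup directory contains a freezer.state file that accepts
FROZEN and THAWED: the helper name and the directory argument are
hypothetical, and where the freezer hierarchy is mounted depends on the
system.

```python
import os


def set_freezer_state(cgroup_dir, state):
    """Write FROZEN or THAWED to freezer.state and return the value read back."""
    if state not in ("FROZEN", "THAWED"):
        raise ValueError("state must be FROZEN or THAWED")
    path = os.path.join(cgroup_dir, "freezer.state")
    with open(path, "w") as f:
        f.write(state)
    # On a real freezer cgroup the read-back value may be FREEZING for a
    # while, so a caller would poll until it reports FROZEN.
    with open(path) as f:
        return f.read().strip()
```

The checkpoint would be taken only after the read-back value is FROZEN, and
the container thawed again afterwards by writing THAWED.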

| 
| I was hoping to figure out whether it would also still be safe on
| a cluster filesystem as well,

For clusters and NFS, an external protocol has to be established so that
the distributed application can be started/frozen/checkpointed/restarted
in a coordinated way.

I think that is something that would have to be built on top of the
checkpoint/restart functionality that we are working on. Or maybe there
are existing implementations that we would need to plug into.

Hope that helps, but it's possible I missed your question :-). If so,
please let me know.

| 
| Steve.