Hi,

Sorry for the late reply.

On 2010/03/30, at 12:05, Serge E. Hallyn wrote:

> Quoting Jiro SEKIBA (jir@xxxxxxxxxxxxxxxxx):
>> Hi
>>
>> On 2010/03/25, at 1:47, Serge E. Hallyn wrote:
>>
>>> Quoting Jiro SEKIBA (jir@xxxxxxxxxxxxxxxxx):
>>>>> If it doesn't work, can you please describe again the exact order of
>>>>> commands that you use and the reported error(s) ?
>>>>>
>>>> I'll let you know in any cases.
>>>>
>>>> Thank you very much for the advice
>>>
>>> Hi Jiro,
>>>
>>> Can you fetch the latest cr_tests
>>> (git clone git://git.sr71.net/~hallyn/cr_tests)
>>>
>>> and
>>>     cd cr_tests; make; cd simple
>>>     sh runtests.sh
>>>
>>> and tell me whether the second (restart --self) test succeeds?
>>> If it fails, can you send me the cr_*/log2 contents?
>>>
>>
>> I've tried on ckpt-v20 and the above test looks OK.
>> And looks like self_checkpointing is working fine so far.
>>
>> However, I'm still not able to restart external checkpoint correctly.
>>
>> Here are the program and scripts I used for the test.
>> I used user-cr ckpt-v20 branch for checkpoint/restart program.
>>
>> This time I disconnect the program from tty completely.
>>
>> ----------8<----------8<----------test.c----------8<----------8<----------
>> #include <stdio.h>
>> #include <unistd.h>
>>
>> int main(void)
>> {
>>   FILE *fp;
>>   int i;
>>   pid_t pid;
>>   int st;
>>
>>   if(fork()) {
>>     return 0;
>
> Odd thing to do, not sure if you had a reason for it.
> Still, should be fine :)
>
>>   } else {
>>     waitpid(getppid(), &st, NULL);
>>
>>     close(0);
>>     close(1);
>>     close(2);
>>     setsid();
>>
>>     if(fork()) {
>>       return 0;
>>     } else
>>       waitpid(getppid(), &st, NULL);
>>   }
>>
>>   //unlink("/tmp/test.out");
>>   fp = fopen("/tmp/test.out","w");
>>
>>   for(i=0;i<10;i++) {
>>     fprintf(fp,"%d\n",i);
>>     fflush(fp);
>>     sleep(1);
>>   }
>>
>>   fclose(fp);
>>   return 0;
>> }
>> ----------8<----------8<----------test.c----------8<----------8<----------
>>
>> ----------8<----------8<----------checkpoint.sh----------8<----------8<----------
>> #!/bin/sh
>>
>> CLOG=checkpoint.log
>> RLOG=restart.log
>> rm -f $CLOG $RLOG
>>
>> ./test &
>> sleep 1
>> PID=$(ps x | grep test | grep -v grep |cut -f 2 -d' ')
>>
>> sleep 2
>> echo $PID > /cgroup/0/tasks
>>
>> echo FROZEN > /cgroup/0/freezer.state
>> ./checkpoint -l $CLOG -v $PID > ckpt.image
>>
>> mv /tmp/test.out /tmp/test.out.orig
>> cp /tmp/test.out.orig /tmp/test.out
>>
>> echo THAWED > /cgroup/0/freezer.state
>>
>> ./restart --pidns -l $RLOG -v -i ckpt.image;
>> ----------8<----------8<----------checkpoint.sh----------8<----------8<----------
>>
>> When I run the above script, I got following:
>>
>> # mount -t cgroup -o freezer cgroup /cgroup
>> # mkdir /cgroup/0
>> # sh checkpoint.sh
>> checkpoint id 8
>> Success
>>
>> Then, I'm expecting to see number 0 to 9 in /tmp/test.out, but
>> I only got 0 to 3, which is the state I froze and checkpointed the process.
>>
>> checkpoint.log and restart.log are empty.
>> I guess it means the programs worked fine.
>>
>> I attached the dmesg I got by the single session of the script.
>> It looks the restart tries to reopen /tmp/test.out.
>>
>> Could you give me any clues that I should check with?
>
> Hmm, with ckpt-v20 of both kernel and user, on a powerpc system, I get:
>
> elm3b203:/usr/src/jiro # sh checkpoint.sh
> checkpoint id 146
> Success
> elm3b203:/usr/src/jiro # ls
> checkpoint.log  checkpoint.sh  ckpt.image  restart.log  test  test.c
> elm3b203:/usr/src/jiro # cat /tmp/test.out
> 0
> 1
> 2
> 3
> 4
> 5
> 6
> 7
> 8
> 9

Hmm, OK, thank you very much for testing the script.
So what I'm doing is not pointless so far.

>> My environment is Virtualbox VM.
>> I tried both with VT and without VT.
>> No virtualbox guest module is installed.
>
> What distro are you on?
>

I'm using Ubuntu 9.10.
I also found that this Ubuntu release uses eglibc instead of glibc.
The version is eglibc-2.10.1.

> Anyway, two things to do.  First, add '-d' to your restart flags, so
>
>     restart --pidns -l $RLOG -vd -i ckpt.image
>

I got the following. It looks like the task somehow gets a SEGV right
after restarting. I attached the whole log.

--------8<--------8<--------8<--------8<--------8<--------
<6113>number of tasks: 1
<6113>total tasks (including ghosts): 1
<6113>====== TASKS
<6113> [0] pid 6102 ppid 1 sid 0 creator 0
<6113>............
<6114>====== PIDS ARRAY
<6114>[0] pid 6102 ppid 1 sid 0 pgid 0
<6114>............
<6113>new pidns without init
<6113>forking coordinator in new pidns
<1>forking child vpid 6102 flags 0x1
<6102>root task pid 6102
<6102>pid 6102: pid 6102 sid 0 parent 1
<6114>c/r read input 16384
<6114>c/r read input 16384
<6114>c/r read input 16384
<6114>c/r read input 16384
<6102>about to call sys_restart(), flags 0
<1>forked child vpid 6102 (asked 6102)
<6114>c/r read input 16384
<6114>c/r read input 16384
...
<6114>c/r read input 16384
<6114>c/r read input 3605
<6114>c/r read input 0
<1>restart succeeded
<1>SIGCHLD: already collected
<1>task terminated with signal 11
<1>mimic sig 11
<1>c/r succeeded
<6113>SIGCHLD: already collected
<6113>task exited with status 0
--------8<--------8<--------8<--------8<--------8<--------

> That will give you debugging info.
> For instance I get:
>
> checkpoint id 147
> <2507>number of tasks: 1
> <2507>total tasks (including ghosts): 1
> <2507>====== TASKS
> <2507> [0] pid 2497 ppid 1 sid 0 creator 0
> <2507>............
> <2507>new pidns without init
> <2507>forking coordinator in new pidns
> <2508>====== PIDS ARRAY
> <2508>[0] pid 2497 ppid 1 sid 0 pgid 0
> <2508>............
> <1>forking child vpid 2497 flags 0x1
> <1>forked child vpid 2497 (asked 2497)
> <2497>root task pid 2497
> <2497>pid 2497: pid 2497 sid 0 parent 1
> <2497>about to call sys_restart(), flags 0
> <2508>c/r read input 16384
> <2508>c/r read input 16384
> <2508>c/r read input 16384
> <2508>c/r read input 16384
> <2508>c/r read input 16384
> <2508>c/r read input 16384
> <2508>c/r read input 16384
> <2508>c/r read input 16384
> <2508>c/r read input 16384
> <2508>c/r read input 16384
> <2508>c/r read input 16384
> <2508>c/r read input 8336
> <2508>c/r read input 0
> Success
> <1>restart succeeded
> <1>SIGCHLD: already collected
> <1>task exited with status 0
> <1>mimic ret 0
> <1>c/r succeeded
> <2507>SIGCHLD: already collected
> <2507>task exited with status 0
>
>
> The other thing is to restart frozen and attach strace or gdb to the
> restarted test before thawing.  So perhaps
>
>     # cc -g -o test test.c
>     # sh checkpoint.sh
>
> Then when that has failed, do
>
>     # mkdir /cgroup/1
>     # restart -F /cgroup/1 -i ckpt.image
>
> That will hang.  Then in another terminal, you can
>
>     # gdb -se test -p `pidof test`
>
> and in a third terminal,
>
>     # echo THAWED > /cgroup/1/freezer.state
>
> Now in gdb you can figure out where the task is and step through
> to see where it dies.

I attached gdb to the restarted process and found where I got the SEGV.
Here is the corresponding gdb log:

--------8<--------8<--------8<--------8<--------8<--------
0xb77eba50 in __nanosleep_nocancel () from /lib/tls/i686/cmov/libc.so.6
(gdb) n
Single stepping until exit from function __nanosleep_nocancel,
which has no line number information.
__sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:139
139       if (result == 0 && seconds != 0)
(gdb) n
148     }
(gdb)
main () at test.c:33
33        for(i=0;i<10;i++) {
(gdb)
34          fprintf(fp,"%d\n",i);
(gdb) s
__fprintf (stream=0x93b3008, format=0x8048801 "%d\n") at fprintf.c:27
27      __fprintf (FILE *stream, const char *format, ...)
(gdb)
33        done = vfprintf (stream, format, arg);
(gdb) s
_IO_vfprintf_internal (s=0x93b3008, format=0x8048801 "%d\n", ap=0xbf9c5448 "\004") at vfprintf.c:210
210     {
(gdb)
245       int save_errno = errno;
(gdb)
210     {
(gdb) n
245       int save_errno = errno;
(gdb) p save_errno
$1 = 2
(gdb) p errno
Cannot find thread-local variables on this target
(gdb) n

Program received signal SIGSEGV, Segmentation fault.
_IO_vfprintf_internal (s=0x93b3008, format=0x8048801 "%d\n", ap=0xbf9c5448 "\004") at vfprintf.c:245
245       int save_errno = errno;
--------8<--------8<--------8<--------8<--------8<--------

It looks like errno is missing. Hmm.
I also attached the whole gdb log.

Thanks,

regards,

>
> thanks,
> -serge
Attachment:
gdb.log
Description: Binary data
Attachment:
restart-error.log
Description: Binary data
_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linux-foundation.org/mailman/listinfo/containers