Hi He,
Can you please re-create the problem with -L DEBUG and post both the
client and server side logs?
Thanks,
Vijay
He Xiaobin wrote:
I use glusterfs in a cluster system (configured as:
dht->afr->client->server->iothreads->locks->posix). After days of
running it is stable, but performance is poor (slower than NFS
exported from a single server). More importantly, I have run into a
bug in the last few days. This is really an emergency, so I need your
help!
What is the bug? In this system I use mvapich+blcr for task
checkpoint and restore. I don't know how mvapich works internally, but
I am sure it uses glusterfs in my case. When a task is checkpointed on
glusterfs, one ckpt file is created for each process of the task. All
the ckpt files are placed in a directory called 1, and a symbolic link
called 0 is created pointing to directory 1. Here is an example:
fortest is the username, .ckpt is the ckpt file directory for this
user, 1972 is the task id, 0 is the symbolic link, and bt.C.64-19.ckpt
is the ckpt file of the task's 19th process.
[fortest@gfsclient02 1972]$ pwd
/mnt/glusterfs/.ckpt/1972
[fortest@gfsclient02 1972]$ ll
total 132
lrwxrwxrwx 1 fortest fortest 31 Sep 4 17:09 0 -> /mnt/glusterfs/fortest/.ckpt/1972/1
drwx------ 2 fortest fortest 65536 Sep 4 20:06 1
[fortest@gfsclient02 1972]$ ls 1/
bt.C.64-0.ckpt   bt.C.64-21.ckpt  bt.C.64-33.ckpt  bt.C.64-45.ckpt  bt.C.64-57.ckpt
bt.C.64-10.ckpt  bt.C.64-22.ckpt  bt.C.64-34.ckpt  bt.C.64-46.ckpt  bt.C.64-58.ckpt
bt.C.64-11.ckpt  bt.C.64-23.ckpt  bt.C.64-35.ckpt  bt.C.64-47.ckpt  bt.C.64-59.ckpt
bt.C.64-12.ckpt  bt.C.64-24.ckpt  bt.C.64-36.ckpt  bt.C.64-48.ckpt  bt.C.64-5.ckpt
bt.C.64-13.ckpt  bt.C.64-25.ckpt  bt.C.64-37.ckpt  bt.C.64-49.ckpt  bt.C.64-60.ckpt
bt.C.64-14.ckpt  bt.C.64-26.ckpt  bt.C.64-38.ckpt  bt.C.64-4.ckpt   bt.C.64-61.ckpt
bt.C.64-15.ckpt  bt.C.64-27.ckpt  bt.C.64-39.ckpt  bt.C.64-50.ckpt  bt.C.64-62.ckpt
bt.C.64-16.ckpt  bt.C.64-28.ckpt  bt.C.64-3.ckpt   bt.C.64-51.ckpt  bt.C.64-63.ckpt
bt.C.64-17.ckpt  bt.C.64-29.ckpt  bt.C.64-40.ckpt  bt.C.64-52.ckpt  bt.C.64-6.ckpt
bt.C.64-18.ckpt  bt.C.64-2.ckpt   bt.C.64-41.ckpt  bt.C.64-53.ckpt  bt.C.64-7.ckpt
bt.C.64-19.ckpt  bt.C.64-30.ckpt  bt.C.64-42.ckpt  bt.C.64-54.ckpt  bt.C.64-8.ckpt
bt.C.64-1.ckpt   bt.C.64-31.ckpt  bt.C.64-43.ckpt  bt.C.64-55.ckpt  bt.C.64-9.ckpt
bt.C.64-20.ckpt  bt.C.64-32.ckpt  bt.C.64-44.ckpt  bt.C.64-56.ckpt
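To experiment with this layout without running mvapich, the directory
structure shown above can be approximated by hand. The following is
only a minimal sketch: the function name make_ckpt_layout and its
parameters are hypothetical, the files contain placeholder bytes
rather than real BLCR checkpoints, and it does not claim to reproduce
how mvapich actually writes them.

```python
import os

def make_ckpt_layout(base, task_id="1972", nprocs=64, prefix="bt.C.64"):
    """Create <base>/.ckpt/<task_id>/1 holding one ckpt file per process,
    plus a symbolic link 0 that points at the absolute path of directory 1.
    Hypothetical helper; names and defaults mirror the listing above."""
    task_dir = os.path.join(os.path.abspath(base), ".ckpt", task_id)
    data_dir = os.path.join(task_dir, "1")
    os.makedirs(data_dir)
    os.chmod(data_dir, 0o700)  # same mode as directory 1 in the listing
    for rank in range(nprocs):
        name = "%s-%d.ckpt" % (prefix, rank)
        with open(os.path.join(data_dir, name), "wb") as f:
            f.write(b"placeholder")  # not a real checkpoint image
    # The link target is absolute, as in the "ll" output above.
    os.symlink(data_dir, os.path.join(task_dir, "0"))
    return task_dir
```

Calling make_ckpt_layout("/mnt/glusterfs/fortest") on a test mount
would give a tree of the same shape as task 1972, assuming that path
is writable and the task directory does not exist yet.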
When the task needs to be restored, mvapich reads the ckpt files
through 0 (the symbolic link) and restores the task. All of this works
smoothly on NFS, but on glusterfs it outputs the messages below.
Sometimes the restore eventually succeeds, while at other times it
fails with almost the same messages. I have verified that the files
mvapich reports as missing are indeed there. Another useful hint: the
fewer gluster clients take part in the task, the less often this bug
shows up during restore. Starting glusterfs without direct-io did not
help either.
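The symptom described here (a file that exists in directory 1 but is
reported missing when opened through the link 0) can be checked
without mvapich. The sketch below is an assumption about how one might
test it, not what mvapich does: the function name missing_via_symlink
and its parameters are hypothetical, and the rounds argument is there
only because the failure is intermittent.

```python
import os

def missing_via_symlink(task_dir, link_name="0", dir_name="1", rounds=1):
    """Return the names that are listed in <task_dir>/<dir_name> but fail
    with ENOENT when stat()ed through <task_dir>/<link_name>.
    Hypothetical helper for narrowing down the restore failure."""
    direct = os.path.join(task_dir, dir_name)
    via_link = os.path.join(task_dir, link_name)
    missing = set()
    for _ in range(rounds):
        for name in os.listdir(direct):
            try:
                os.stat(os.path.join(via_link, name))
            except FileNotFoundError:
                # Same error mvapich prints: "No such file or directory"
                missing.add(name)
    return sorted(missing)
```

Running it from several clients at once, for example
missing_via_symlink("/mnt/glusterfs/fortest/.ckpt/1972", rounds=100),
should return an empty list on a healthy mount. Any name it returns is
a file that is visible directly but not through the link.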
OUTPUT OF THE TASK ON RESTORE:
19: Restart: path /mnt/glusterfs/fortest/.ckpt/1972/0/bt.C.64-19.ckpt: No such file or directory
20: Restart: path /mnt/glusterfs/fortest/.ckpt/1972/0/bt.C.64-20.ckpt: No such file or directory
srun: error: gfsclient10: task[19-20]: Exited with exit code 1
21: Restart: path /mnt/glusterfs/fortest/.ckpt/1972/0/bt.C.64-21.ckpt: No such file or directory
18: Restart: path /mnt/glusterfs/fortest/.ckpt/1972/0/bt.C.64-18.ckpt: No such file or directory
srun: error: gfsclient10: task21: Exited with exit code 1
srun: error: cn010: task18: Exited with exit code 1
17: Restart: path /mnt/glusterfs/fortest/.ckpt/1972/0/bt.C.64-17.ckpt: No such file or directory
srun: error: gfsclient10: task17: Exited with exit code 1
23: Restart: path /mnt/glusterfs/fortest/.ckpt/1972/0/bt.C.64-23.ckpt: No such file or directory
22: Restart: path /mnt/glusterfs/fortest/.ckpt/1972/0/bt.C.64-22.ckpt: No such file or directory
srun: error: gfsclient10: task23: Exited with exit code 1
srun: error: cn010: task[16,22]: Exited with exit code 1
16: Restart: path /mnt/glusterfs/fortest/.ckpt/1972/0/bt.C.64-16.ckpt: No such file or directory
I loaded the "debug/trace" translator and started glusterfs with
"-L DEBUG", and got the following logs when the ckpt files could not
be found:
[2009-09-04 17:12:35] N [trace.c:1290:trace_readlink] tr0: 174536: (loc {path=/fortest/.ckpt/1972/0, ino=1380450540}, size=4096)
[2009-09-04 17:12:35] N [trace.c:484:trace_readlink_cbk] tr0: 174536: (op_ret=31, op_errno=0, buf=/mnt/glusterfs/fortest/.ckpt/1972/1)
[2009-09-04 17:12:35] E [fuse-bridge.c:987:fuse_readlink_cbk] glusterfs-fuse: 174536: /fortest/.ckpt/1972/0 => /mnt/glusterfs/fortest/.ckpt/1972/1 @ 1252055555
[2009-09-04 17:12:35] N [trace.c:1245:trace_lookup] tr0: 174537: (loc {path=/fortest/.ckpt/1972/1, ino=0})
[2009-09-04 17:12:35] N [trace.c:513:trace_lookup_cbk] tr0: 174508: (op_ret=0, ino=0, *buf {st_dev=2065, st_ino=7068450884, st_mode=40700, st_nlink=2, st_uid=1001, st_gid=1001, st_rdev=0, st_size=65536, st_blksize=4096, st_blocks=256})
[2009-09-04 17:12:35] E [fuse-bridge.c:255:fuse_loc_fill] glusterfs-fuse: inode_path failed for 8003256399/bt.C.64-22.ckpt @ 1252055555
[2009-09-04 17:12:35] W [fuse-bridge.c:436:fuse_lookup] glusterfs-fuse: 174539: LOOKUP 8003256399/bt.C.64-22.ckpt (fuse_loc_fill() failed)
[2009-09-04 17:12:35] N [trace.c:513:trace_lookup_cbk] tr0: 174522: (op_ret=0, ino=0, *buf {st_dev=2065, st_ino=7068450884, st_mode=40700, st_nlink=2, st_uid=1001, st_gid=1001, st_rdev=0, st_size=65536, st_blksize=4096, st_blocks=256})
[2009-09-04 17:12:35] E [fuse-bridge.c:255:fuse_loc_fill] glusterfs-fuse: inode_path failed for 8003256399/bt.C.64-16.ckpt @ 1252055555
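When the full client log is large, the "inode_path failed" entries
can be pulled out with a few lines of script. This is only a sketch:
the function name failed_lookups is hypothetical, and the regular
expression is based solely on the message format visible in the
excerpt above, so other glusterfs versions may word it differently.

```python
import re

# Matches e.g. "inode_path failed for 8003256399/bt.C.64-22.ckpt"
INODE_PATH_FAILED = re.compile(r"inode_path failed for\s+(\d+)/(\S+)")

def failed_lookups(log_lines):
    """Return (parent_number, file_name) for every fuse_loc_fill failure.
    Expects unwrapped log lines, one entry per line."""
    hits = []
    for line in log_lines:
        match = INODE_PATH_FAILED.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits
```

Applied to the excerpt above, it yields two entries, both with the
parent number 8003256399. Grouping the results by that number may help
show whether every failure involves the same directory.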