I'm trying to get some performance numbers out of GFS2. Before describing
the problem, I should mention that the previous dlm_sendd/recvd spinning
issue is no longer seen after moving to the 2.6.21-rc4 kernel on FC6.

I have a two-node GFS2 setup sharing a disk off node1 using GNBD. I'm
running metadata-heavy tests (i.e. create/read/delete tons of small 8k
files) from node2. The test hangs partway through. I see the following
log messages on node1:

Mar 29 15:16:23 cfs1 gnbd_serv[2723]: startup succeeded
Mar 29 15:16:37 cfs1 gnbd_clusterd[2729]: connected
Mar 29 15:17:15 cfs1 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "ciscogfs2:sridhar"
Mar 29 15:17:15 cfs1 kernel: dlm: connecting to 2
Mar 29 15:17:15 cfs1 kernel: dlm: got connection from 2
Mar 29 15:17:15 cfs1 kernel: GFS2: fsid=ciscogfs2:sridhar.1: Joined cluster. Now mounting FS...
Mar 29 15:17:16 cfs1 kernel: GFS2: fsid=ciscogfs2:sridhar.1: jid=1, already locked for use
Mar 29 15:17:16 cfs1 kernel: GFS2: fsid=ciscogfs2:sridhar.1: jid=1: Looking at journal...
Mar 29 15:17:16 cfs1 kernel: GFS2: fsid=ciscogfs2:sridhar.1: jid=1: Done
Mar 30 09:58:34 cfs1 kernel: dlm: sridhar: remove fr 2 none
Mar 30 09:58:34 cfs1 kernel: dlm: message size 5457 from 2 too big, buf len 4632
Mar 30 09:58:34 cfs1 kernel: dlm: sridhar: remove fr 2 none
Mar 30 09:58:35 cfs1 last message repeated 51 times
Mar 30 09:58:35 cfs1 kernel: dlm: message size 13880 from 2 too big, buf len 85072
Mar 30 09:58:35 cfs1 kernel: dlm: sridhar: remove fr 2 none
Mar 30 09:58:35 cfs1 last message repeated 3 times
Mar 30 09:58:35 cfs1 kernel: dlm: message size 13880 from 2 too big, buf len 93760
Mar 30 09:58:37 cfs1 kernel: dlm: message size 8224 from 2 too big, buf len 101136
Mar 30 09:58:37 cfs1 kernel: dlm: message size 8224 from 2 too big, buf len 101248
Mar 30 09:58:39 cfs1 kernel: dlm: message size 8224 from 2 too big, buf len 101472

sar/iostat show no significant network or disk I/O traffic after this
problem occurs. strace'ing any of the GFS processes hangs. The exception
is the 'aisexec' process, where I see tons of activity with lots of
sendmsg/recvmsg calls. It seems that an error in some cluster-level
component (cman or dlm) causes GFS2 to lock up.
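
For anyone comparing notes, the "too big" lines in the log above can be pulled out and tabulated with a few lines of Python. This is only a rough sketch: the regular expression is written against the exact message text quoted in this post, and the names `TooBig` and `parse_too_big` are made up for illustration. It makes no claim about what the dlm code means by "buf len"; it only reports whether the logged size exceeds the logged buffer length.

```python
import re
from typing import NamedTuple

# Matches kernel lines of the form quoted in this post, e.g.
# "dlm: message size 5457 from 2 too big, buf len 4632"
TOO_BIG = re.compile(r"dlm: message size (\d+) from (\d+) too big, buf len (\d+)")


class TooBig(NamedTuple):
    size: int      # message size reported by dlm
    node: int      # node id the message came from
    buf_len: int   # buffer length reported by dlm


def parse_too_big(lines):
    """Return one TooBig record per matching log line, in input order."""
    events = []
    for line in lines:
        m = TOO_BIG.search(line)
        if m:
            events.append(TooBig(int(m.group(1)), int(m.group(2)), int(m.group(3))))
    return events


# Three lines copied from the log in this post.
SAMPLE = [
    "Mar 30 09:58:34 cfs1 kernel: dlm: message size 5457 from 2 too big, buf len 4632",
    "Mar 30 09:58:34 cfs1 kernel: dlm: sridhar: remove fr 2 none",
    "Mar 30 09:58:35 cfs1 kernel: dlm: message size 13880 from 2 too big, buf len 85072",
]

if __name__ == "__main__":
    for ev in parse_too_big(SAMPLE):
        exceeds = ev.size > ev.buf_len
        print(f"node={ev.node} size={ev.size} buf_len={ev.buf_len} size>buf_len={exceeds}")
```

To run it over a real log, pass an open file (for example `open("/var/log/messages")`) to `parse_too_big` instead of `SAMPLE`.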

Previous tests with random file sizes (ranging from 0 to 1MB) went
through fine. But I also remember that one of the previous block tests
(create/read/rw a 1GB file) had a similar problem.
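
For anyone who wants to try reproducing this, the general shape of the small-file test (create, read back, then delete many 8k files) could be sketched as below. This is not the actual test program used here; the function name, file naming scheme, and the single-threaded create/read/delete ordering are all assumptions made for illustration, and the directory would need to point at a GFS2 mount to exercise the cluster locking.

```python
import os
import tempfile

FILE_SIZE = 8 * 1024  # 8k per file, matching the file size mentioned in this post


def run_small_file_test(directory, count):
    """Create `count` 8k files, read them all back, then delete them.

    Returns the total number of bytes read back.
    (Hypothetical helper; the real test tool is not shown in this post.)
    """
    payload = b"x" * FILE_SIZE
    paths = [os.path.join(directory, f"f{i:07d}") for i in range(count)]

    # Create phase: one small write per file.
    for path in paths:
        with open(path, "wb") as f:
            f.write(payload)

    # Read phase: read each file back in full.
    bytes_read = 0
    for path in paths:
        with open(path, "rb") as f:
            bytes_read += len(f.read())

    # Delete phase: unlink every file that was created.
    for path in paths:
        os.unlink(path)

    return bytes_read


if __name__ == "__main__":
    # Demo against a local temporary directory, not a GFS2 mount.
    with tempfile.TemporaryDirectory() as d:
        total = run_small_file_test(d, 100)
        print(f"read back {total} bytes")
```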

Has anyone seen such a problem? Any clues on how to resolve it?

thanks,
Sridhar
-- Linux-cluster mailing list Linux-cluster@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/linux-cluster