GFS withdrew in function blkalloc_internal

"孙俊伟" <sunjw@xxxxxxxxxxxxxx> · Sat, 13 May 2006 12:10:50 +0800

Hi all,

I have a test cluster with 3 nodes: nd09, nd10 and nd12.
The cluster software is the latest STABLE branch, and the kernel is 2.6.15.

On nd12:
11 processes write sequentially to the GFS without any speed limit;
each process removes its oldest file after it finishes writing a new one.
1 process runs 'ls' over the whole GFS.
200 threads concurrently read 200 files written by the above processes.
5 processes run 'df' on the GFS at 0.5-second intervals.

On nd10:
1 process writes.
200 threads read the same files as on nd12.
1 process runs 'ls'.
5 processes run 'df'.

On nd09:
200 threads read the same files as on nd12.
1 process runs 'ls'.
5 processes run 'df'.
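
To make the write workload concrete, here is a minimal Python sketch of what one writer process does: write a new file sequentially, then remove the oldest file once the new one is complete. It is not the actual test program; the function name rotate_writer, the file naming, the file size, and the number of files kept are all assumptions for illustration.

```python
import os
from collections import deque


def rotate_writer(directory, max_files, rounds, file_size=1 << 20, chunk=64 * 1024):
    """Write `rounds` files one after another into `directory`.

    After each file is completely written, remove the oldest file if more
    than `max_files` exist. Returns the remaining paths, oldest first.
    All parameters are hypothetical; the real test's values are not known.
    """
    kept = deque()            # paths written so far, oldest on the left
    block = b"\0" * chunk     # dummy payload, written sequentially
    for i in range(rounds):
        path = os.path.join(directory, f"data-{i:06d}")
        with open(path, "wb") as f:
            written = 0
            while written < file_size:
                n = min(chunk, file_size - written)
                f.write(block[:n])
                written += n
            f.flush()
            os.fsync(f.fileno())   # make sure the data reaches the filesystem
        kept.append(path)
        # Remove the oldest file only after the newest one is finished.
        if len(kept) > max_files:
            os.remove(kept.popleft())
    return list(kept)
```

For example, rotate_writer("/mnt/gfs/writer01", max_files=200, rounds=100000) would run one such writer; the mount point and the numbers are hypothetical.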

After about 10 hours of testing, GFS withdrew on nodes nd10 and nd12. The messages were:
<--
May 13 07:30:47 nd12 kernel: GFS: fsid=test:gfs-dm1.2: fatal: assertion "x <= length" failed
May 13 07:30:47 nd12 kernel: GFS: fsid=test:gfs-dm1.2:   function = blkalloc_internal
May 13 07:30:47 nd12 kernel: GFS: fsid=test:gfs-dm1.2:   file = /home/sunjw/projects/cluster.STABLE/gfs-kernel/src/gfs/rgrp.c, line = 1458
May 13 07:30:47 nd12 kernel: GFS: fsid=test:gfs-dm1.2:   time = 1147476646
May 13 07:30:47 nd12 kernel: GFS: fsid=test:gfs-dm1.2: about to withdraw from the cluster
May 13 07:30:47 nd12 kernel: GFS: fsid=test:gfs-dm1.2: waiting for outstanding I/O
May 13 07:30:47 nd12 kernel: GFS: fsid=test:gfs-dm1.2: telling LM to withdraw
May 13 07:30:49 nd12 kernel: lock_dlm: withdraw abandoned memory
May 13 07:30:49 nd12 kernel: GFS: fsid=test:gfs-dm1.2: withdrawn

May 13 07:30:54 nd10 kernel: GFS: fsid=test:gfs-dm1.1: jid=2: Trying to acquire journal lock...
May 13 07:30:54 nd10 kernel: GFS: fsid=test:gfs-dm1.1: jid=2: Busy
May 13 07:36:51 nd10 kernel: GFS: fsid=test:gfs-dm1.1: fatal: assertion "x <= length" failed
May 13 07:36:51 nd10 kernel: GFS: fsid=test:gfs-dm1.1:   function = blkalloc_internal
May 13 07:36:51 nd10 kernel: GFS: fsid=test:gfs-dm1.1:   file = /home/sunjw/projects/cluster.STABLE/gfs-kernel/src/gfs/rgrp.c, line = 1458
May 13 07:36:51 nd10 kernel: GFS: fsid=test:gfs-dm1.1:   time = 1147477010
May 13 07:36:51 nd10 kernel: GFS: fsid=test:gfs-dm1.1: about to withdraw from the cluster
May 13 07:36:51 nd10 kernel: GFS: fsid=test:gfs-dm1.1: waiting for outstanding I/O
May 13 07:36:51 nd10 kernel: GFS: fsid=test:gfs-dm1.1: telling LM to withdraw
May 13 07:36:54 nd10 kernel: lock_dlm: withdraw abandoned memory
May 13 07:36:54 nd10 kernel: GFS: fsid=test:gfs-dm1.1: withdrawn

May 13 01:20:05 nd09 kernel: dlm: gfs-dm1: process_lockqueue_reply id 62f203f3 state 0
May 13 01:41:09 nd09 kernel: dlm: gfs-dm1: process_lockqueue_reply id 6fa600de state 0
May 13 07:28:47 nd09 kernel: GFS: fsid=test:gfs-dm1.0: jid=2: Trying to acquire journal lock...
May 13 07:28:47 nd09 kernel: GFS: fsid=test:gfs-dm1.0: jid=2: Looking at journal...
May 13 07:28:47 nd09 kernel: GFS: fsid=test:gfs-dm1.0: jid=2: Acquiring the transaction lock...
May 13 07:28:47 nd09 kernel: GFS: fsid=test:gfs-dm1.0: jid=2: Replaying journal...
May 13 07:28:48 nd09 kernel: GFS: fsid=test:gfs-dm1.0: jid=2: Replayed 160 of 532 blocks
May 13 07:28:48 nd09 kernel: GFS: fsid=test:gfs-dm1.0: jid=2: replays = 160, skips = 99, sames = 273
May 13 07:28:48 nd09 kernel: GFS: fsid=test:gfs-dm1.0: jid=2: Journal replayed in 1s
May 13 07:28:48 nd09 kernel: GFS: fsid=test:gfs-dm1.0: jid=2: Done
May 13 07:34:47 nd09 kernel: GFS: fsid=test:gfs-dm1.0: jid=1: Trying to acquire journal lock...
May 13 07:34:47 nd09 kernel: GFS: fsid=test:gfs-dm1.0: jid=1: Looking at journal...
May 13 07:34:47 nd09 kernel: GFS: fsid=test:gfs-dm1.0: jid=1: Acquiring the transaction lock...
May 13 07:34:47 nd09 kernel: GFS: fsid=test:gfs-dm1.0: jid=1: Replaying journal...
May 13 07:34:47 nd09 kernel: GFS: fsid=test:gfs-dm1.0: jid=1: Replayed 6 of 71 blocks
May 13 07:34:47 nd09 kernel: GFS: fsid=test:gfs-dm1.0: jid=1: replays = 6, skips = 4, sames = 61
May 13 07:34:47 nd09 kernel: GFS: fsid=test:gfs-dm1.0: jid=1: Journal replayed in 1s
May 13 07:34:47 nd09 kernel: GFS: fsid=test:gfs-dm1.0: jid=1: Done
-->
The clocks of the 3 nodes are not synchronized.
What could be the problem?
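
For reference when comparing the logs: the "time =" value in the assertion message looks like seconds since the Unix epoch, taken from the local node's clock. Below is a small Python sketch that converts it and compares it with the syslog stamp on the same line. It assumes the nodes log in the +0800 zone of this mail's header and that the year is 2006 (syslog does not record it); the function names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the nodes log in the same +0800 zone as the mail header.
CST = timezone(timedelta(hours=8))


def gfs_time_to_local(epoch_seconds, tz=CST):
    """Convert a 'time = N' value (assumed Unix epoch seconds) to a datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz)


def syslog_skew(syslog_stamp, epoch_seconds, year=2006, tz=CST):
    """Seconds by which the syslog stamp is later than the 'time =' value.

    `syslog_stamp` is the 'May 13 07:30:47' prefix of the log line; the year
    is supplied separately because syslog does not record it.
    """
    logged = datetime.strptime(f"{year} {syslog_stamp}", "%Y %b %d %H:%M:%S")
    logged = logged.replace(tzinfo=tz)
    return (logged - gfs_time_to_local(epoch_seconds, tz)).total_seconds()


if __name__ == "__main__":
    # Values copied from the nd12 and nd10 assertion messages above.
    for node, stamp, t in [("nd12", "May 13 07:30:47", 1147476646),
                           ("nd10", "May 13 07:36:51", 1147477010)]:
        print(node, gfs_time_to_local(t).isoformat(), syslog_skew(stamp, t))
```

For both nd12 and nd10 the converted value is within about one second of the syslog stamp on the same node, so the field only reflects each node's own clock and cannot be used to order events between nodes.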

Thanks for any reply,
Luckey

--

Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster