Bob Peterson <rpeterso@xxxxxxxxxx> writes:

> On Wed, 2008-02-13 at 09:23 +0100, Ferenc Wagner wrote:
>
>> Thanks!  This patch indeed fixed the hang.  But of course not the
>> mount:
>>
>> Trying to join cluster "lock_dlm", "pilot:test"
>> Joined cluster. Now mounting FS...
>> GFS: fsid=pilot:test.4294967295: can't mount journal #4294967295
>> GFS: fsid=pilot:test.4294967295: there are only 6 journals (0 - 5)

Hi Bob,

Thanks for looking into this.  Please find my answers below.

> The "4294967295" is really a -1 which is a bad return code on the
> mount.

Aha.  I expected something like that, though it looks more like a
journal number in the output.  Never mind.

> So it should be a process of elimination to find out what went
> wrong.  Several possibilities come to mind:
>
> 1. Is it possible that your file system has a different cluster
>    name ("pilot") from the cluster name in your cluster.conf file?

No: <cluster name="pilot" config_version="3"> in the config.

> 2. Perhaps there is another gfs file system with the same name "test"
>    already mounted?

No, there isn't.  I rebooted the node several times, and it does not
start the cluster infrastructure automatically.

> 3. Perhaps it can't find the locking protocol, lock_dlm (I hope)?
>    Make sure lock_dlm shows up in lsmod.

It does:

# lsmod | grep lock
lock_nolock             3456  0
lock_dlm               21260  1
gfs2                  333228  3 gfs,lock_nolock,lock_dlm
dlm                   108564  10 lock_dlm

> 4. Perhaps gfs can't find the rest of the cluster infrastructure?
>    Check to make sure you did "service cman start"

I did cman_tool join, which started aisexec.

> and have aisexec running on the system having the problem.

Yes, it's still running.

> Also, check /var/log/messages for messages pertaining to cluster
> problems.

It starts with the usual stuff, then:

openais[4504]: [MAIN ] AIS Executive Service RELEASE 'subrev 1358 version 0.80.3'
<lots of component loads>
openais[4504]: [TOTEM] Token Timeout (10000 ms) retransmit timeout (495 ms)
<lots of technical data>
openais[4504]: [TOTEM] entering GATHER state from 15.
openais[4504]: [SERV ] Initialising service handler 'openais extended virtual synchrony service'
<lots of similar lines>
openais[4504]: [CMAN ] CMAN 2.01.00 (built Feb 12 2008 22:08:05) started
openais[4504]: [SYNC ] Not using a virtual synchrony filter.
openais[4504]: [TOTEM] Creating commit token because I am the rep.
<some state transitions>
openais[4504]: [TOTEM] entering OPERATIONAL state.
openais[4504]: [CMAN ] quorum regained, resuming activity
openais[4504]: [CLM  ] got nodejoin message <IP of node1>
openais[4504]: [CLM  ] got nodejoin message <IP of node3>
openais[4504]: [CPG  ] got joinlist message from node 3
ccsd[4500]: Initial status:: Quorate

*Here* comes something possibly interesting, after fence_tool join:

fenced[4543]: fencing deferred to prior member

Though it doesn't look like node3 (which has the filesystem mounted)
would want to fence node1 (which has this message in its syslog).
Is there a command available to find out the current fencing status
or history?
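(Perhaps group_tool can answer this?  Assuming the group_tool utility
from the cluster-2.x suite applies here (a guess on my part, not a
confirmed recipe), I would try something like:

# group_tool ls
# group_tool dump fence

where "ls" should list the fence, dlm and gfs groups with their
current state, and "dump fence" should dump fenced's internal debug
buffer, which presumably records past fencing decisions.)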
Then comes the usual error message:

kernel: dlm: Using TCP for communications
kernel: dlm: connecting to 3
kernel: dlm: got connection from 3
clvmd: Cluster LVM daemon started - connected to CMAN
kernel: Trying to join cluster "lock_dlm", "pilot:test"
kernel: Joined cluster. Now mounting FS...
kernel: GFS: fsid=pilot:test.4294967295: can't mount journal #4294967295
kernel: GFS: fsid=pilot:test.4294967295: there are only 6 journals (0 - 5)

> It sounds to me like we should have a better error message for
> whatever went wrong.  Let's figure that out first and then we can
> go about improving the error messages with a bugzilla if needed.

Sounds like a plan.  Good error messages always help a lot.

> We have improved the error messages considerably from earlier.
> I don't know what version of the gfs2-utils you have, but that
> will contain the common mount helper (/sbin/mount.gfs2 is a hard
> link to /sbin/mount.gfs) that does some of this error processing
> when mounts fail.  So a newer version of the mount helper may be
> better at pointing out what it doesn't like about your file system.

Maybe, but I'm using cluster-2.01.00, and I have had bad experiences
with CVS versions, such as their dependence on a bleeding-edge kernel
and device mapper.

>> # gfs_tool jindex /dev/mapper/gfs-test
>> gfs_tool: /dev/mapper/gfs-test is not a GFS file/filesystem
>>
>> Scary.  What may be the problem?  The other node is using this
>> volume...  I can even unmount/remount it.  Though in dmesg it says:
>
> I wouldn't call it scary at all.  It sounds like gfs_tool may be
> somewhat confused about the mount point.  Try using the mount
> point that was used on the mount command, not the /dev/mapper
> mount point, and see if that helps.

Well, it helps on the node which has the filesystem mounted, but of
course not on the other.  Is gfs_tool supposed to work on mounted
filesystems only?  Probably so.

> I've actually been working on making a better version of that code
> too--both kernel and userland--that improves how gfs_tool finds
> mount points.  For RHEL5, they're bugzillas 431951 (gfs_tool) and
> 431952 (kernel) respectively.  Those changes have not been shipped
> yet, due to code freeze, but the patches are in the bugzilla records.

Do you think I should apply them?  It doesn't sound like they would
help with this problem.

> As for all the kernel dmesgs you noted, that's perfectly normal.
> When you mount a gfs file system, it runs through all the journals
> regardless, checking whether they are clean or need to be replayed,
> so that's all those kernel messages mean.  They're not locked
> (well, they are, but only for a couple of seconds).

Thanks for the clarification.  And what does that deferred fencing
mean?
-- 
Regards,
Feri.

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster