1) The "easy" one first: the man page for clusvcadm lists a -l option
for locking the service managers, but running clusvcadm shows that this
option is no longer available. The man page also refers to the
clushutdown command as the preferred way of performing this action, but
on my system there is a man page for clushutdown and no binary. So how
does one go about doing this?

2) I was having trouble getting my services restarted. For example, I
have a service named httpd which sets up an IP address and starts httpd
from the /etc/init.d/httpd script. Running 'clusvcadm -e httpd'
complained with the oh-so-informative message:
<err> #43: Service httpd has failed; can not start.
I read somewhere that services have to be disabled and re-enabled after
a failure, so I tried -d instead and got the following:
<notice> stop on script "httpd init script" returned 1 (generic error)
<crit> #12: RG httpd failed to stop; intervention required.
I finally figured out that I had to manually start the service on a
node, then run clusvcadm -d, then run clusvcadm -e. Presumably the first
step would not have been necessary if the httpd script didn't return an
error status when it is passed stop while the daemon is not running.
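
The recovery sequence described above can be sketched as a small shell function. This is only an illustration of the three steps, not a recommended procedure: the function name recover_service, the DRY_RUN variable, and the default init script path are assumptions made for the example, and only the clusvcadm -d and -e options already mentioned in this post are used.

```shell
#!/bin/sh
# Sketch of the manual recovery sequence for a failed cluster service.
# recover_service and DRY_RUN are hypothetical names for this example.
# Set DRY_RUN=echo to print the commands instead of running them.
recover_service() {
    svc="$1"                        # cluster service name, e.g. httpd
    init="${2:-/etc/init.d/$svc}"   # assumed init script location

    # 1. Start the daemon by hand so that a later "stop" can succeed.
    # 2. Disable the failed service.
    # 3. Re-enable the service.
    ${DRY_RUN:-} "$init" start &&
    ${DRY_RUN:-} clusvcadm -d "$svc" &&
    ${DRY_RUN:-} clusvcadm -e "$svc"
}
```

Calling it as DRY_RUN=echo recover_service httpd prints the three commands without touching the cluster, which is a safe way to check the order first.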

Any opinions on whether it makes sense to alter init scripts so that
'stop' is not an error when the daemon is not running (so that running
clusvcadm -d on a service that isn't running might work)?
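
To make the question concrete, here is a minimal sketch of what a "tolerant stop" could look like as a wrapper, rather than as a change to the init script itself. It assumes the init script supports a status action that exits 0 only when the daemon is running; the names tolerant_stop and INIT_SCRIPT are hypothetical, and whether the cluster software would accept such a wrapper is exactly the open question.

```shell
#!/bin/sh
# Hypothetical wrapper: treat "stop" on an already-stopped daemon as success.
# INIT_SCRIPT defaults to the httpd init script mentioned in this post.
INIT_SCRIPT="${INIT_SCRIPT:-/etc/init.d/httpd}"

tolerant_stop() {
    # Assumption: "status" exits 0 only when the daemon is running.
    if "$INIT_SCRIPT" status >/dev/null 2>&1; then
        # Daemon is running: do a real stop and pass its exit status through.
        "$INIT_SCRIPT" stop
    else
        # Daemon is not running: nothing to stop, so report success.
        return 0
    fi
}
```

A real stop failure still returns a non-zero status, so only the "already stopped" case is changed.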

3) The biggie: I have a GFS filesystem on a shared FC storage node
(AX100). I haven't put any "real" data on it yet because I'm still
testing, but yesterday I had the cluster up and running and the
filesystem mounted on both nodes, and everything seemed peachy. I came
back this morning to find that any attempted operation (e.g. 'ls') on
the shared filesystem came back with "Input/output error", and the
following appeared in the logs:
node 1:
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0: fatal: invalid metadata block
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0: bh = 352612748 (magic)
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0: function = gfs_rgrp_read
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0: file = /usr/src/build/648121-x86_64/BUILD/gfs-kernel-2.6.9-45/smp/src/gfs/rgrp.c, line = 830
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0: time = 1137747789
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0: about to withdraw from the cluster
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0: waiting for outstanding I/O
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0: telling LM to withdraw
Jan 20 04:03:11 knob kernel: lock_dlm: withdraw abandoned memory
Jan 20 04:03:11 knob kernel: GFS: fsid=MAPS:shared_data.0: withdrawn
node 2:
Jan 20 04:02:10 gully kernel: dlm: shared_data: process_lockqueue_reply id c0012 state 0
Jan 20 04:02:10 gully kernel: dlm: shared_data: process_lockqueue_reply id 90376 state 0
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Trying to acquire journal lock...
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Looking at journal...
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Acquiring the transaction lock...
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Replaying journal...
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Replayed 0 of 0 blocks
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: replays = 0, skips = 0, sames = 0
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Journal replayed in 1s
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Done
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: fatal: invalid metadata block
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: bh = 352612748 (magic)
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: function = gfs_rgrp_read
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: file = /usr/src/build/648121-x86_64/BUILD/gfs-kernel-2.6.9-45/smp/src/gfs/rgrp.c, line = 830
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: time = 1137747791
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: about to withdraw from the cluster
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: waiting for outstanding I/O
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: telling LM to withdraw
Jan 20 04:03:11 gully kernel: lock_dlm: withdraw abandoned memory
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: withdrawn
The time coincides with cron.daily firing, so I'm guessing the culprit
is slocate (since that's the only job in cron.daily that would have
touched that filesystem), but I'm not having any luck reproducing it.
The only thing on that filesystem currently is the webroot, and there
were no hits at the time. Any ideas?
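
As a cross-check on the timing, the "time =" values in the GFS messages look like Unix epoch seconds, and they can be converted with a short shell helper. This sketch assumes GNU date (the -d @SECONDS form is a GNU extension); the function name epoch_to_utc is made up for the example.

```shell
#!/bin/sh
# Convert the "time =" epoch value from a GFS log entry to a UTC timestamp.
# Assumes GNU date; -d @SECONDS is not available in every date implementation.
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

epoch_to_utc 1137747789   # value logged on node 1 (knob)
epoch_to_utc 1137747791   # value logged on node 2 (gully)
```

The two values come out as 09:03:09 and 09:03:11 UTC on 2006-01-20. If the hosts log in US Eastern standard time (UTC-5, an assumption about their timezone setting), that lines up with the 04:03:09 and 04:03:11 syslog timestamps above.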
-g
Greg Forte
gforte@xxxxxxxx
IT - User Services
University of Delaware
302-831-1982
Newark, DE
--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster