Hello again,

Wondering why only one service fails, I tried to track down the root cause. I found that only the files in one directory (where the failing service keeps its files) are corrupted. Trying ls -l in that directory gives the following output:
ls: reading directory .: Input/output error
total 192
?--------- ? ? ? ? ? account_boinc.bakerlab.org_rosetta.xml
?--------- ? ? ? ? ? account_climateprediction.net.xml
?--------- ? ? ? ? ? account_predictor.chem.lsa.umich.edu.xml
?--------- ? ? ? ? ? all_projects_list.xml
-rw-r--r-- 1 boinc boinc 159796 Jun 22 22:47 client_state_prev.xml
?--------- ? ? ? ? ? client_state.xml
-rw-r--r-- 1 boinc boinc 5141 Jun 13 23:21 get_current_version.xml
?--------- ? ? ? ? ? get_project_config.xml
-rw-r--r-- 1 boinc boinc 899 Apr 4 17:06 global_prefs.xml
?--------- ? ? ? ? ? gui_rpc_auth.cfg
?--------- ? ? ? ? ? job_log_boinc.bakerlab.org_rosetta.txt
?--------- ? ? ? ? ? job_log_predictor.chem.lsa.umich.edu.txt
?--------- ? ? ? ? ? lockfile
?--------- ? ? ? ? ? lookup_account.xml
?--------- ? ? ? ? ? lookup_website.html
?--------- ? ? ? ? ? master_boinc.bakerlab.org_rosetta.xml
?--------- ? ? ? ? ? master_climateprediction.net.xml
?--------- ? ? ? ? ? master_predictor.chem.lsa.umich.edu.xml
?--------- ? ? ? ? ? projects
?--------- ? ? ? ? ? sched_reply_boinc.bakerlab.org_rosetta.xml
?--------- ? ? ? ? ? sched_reply_climateprediction.net.xml
?--------- ? ? ? ? ? sched_reply_predictor.chem.lsa.umich.edu.xml
?--------- ? ? ? ? ? sched_request_boinc.bakerlab.org_rosetta.xml
-rw-r--r-- 1 boinc boinc 6766 Jun 22 21:27 sched_request_climateprediction.net.xml
?--------- ? ? ? ? ? sched_request_predictor.chem.lsa.umich.edu.xml
?--------- ? ? ? ? ? slots
?--------- ? ? ? ? ? statistics_boinc.bakerlab.org_rosetta.xml
?--------- ? ? ? ? ? statistics_climateprediction.net.xml
?--------- ? ? ? ? ? statistics_predictor.chem.lsa.umich.edu.xml
?--------- ? ? ? ? ? stderrdae.txt
?--------- ? ? ? ? ? stdoutdae.txt
?--------- ? ? ? ? ? time_stats_log
At the same moment the kernel reports the messages quoted below (attached to the previous e-mail). Trying rm -rf on the directory fails with the same kernel message. Any ideas on how to erase the problematic directory?

Also, the other node (the one on which I do not attempt any action on the file system in question) gives the following messages:

GFS2: fsid=tweety:gfs2-00.0: jid=1: Trying to acquire journal lock...
GFS2: fsid=tweety:gfs2-00.0: jid=1: Busy

After that the file system stays inaccessible for good. Does anyone know why that is?

Thank you all for your time,
T. Kontogiannis

From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Theophanis Kontogiannis

Hello all,
I have a two-node cluster with DRBD running in Primary/Primary. Both nodes are running:

  * Kernel 2.6.18-92.1.6.el5.centos.plus
  * GFS2 fsck 0.1.44
  * cman_tool 2.0.84
  * Cluster LVM daemon version 2.02.32-RHEL5 (2008-03-04), protocol version 0.2.1
  * DRBD version 8.2.6 (api:88)
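By Primary/Primary I mean a dual-primary DRBD setup, i.e. the resource is configured roughly along these lines (a minimal sketch only, not my actual config; the resource name r0 is a placeholder and the per-node sections are omitted):

    resource r0 {
      protocol C;              # synchronous replication, required for dual-primary
      net {
        allow-two-primaries;   # both nodes may hold the resource Primary at once
      }
      # ... per-node "on <hostname> { device/disk/address/meta-disk ... }" sections ...
    }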
After a corruption (the result of a combination of updating and rebooting with the file system mounted, a network interruption during the reboot, and similar issues), I keep getting the following on one node:
<notice> stop on script "BOINC" returned 1 (generic error) Jun 30 00:13:40 tweety1 clurgmgrd[5283]:
<info> Services Initialized Jun 30 00:13:40 tweety1 clurgmgrd[5283]:
<info> State change: Local UP Jun 30 00:13:45 tweety1 clurgmgrd[5283]:
<notice> Starting stopped service service:BOINC-t1 Jun 30 00:13:45 tweety1 kernel: GFS2:
fsid=tweety:gfs2-00.0: fatal: invalid metadata block Jun 30 00:13:45 tweety1 kernel: GFS2:
fsid=tweety:gfs2-00.0: bh = 21879736 (magic number) Jun 30 00:13:45 tweety1 kernel: GFS2:
fsid=tweety:gfs2-00.0: function = gfs2_meta_indirect_buffer, file =
fs/gfs2/meta_io.c, line = 332 Jun 30 00:13:45 tweety1 kernel: GFS2:
fsid=tweety:gfs2-00.0: about to withdraw this file system Jun 30 00:13:45 tweety1 kernel: GFS2:
fsid=tweety:gfs2-00.0: telling LM to withdraw Jun 30 00:13:46 tweety1 clurgmgrd[5283]:
<notice> Service service:BOINC-t1 started Jun 30 00:13:46 tweety1 kernel: GFS2:
fsid=tweety:gfs2-00.0: withdrawn Jun 30 00:13:46 tweety1 kernel: Jun 30 00:13:46 tweety1 kernel: Call Trace: Jun 30 00:13:46 tweety1 kernel:
[<ffffffff88629146>] :gfs2:gfs2_lm_withdraw+0xc1/0xd0 Jun 30 00:13:46 tweety1 kernel:
[<ffffffff800639de>] __wait_on_bit+0x60/0x6e Jun 30 00:13:46 tweety1 kernel:
[<ffffffff80014eec>] sync_buffer+0x0/0x3f Jun 30 00:13:46 tweety1 kernel:
[<ffffffff80063a58>] out_of_line_wait_on_bit+0x6c/0x78 Jun 30 00:13:46 tweety1 kernel:
[<ffffffff8009d1bb>] wake_bit_function+0x0/0x23 Jun 30 00:13:46 tweety1 kernel:
[<ffffffff8863af7f>] :gfs2:gfs2_meta_check_ii+0x2c/0x38 Jun 30 00:13:46 tweety1 kernel:
[<ffffffff8862ca06>] :gfs2:gfs2_meta_indirect_buffer+0x104/0x15e Jun 30 00:13:46 tweety1 kernel:
[<ffffffff8862795a>] :gfs2:gfs2_inode_refresh+0x22/0x2ca Jun 30 00:13:46 tweety1 kernel:
[<ffffffff8009d1bb>] wake_bit_function+0x0/0x23 Jun 30 00:13:46 tweety1 kernel:
[<ffffffff88626d9c>] :gfs2:inode_go_lock+0x29/0x57 Jun 30 00:13:47 tweety1 kernel:
[<ffffffff88625f04>] :gfs2:glock_wait_internal+0x1d4/0x23f Jun 30 00:13:47 tweety1 kernel:
[<ffffffff8862611d>] :gfs2:gfs2_glock_nq+0x1ae/0x1d4 Jun 30 00:13:47 tweety1 kernel:
[<ffffffff88632053>] :gfs2:gfs2_lookup+0x58/0xa7 Jun 30 00:13:47 tweety1 kernel:
[<ffffffff8863204b>] :gfs2:gfs2_lookup+0x50/0xa7 Jun 30 00:13:47 tweety1 kernel:
[<ffffffff80022663>] d_alloc+0x174/0x1a9 Jun 30 00:13:47 tweety1 kernel:
[<ffffffff8000cbb4>] do_lookup+0xd3/0x1d4 Jun 30 00:13:47 tweety1 kernel: [<ffffffff80009f73>]
__link_path_walk+0xa01/0xf42 Jun 30 00:13:47 tweety1 kernel:
[<ffffffff8861fd37>] :gfs2:compare_dents+0x0/0x57 Jun 30 00:13:47 tweety1 kernel:
[<ffffffff8000e782>] link_path_walk+0x5c/0xe5 Jun 30 00:13:47 tweety1 kernel:
[<ffffffff88624d6f>] :gfs2:gfs2_glock_put+0x26/0x133 After that, the machine freezes completely.
The only way to recover is to power-cycle / reset the node.

"gfs2_fsck -vy /dev/mapper/vg0-data0" ends (it does not abort; it just looks like it finishes normally) with:

Pass5 complete
Writing changes to disk
gfs2_fsck: buffer still held for block: 21875415 (0x14dcad7)

After remounting the file system and starting a service that keeps its files on this GFS2 file system, the kernel crashes again with the same message and the node freezes up.

Unfortunately, due to bad handling on my part, I failed to invalidate the problematic node in DRBD, which would have made it the sync target (and in theory would have solved the problem, since the good node would then have synced the bad one). Instead I made the bad node the sync source, and now both nodes have the same issue :-( (the commands I have in mind are sketched at the end of this message).

Any ideas on how I can resolve this issue?

Sincerely,
Theophanis Kontogiannis
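For reference, the recovery path I meant above but did not take (invalidating the bad node so that it becomes the sync target of a full resync from the healthy peer) would look roughly like the sketch below. The resource name r0 is a placeholder for the actual DRBD resource, and depending on the DRBD version the resource may first need to be demoted with "drbdadm secondary r0":

    # ON THE BAD NODE ONLY, with the GFS2 file system unmounted and the
    # cluster services that use it stopped:
    drbdadm invalidate r0     # discard this node's copy; it becomes SyncTarget
                              # and receives a full resync from the peer

    # Watch the resync progress:
    cat /proc/drbd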
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster