Saw an interesting and different GFS2 death this morning that I wanted
to pass along in case anyone has insights. We have not seen any of the
"hanging in dlm_posix_lock" problems since fsck'ing early Sunday morning.
In any case, I'm pretty confident that hang is being triggered by the
creation and deletion of ".lock" files within Dovecot. This morning's
failure was something completely different, and it left some potentially
useful debug info in the logs.

Things were running fine when the machine "post2" abruptly died. The
following was found to have been inscribed upon its stone logs:

Nov 5 10:56:28 post2 kernel: original: gfs2_rindex_hold+0x32/0x153 [gfs2]
Nov 5 10:56:28 post2 kernel: pid : 27197
Nov 5 10:56:28 post2 kernel: lock type: 2 req lock state : 3
Nov 5 10:56:28 post2 kernel: new: gfs2_rindex_hold+0x32/0x153 [gfs2]
Nov 5 10:56:28 post2 kernel: pid: 27197
Nov 5 10:56:28 post2 kernel: lock type: 2 req lock state : 3
Nov 5 10:56:28 post2 kernel: G: s:SH n:2/2053b f:s t:SH d:EX/0 l:0 a:0 r:4
Nov 5 10:56:28 post2 kernel: H: s:SH f:H e:0 p:27197 [procmail] gfs2_rindex_hold+0x32/0x153 [gfs2]
Nov 5 10:56:28 post2 kernel: I: n:23/132411 t:8 f:0x00000010
Nov 5 10:56:28 post2 kernel: ----------- [cut here ] --------- [please bite here ] ---------
Nov 5 10:56:32 post2 kernel: Kernel BUG at ...ir/build/BUILD/gfs2-kmod-1.92/_kmod_build_/glock.c:950
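
In case it helps anyone compare notes, here is a minimal Python sketch for splitting the G:/H:/I: lines of a dump like the one above into key/value fields. The helper name parse_glock_line is made up for illustration, and the sketch assumes each log record sits on a single line. It does not try to interpret what the fields mean. Reading the glock number as hexadecimal is also an assumption, though it does line up with the number on the I: line.

```python
import re

# One-letter record tag ("G", "H", "I") followed by the rest of the line.
RECORD_RE = re.compile(r"^([A-Z]):\s+(.*)$")
# A single "key:value" token such as "s:SH" or "n:2/2053b".
FIELD_RE = re.compile(r"([a-z]):(\S*)")


def parse_glock_line(line):
    """Hypothetical helper: split one glock dump line into its parts.

    Returns (tag, fields, trailing), or None if the line is not a
    G:/H:/I: style record. `trailing` collects tokens that are not
    key:value pairs, e.g. the process name and call site on H: lines.
    """
    # Strip the syslog prefix, if any, up to and including "kernel: ".
    body = line.rpartition("kernel: ")[2].strip()
    match = RECORD_RE.match(body)
    if match is None:
        return None
    tag, rest = match.groups()
    fields, trailing = {}, []
    for token in rest.split():
        field = FIELD_RE.fullmatch(token)
        if field:
            fields[field.group(1)] = field.group(2)
        else:
            trailing.append(token)
    return tag, fields, trailing


g = parse_glock_line(
    "Nov 5 10:56:28 post2 kernel: G: s:SH n:2/2053b f:s t:SH d:EX/0 l:0 a:0 r:4"
)
i = parse_glock_line("Nov 5 10:56:28 post2 kernel: I: n:23/132411 t:8 f:0x00000010")

# Assumption: the part of n: after the slash on the G: line is hexadecimal,
# while the same position on the I: line is decimal.
glock_number = int(g[1]["n"].split("/")[1], 16)
inode_number = int(i[1]["n"].split("/")[1])
print(glock_number, inode_number)  # prints "132411 132411"
```

On this dump both numbers come out to 132411, which suggests the G: and I: records describe the same object.
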

The fact that it died in procmail indicates that the failure occurred
while writing mail to someone's Inbox. The system wasn't heavily loaded:
the load averages were a little below 1.0 at the time of the crash.

Also interesting is what happened next. The load average on post1 (the
only other node) shot up over 100, as numerous processes were blocked.
It spent several minutes with an administrative process using 100% of a
CPU -- I believe it was dlm_recoverd though I'm not 100% certain. Then,
just as the load average had come back down to 15-20 and functionality
was returning, it abruptly hung. At this point I reset both cluster
nodes and all was well.

Anyway, if you've seen anything like this or have a clue as to the
cause, I'd love to hear it. It looks like more lock-related glitchiness
in our relatively lock-intensive environment.

Thanks,
Allen
--
Allen Belletti
allen@xxxxxxxxxxxxxxx 404-894-6221 Phone
Industrial and Systems Engineering 404-385-2988 Fax
Georgia Institute of Technology
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster