Re: GFS2/DLM deadlock

Jason Henderson <sonredhen@xxxxxxxxx> · Sat, 8 Sep 2012 09:08:19 -0400

On Fri, Sep 7, 2012 at 4:57 PM, Bob Peterson <rpeterso@xxxxxxxxxx> wrote:

> The question is: what happened to process 5021 and how did it
> dequeue the glock without granting it to one of the waiters?
> Did process 5021 show up in ps?

The clusters are in a production environment so we have to minimize
the downtime in these hung states. Right now the task of gathering
this debug info is automated and at the end of the capture we reset
the passive node to bring the cluster back into service so it is no
longer in that state.

> If so, I'd dump its call trace
> to see what it's doing. In RHEL6 that's a bit easier, for example,
> cat /proc/5021/stack or some such. In RHEL/Centos 5 you can always
> echo t to /proc/sysrq-trigger and check the console, although if you
> don't have your post_fail_delay set high enough, it can cause your node
> to get fenced during the output.

We will attempt to capture the call trace of the deadlocked processes
during the next hang. Thanks for the suggestion. We will also try to
trace the inode number back to the actual files involved on the
system.
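
A minimal sketch of how both captures could be added to an automated
collection script. It assumes Python 3 and a kernel that exposes
/proc/<pid>/stack (the RHEL6 case mentioned above; on the 2.6.18 kernels
the sysrq route would still be needed). The function names and the
proc_root parameter are hypothetical, chosen only for illustration, and
walking a GFS2 mount while it is hung may itself block.

```python
import os
from typing import List, Optional


def read_kernel_stack(pid: int, proc_root: str = "/proc") -> Optional[str]:
    """Return the kernel stack text for `pid`, or None if it cannot be read.

    Assumption: the kernel provides <proc_root>/<pid>/stack; reading it
    normally requires root.
    """
    path = os.path.join(proc_root, str(pid), "stack")
    try:
        with open(path, "r") as fh:
            return fh.read()
    except OSError:
        # Process already gone, file not provided by this kernel, or no permission.
        return None


def find_paths_by_inode(mount_point: str, inode: int) -> List[str]:
    """Return every path below `mount_point` whose st_ino equals `inode`.

    Inode numbers are only unique per filesystem, so entries that live on a
    different device than `mount_point` are skipped and not descended into.
    """
    root_dev = os.lstat(mount_point).st_dev
    matches = []
    for dirpath, dirnames, filenames in os.walk(mount_point):
        keep = []
        for name in dirnames:
            full = os.path.join(dirpath, name)
            try:
                st = os.lstat(full)
            except OSError:
                continue
            if st.st_dev != root_dev:
                continue  # another filesystem is mounted here
            keep.append(name)
            if st.st_ino == inode:
                matches.append(full)
        dirnames[:] = keep  # prune the walk in place
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.lstat(full)
            except OSError:
                continue
            if st.st_dev == root_dev and st.st_ino == inode:
                matches.append(full)
    return matches
```

Something like read_kernel_stack(5021) would be called before the node
is reset, and find_paths_by_inode() on the GFS2 mount point afterwards,
once the filesystem is responsive again.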

A question on the inode numbers in the hangalyzer output.

In the glock dump for node2 you have these lines:
G:  s:SH n:2/81523 f:dq t:SH d:UN/0 l:0 a:0 r:4 m:100
    I: n:126/529699 t:4 f:0x10 d:0x00000001 s:3864/3864

From the docs I've read, I understand that the glock field 'n:2/81523'
tells me that 81523 is the inode number in hex (if the type is 2 or
5).
What do the fields in the inode line following the glock mean (at
least the n: field)?
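
For reference, a small sketch of the hex conversion described above,
assuming Python 3. It assumes the n: field of a G: line is
<type>/<number> with the type in decimal and the number in hex, as the
docs suggest; parse_glock_name and GlockName are hypothetical names
used only for illustration.

```python
import re
from typing import NamedTuple


class GlockName(NamedTuple):
    gl_type: int   # glock type, e.g. 2 or 5 for inode-related glocks
    number: int    # glock number, converted from hex


# Assumption: "n:<decimal type>/<hex number>" on a G: line.
_GLOCK_N = re.compile(r"\bn:(\d+)/([0-9a-fA-F]+)")


def parse_glock_name(line: str) -> GlockName:
    """Extract the type and number from the n: field of a G: line."""
    m = _GLOCK_N.search(line)
    if m is None:
        raise ValueError("no n:<type>/<number> field in line: %r" % line)
    return GlockName(int(m.group(1)), int(m.group(2), 16))


glock_line = "G:  s:SH n:2/81523 f:dq t:SH d:UN/0 l:0 a:0 r:4 m:100"
name = parse_glock_name(glock_line)
print(name.gl_type, name.number, hex(name.number))  # 2 529699 0x81523
```

Hex 81523 is 529699 in decimal, which is the same value as the second
half of the n: field on the I: line above. That suggests, but does not
confirm, that the I: line repeats the same number in decimal.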

>
> With a quick glance, I can't really see any critical patches missing
> from that kernel, although there are a few possibilities;
> a lot of work has been done since that version.

Yes, I'm pushing to have the clusters upgraded to the latest 5.8
kernel in case a fix for this is already included there.

> Any chance of moving
> to RHEL or Centos 6.3? Debugging these kinds of issues is easier with
> RHEL6 because we have gfs2 kernel-level tracing and such, which doesn't
> exist in the 2.6.18 kernels.

We can't move the production clusters to 6.3 because of other product
integration issues.
Would I need more than the updated kernel in 6.3 to get the extra
tracing? Perhaps we could compile the updated 6.3 kernel for the 5.8
release? There have been a lot of kernel build changes so I don't know
if that is even possible at this point.

Thank you for the input. We want to be able to gather enough info to
submit a bug report, if it turns out to be a bug, so the suggestions on
what else to capture are very valuable. FYI, we only have self-support
licenses from Red Hat at this point, which is why we have not engaged
Red Hat support directly yet, but we are highly motivated to find the
problem.

Jason

>
> Regards,
>
> Bob Peterson
> Red Hat File Systems

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster