On Fri, Sep 7, 2012 at 4:57 PM, Bob Peterson <rpeterso@xxxxxxxxxx> wrote:
> The question is: what happened to process 5021 and how did it
> dequeue the glock without granting it to one of the waiters?
> Did process 5021 show up in ps?

The clusters are in a production environment, so we have to minimize the
downtime in these hung states. Right now the task of gathering this debug
info is automated, and at the end of the capture we reset the passive
node to bring the cluster back into service, so it is no longer in that
state.

> If so, I'd dump its call trace
> to see what it's doing. In RHEL6 that's a bit easier, for example,
> cat /proc/5021/stack or some such. In RHEL/Centos 5 you can always
> echo t to /proc/sysrq-trigger and check the console, although if you
> don't have your post_fail_delay set high enough, it can cause your node
> to get fenced during the output.

We will attempt to capture the call trace of the deadlocked processes
during the next hang. Thanks for the suggestion. We will also try to
trace the inode number back to the actual files involved on the system.

A question on the inode numbers in the hangalyzer output. In the glock
dump for node2 you have these lines:

G: s:SH n:2/81523 f:dq t:SH d:UN/0 l:0 a:0 r:4 m:100
 I: n:126/529699 t:4 f:0x10 d:0x00000001 s:3864/3864

From the docs I've read, I understand that the glock field 'n:2/81523'
tells me that 81523 is the inode number in hex (if the type is 2 or 5).
What do the fields in the inode line following the glock mean (at least
the n: field)?

>
> With a quick glance, I can't really see any critical patches missing
> from that kernel, although there are a few possibilities;
> a lot of work has been done since that version.

Yes, I'm pushing to have the clusters upgraded to the latest 5.8 kernel
to rule out the possibility that there is a fix included in there.

> Any chance of moving
> to RHEL or Centos 6.3?
> Debugging these kinds of issues is easier with
> RHEL6 because we have gfs2 kernel-level tracing and such, which doesn't
> exist in the 2.6.18 kernels.

We can't move the production clusters to 6.3 because other product
integration issues prevent that. Would I need more than the updated
kernel in 6.3 to get the extra tracing? Perhaps we could compile the
updated 6.3 kernel for the 5.8 release? There have been a lot of kernel
build changes, so I don't know if that is even possible at this point.

Thank you for the input. We want to be able to gather enough info to
submit a bug report, if it turns out to be that, so the suggestions on
what else to capture are very valuable.

FYI, we only have self-support licenses from Red Hat at this point, which
is why we have not engaged Red Hat support directly yet, but we are
highly motivated to find the problem.

Jason

>
> Regards,
>
> Bob Peterson
> Red Hat File Systems

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
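
P.S. For tracing the inode number back to a file, here is a minimal sketch of how the glock `n:` field could be decoded, based only on my reading described above (type before the slash, number in hex after it, treated as an inode number only for types 2 and 5). The function name and regular expression are my own illustration, not part of any GFS2 tool.

```python
import re

# Matches the "n:<type>/<number>" field of a glock ("G:") line.
# Assumption: the type is decimal and the number is hexadecimal.
GLOCK_N_FIELD = re.compile(r"\bn:(\d+)/([0-9a-fA-F]+)\b")

# Glock types whose number is taken to be an inode number (per the docs
# cited in the message above).
INODE_GLOCK_TYPES = {2, 5}


def glock_inode_number(line):
    """Return the decimal inode number from a 'G:' line, or None."""
    if not line.lstrip().startswith("G:"):
        return None
    match = GLOCK_N_FIELD.search(line)
    if match is None:
        return None
    if int(match.group(1)) not in INODE_GLOCK_TYPES:
        return None
    # Convert the hex glock number to decimal.
    return int(match.group(2), 16)


sample = "G: s:SH n:2/81523 f:dq t:SH d:UN/0 l:0 a:0 r:4 m:100"
print(glock_inode_number(sample))  # 529699
```

For the sample line, hex 81523 converts to 529699, which is the same value as the second number in the `n:126/529699` field of the I: line in this dump. If that decimal value is what the filesystem reports as the inode number, something like `find <mountpoint> -inum 529699` should locate the file.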