Cluster failure, dlm overload

"wsfax alu.es" <wsfax.alu.es@xxxxxxxxx> · Fri, 6 Apr 2012 00:19:15 +0200

Hi,

First of all, thanks for your time.

A five-node cluster that shares several GFS filesystems is suffering
complete hangs of filesystem activity, roughly once a week. These hangs
started several weeks ago, after more than three years in service.
The cluster recovers only after all cluster nodes are restarted ;-)

When one of these hangs occurs, we can see the dlm send and receive
processes with a high level of CPU consumption, and network traffic is
also about ten times the normal level.

A capture (Wireshark) of the network traffic on the DLM port shows
thousands of messages per second. In particular, every "request message"
is answered with a "request reply" where errno=EBADR. Lookup messages
seem OK.
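
To put a number on the message rate during a hang versus normal operation, a small script along these lines could be used. It is only a sketch: it assumes the capture timestamps have been exported one per line as epoch seconds (for example with something like `tshark -r capture.pcap -T fields -e frame.time_epoch`), and the file name and function names are made up for illustration.

```python
from collections import Counter


def messages_per_second(epoch_timestamps):
    """Group timestamps (epoch seconds, as str or float) by whole second.

    Returns a dict mapping each second to the number of messages seen in it,
    ordered by time.
    """
    counts = Counter(int(float(ts)) for ts in epoch_timestamps)
    return dict(sorted(counts.items()))


def peak_rate(epoch_timestamps):
    """Return the highest per-second message count, or 0 for an empty capture."""
    per_second = messages_per_second(epoch_timestamps)
    return max(per_second.values()) if per_second else 0


def report(path):
    """Print per-second counts and the peak for a timestamp file.

    `path` is a hypothetical text file with one epoch timestamp per line.
    """
    with open(path) as fh:
        stamps = [line.strip() for line in fh if line.strip()]
    for second, count in messages_per_second(stamps).items():
        print(second, count)
    print("peak messages/second:", peak_rate(stamps))
```

Running `report("dlm_times.txt")` on one export taken during a hang and another taken during normal operation would give two peak figures to compare.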

The cluster software is somewhat outdated (the Red Hat 2.6.18 version), and upgrading it is not easy.

Any suggestion is welcome. 

Kind regards,
ALU
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster