Re: dlm: message size from 3 too big

LM,

That's interesting to note.  I know that there have been some issues with that particular Ethernet switch, but I was assured they had been resolved.  That gives me somewhere to start looking, at any rate.

Thanks,

James

On May 20, 2010, at 11:09 PM, lm chen wrote:

Hi,

   I've been stuck working on gfs2, so I haven't spent much time on the gfs code.

   It seems this was caused by a network break, with messages piling up until they exceeded dlm_config.buffer_size:


                if (msglen > dlm_config.buffer_size) {
                        printk("dlm: message size from %d too big %d(pkt len=%d)\n", nodeid, msglen, len);
                        khexdump((const unsigned char *) msg, len);
                        break;
                }
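To tie that check to the log further down, here is a rough sketch that decodes the first 16 bytes of each hex dump. The offsets are only inferred by matching the dump bytes against the values the kernel printed (34560, 1024, 1711276032); they are not taken from the DLM header definition. The 4096 limit is an assumption based on reading "limit=00001000" in the midcomms line as a hexadecimal buffer size. All names below are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First 16 bytes of the "message size ... too big" dump, copied from the log. */
const uint8_t too_big_dump[16] = {
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x87,
    0x00, 0x00, 0x00, 0x23, 0x00, 0x00, 0x00, 0x00
};

/* First 16 bytes of the "bad header version 0" dump, copied from the log. */
const uint8_t bad_header_dump[16] = {
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x04,
    0x00, 0x00, 0x00, 0x66, 0x00, 0x00, 0x00, 0x00
};

/* ASSUMPTION: offsets inferred from the log output, not from DLM source. */
enum { LEN_OFFSET = 6, LKID_OFFSET = 8 };

/* ASSUMPTION: "limit=00001000" is hex and equals dlm_config.buffer_size. */
enum { ASSUMED_BUFFER_SIZE = 4096 };

/* Read a 16-bit little-endian value at byte offset off. */
uint16_t read_le16(const uint8_t *buf, size_t buflen, size_t off)
{
    assert(off + 2 <= buflen);
    return (uint16_t)(buf[off] | (buf[off + 1] << 8));
}

/* Read a 32-bit little-endian value at byte offset off. */
uint32_t read_le32(const uint8_t *buf, size_t buflen, size_t off)
{
    assert(off + 4 <= buflen);
    return (uint32_t)buf[off]
         | ((uint32_t)buf[off + 1] << 8)
         | ((uint32_t)buf[off + 2] << 16)
         | ((uint32_t)buf[off + 3] << 24);
}

/* Same comparison as the quoted check, against the assumed limit. */
int msglen_too_big(uint16_t msglen)
{
    return msglen > ASSUMED_BUFFER_SIZE;
}
```

With these helpers, read_le16(too_big_dump, 16, LEN_OFFSET) gives 0x8700 = 34560, which is over the assumed limit and matches the "too big 34560" line. The second dump gives a length of 1024 and read_le32(bad_header_dump, 16, LKID_OFFSET) gives 0x66000000 = 1711276032, matching the midcomms line. A length field like that in a 386-byte packet looks more like the receiver reading a header from the wrong place in the stream than like a genuinely oversized message, but that is a guess from the numbers alone.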

If anyone is interested, take a look at the lowcomms code to see how its flow control works on the sending peer (taking dlm_config.buffer_size into account).


/**
 * Check status of a cluster service
 *
 * @param svcName       Service name to check.
 * @return              FORWARD, FAIL, 0
 */
int
svc_status(char *svcName)
{
        void *lockp = NULL;
        rg_state_t svcStatus;

        if (rg_lock(svcName, &lockp) < 0) {
                clulog(LOG_ERR, "#48: Unable to obtain cluster lock: %s\n",
                       strerror(errno));
                return FAIL;
        }



2010/5/21 James Chamberlain <jamesc@xxxxxxx>
Hi all,

I've got a three node cluster running CentOS 4.8, GFS-6.1.19-1.el4_8 (GFS 1 filesystems), kernel 2.6.9-89.0.19.ELsmp.  I've seen messages like those below a couple times in the last couple weeks.  Node 3 doesn't go down, so it doesn't get fenced; but DLM is unable to negotiate locks, so the load average on each node spikes and the cluster can't serve anything out through NFS.  Has anyone seen anything like this? Any idea what to do about it?  Shooting node 3 in the head has caused the cluster to recover, but I'd like to know how to fix it rather than work around it.

Thanks,

James

[[Operating normally prior to this point]]
May 20 04:52:50 s12n01 clurgmgrd[7467]: <err> #48: Unable to obtain cluster lock: Connection timed out
May 20 04:53:41 s12n03 clurgmgrd[7476]: <err> #48: Unable to obtain cluster lock: Connection timed out
May 20 04:54:09 s12n03 kernel: dlm: message size from 3 too big 34560(pkt len=386)
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 87-00 00 00 23 00 00 00 00
May 20 04:54:09 s12n03 kernel: 30 b0 d2 84 02 01 00 00-30 b0 d2 84 02 01 00 00
May 20 04:54:09 s12n03 kernel: 8e 64 13 80 ff ff ff ff-f0 7d ee 81 02 01 00 00
May 20 04:54:09 s12n03 kernel: f0 7d ee 81 02 01 00 00-02 02 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: a4 de eb 81 02 01 00 00-00 40 01 00 00 01 00 00
May 20 04:54:09 s12n03 kernel: b8 7c ee 81 02 01 00 00-ff ff ff ff ff ff ff ff
May 20 04:54:09 s12n03 kernel: 82 01 00 00 00 00 00 00-90 de eb 81 02 01 00 00
May 20 04:54:09 s12n03 kernel: 00 10 00 00 00 00 00 00-00 00 00 00 00 01 00 00
May 20 04:54:09 s12n03 kernel: b7 6d db b6 6d db b6 6d-be cb 36 a0 ff ff ff ff
May 20 04:54:09 s12n03 kernel: 60 fa 06 01 00 01 00 00-82 01 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 82 d1 0d 2a 00 01 00 00-7e 0e 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 d0 0d 2a 00 01 00 00-00 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 03 00 00 00
May 20 04:54:09 s12n03 kernel: 68 7e ee 81 02 01 00 00-02 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-90 de eb 81 02 01 00 00
May 20 04:54:09 s12n03 kernel: 01 00 00 00 00 00 00 00-10 50 38 a0 ff ff ff ff
May 20 04:54:09 s12n03 kernel: fc ff ff ff 00 00 00 00-98 3c eb 81 02 01 00 00
May 20 04:54:09 s12n03 kernel: b0 c9 14 80 ff ff ff ff-b2 d1 36 a0 ff ff ff ff
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-91 d0 36 a0 ff ff ff ff
May 20 04:54:09 s12n03 kernel: a8 3c eb 81 02 01 00 00-87 c9 14 80 ff ff ff ff
May 20 04:54:09 s12n03 kernel: ff ff ff ff ff ff ff ff-98 3c eb 81 02 01 00 00
May 20 04:54:09 s12n03 kernel: 30 3c eb 81 02 01 00 00-c0 86 f2 af 00 01 00 00
May 20 04:54:09 s12n03 kernel: 12
May 20 04:54:09 s12n03 kernel: 02
May 20 04:54:09 s12n03 kernel: dlm: midcomms: bad header version 0
May 20 04:54:09 s12n03 kernel: dlm: midcomms: cmd=0, flags=0, length=1024, lkid=1711276032, lockspace=0
May 20 04:54:09 s12n03 kernel: dlm: midcomms: base=000001002a0dd000, offset=1024, len=810, ret=1024, limit=00001000 newbuf=0
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 04-00 00 00 66 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 01 00 01 00
May 20 04:54:09 s12n03 kernel: 03 00 72 00 c0 00 a4 23-17 00 00 01 6a 01 c9 26
May 20 04:54:09 s12n03 kernel: 00 00 00 00 08 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 84 34 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 ff 03 01 16-19 70 00 00 ff 52 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 a6 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 01 00
May 20 04:54:09 s12n03 kernel: 01 00 03 00 72 00 3d 00-7a 2a 17 00 00 01 13 00
May 20 04:54:09 s12n03 kernel: 42 2c 00 00 00 00 08 00-00 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 84 34
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 ff 03-01 16 19 70 00 00 8c 0a
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 37 00 00 00 53
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 01 00 01 00 03 00 72 00-49 03 b5 26 17 00 00 01
May 20 04:54:09 s12n03 kernel: 56 01 69 26 00 00 00 00-08 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 84 34 00 00 00 00 00 00-ff 03 01 16 19 70 00 00
May 20 04:54:09 s12n03 kernel: dc fc 00 00 00 00 00 00-00 00 00 00 00 0f 00 00
May 20 04:54:09 s12n03 kernel: 00 6d 00 00 00 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 01 00 01 00 03 00-72 00 75 00 b7 26 17 00
May 20 04:54:09 s12n03 kernel: 00 01 93 02 4f 26 00 00-00 00 08 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 84 34 00 00 00 00-00 00 ff 03 01 16 19 70
May 20 04:54:09 s12n03 kernel: 00 00 62 a7 00 00 00 01-00 00 00 00 00 00 00 50
May 20 04:54:09 s12n03 kernel: 00 00 00 76 00 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 01 00 01 00-03 00 72 00 8e 02 2d 26
May 20 04:54:09 s12n03 kernel: 17 00 00 01 81 03 85 27-00 00 00 00 08 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 84 34 00 00-00 00 00 00 ff 03 01 16
May 20 04:54:09 s12n03 kernel: 19 70 00 00 6c a4 00 00-00 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 44 00 00 00 49 00 00-00 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 01 00-01 00 03 00 72 00 5b 00
May 20 04:54:09 s12n03 kernel: f0 21 17 00 00 01 3a 02-fb 2b 00 00 00 00 08 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 84 34-00 00 00 00 00 00 ff 03
May 20 04:54:09 s12n03 kernel: 01 16 19 70 00 00 ff 74-00 00 00 00 00 00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 83-00 00 00 00 00 00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-01 00 01 00 03 00 72 00
May 20 04:54:10 s12n03 kernel: 9a 02 18 23 17 00 00 01-b9 02 f5 2d 00 00 00 00
May 20 04:54:10 s12n03 kernel: 08 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-84 34 00 00 00 00 00 00
May 20 04:54:10 s12n03 kernel: ff 03 01 16 19 70 00 00-ff 83 00 00 00 00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 75 00 00 00 00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 01 00 01 00 03 00
May 20 04:54:10 s12n03 kernel: 72 00 1b 01 86 2a 17 00-00 01 56 02 d8 28 00 00
May 20 04:54:10 s12n03 kernel: 00 00 08 00 00 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 84 34 00 00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 ff 03 01 16 19 70-00 00 fe 82 00 00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 64 00 00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 01 00 01 00
May 20 04:54:10 s12n03 kernel: 03 00 72 00 c6 01 fc 27-17 00 00 01 e0 00 f6 28
May 20 04:54:10 s12n03 kernel: 00 00 00 00 08 00 00 00-00 00 00 00 00 00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 84 34 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00
May 20 04:54:10 s12n03 kernel: ff 03 01 16
May 20 04:54:10 s12n03 kernel: 19 70 00 00
May 20 04:54:10 s12n03 kernel: f7
May 20 04:54:10 s12n03 kernel: a3
May 20 04:54:10 s12n03 kernel: 00
May 20 04:54:10 s12n03 kernel: 00
May 20 04:54:10 s12n03 kernel: dlm: lowcomms: addr=000001002a0dd000, base=0, len=1834, iov_len=3710, iov_base[0]=000001002a0dd72a, read=1448
May 20 04:54:50 s12n01 clurgmgrd[7467]: <err> #50: Unable to obtain cluster lock: Connection timed out
May 20 04:56:41 s12n03 clurgmgrd[7476]: <err> #50: Unable to obtain cluster lock: Connection timed out
May 20 05:02:13 s12n02 clurgmgrd[7527]: <err> #48: Unable to obtain cluster lock: Connection timed out
May 20 05:05:13 s12n02 clurgmgrd[7527]: <err> #50: Unable to obtain cluster lock: Connection timed out
May 20 05:08:13 s12n02 clurgmgrd[7527]: <err> #48: Unable to obtain cluster lock: Connection timed out
[...]

When I say the load spikes, this is what I mean:

Linux 2.6.9-89.0.19.ELsmp (s12n01)      05/20/2010

12:00:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
[...]
04:00:01 AM         0       820      0.18      1.84      3.66
04:10:01 AM         0       820      2.72      4.03      4.04
04:20:01 AM         0       820      3.57      4.62      4.64
04:30:01 AM         0       820     11.42      7.35      5.44
04:40:01 AM         0       820      4.20      7.51      7.10
04:50:01 AM         0       820      1.69      2.18      4.33
05:00:01 AM         0       820    513.68    406.40    205.61
05:10:01 AM         0       820    530.02    513.44    360.00
05:20:01 AM         0       820    530.06    527.83    440.93
05:30:01 AM         0       820    530.12    529.75    483.33
05:40:01 AM         0       820    530.07    530.04    505.57
05:50:01 AM         0       820    530.08    530.05    517.21
06:00:01 AM         0       820    530.02    530.03    523.29
[...]

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster

