On Thu, Aug 08, 2013 at 02:23:11AM +0200, Valerio Pachera wrote: > Auch! This time things went bad: > cluster has stopped. > > But it seem not to be related to the recover of the "small" disk I > inserted back (/mnt/sheep/dsk01): it didn't get full. > It seems something about the command attr I run yesterday. > > sheepdog001 > /dev/mapper/vg00-sheepdsk01 167G 70G > 97G 43% /mnt/sheep/dsk01 > /dev/sdb1 466G 367G > 100G 79% /mnt/sheep/dsk02 > /dev/sdc1 1,9T 1,2T > 679G 64% /mnt/sheep/dsk03 > > What I find unusual is a call trace on node 'sheepdog001' (the other > nodes have no call trace). > I can see that from dmesg, /var/log/messages and /var/log/syslog > [note: 1][note: 4] > > On all nodes I find corsync messages in /var/log/syslog, all at the > same time 21:00 [note: 2] > In sheepdog001 there's some extra message right after the call trace [note: 4] > > I report also what's int the sheep.log file of the nodes [note: 3] > > I checked the filesystem of /mnt/sheep/dsk01 device and seems fine [note: 5] > > Checking the smart of the disk I notice a value that might means problems: > 183 Runtime_Bad_Block > > I'm running a full test to check the disk status. > > [note: 1] > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725454] attr: page > allocation failure: order:4, mode:0xd0 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725468] Pid: 14171, comm: > attr Not tainted 3.2.0-4-amd64 #1 Debian 3.2.46-1 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725474] Call Trace: > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725491] > [<ffffffff810b88d7>] ? warn_alloc_failed+0x11a/0x12d > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725504] > [<ffffffff810c9c7c>] ? next_zone+0x25/0x27 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725515] > [<ffffffff810bb605>] ? __alloc_pages_nodemask+0x704/0x7aa > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725528] > [<ffffffff810eaa2d>] ? kmem_getpages+0x4c/0xd9 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725538] > [<ffffffff810ebdde>] ? fallback_alloc+0x13e/0x1e2 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725548] > [<ffffffff8111312c>] ? getxattr+0x84/0x126 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725555] > [<ffffffff810ec2e5>] ? __kmalloc+0xba/0x112 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725564] > [<ffffffff810364e8>] ? should_resched+0x5/0x23 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725573] > [<ffffffff8111312c>] ? getxattr+0x84/0x126 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725582] > [<ffffffff810ead50>] ? kmem_cache_free+0x2d/0x69 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725592] > [<ffffffff81105943>] ? user_path_at_empty+0x53/0x7b > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725603] > [<ffffffff811138d3>] ? sys_lgetxattr+0x3f/0x5c > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725612] > [<ffffffff8134f0f5>] ? page_fault+0x25/0x30 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725620] > [<ffffffff81353b52>] ? system_call_fastpath+0x16/0x1b > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725626] Mem-Info: > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725630] Node 0 DMA per-cpu: > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725637] CPU 0: hi: > 0, btch: 1 usd: 0 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725643] CPU 1: hi: > 0, btch: 1 usd: 0 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725648] Node 0 DMA32 per-cpu: > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725655] CPU 0: hi: > 186, btch: 31 usd: 0 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725660] CPU 1: hi: > 186, btch: 31 usd: 0 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725665] Node 0 Normal per-cpu: > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725671] CPU 0: hi: > 186, btch: 31 usd: 0 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725676] CPU 1: hi: > 186, btch: 31 usd: 0 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725690] active_anon:12964 > inactive_anon:12371 isolated_anon:0 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725693] > active_file:169712 inactive_file:547532 isolated_file:0 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725697] unevictable:949 > dirty:1378 writeback:0 unstable:0 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725700] free:45231 > slab_reclaimable:147488 slab_unreclaimable:6686 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725703] mapped:3632 > shmem:1940 pagetables:663 bounce:0 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725710] Node 0 DMA > free:15880kB min:264kB low:328kB high:396kB active_anon:0kB > inactive_anon:0kB active_file:0kB inactive_file:20kB unevictable:0kB > isolated(anon):0kB isolated(file):0kB present:15684kB mlocked:0kB > dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:8kB > slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB > bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725739] lowmem_reserve[]: > 0 3383 3888 3888 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725748] Node 0 DMA32 > free:152592kB min:58576kB low:73220kB high:87864kB active_anon:45376kB > inactive_anon:19188kB active_file:604448kB inactive_file:2116456kB > unevictable:88kB isolated(anon):0kB isolated(file):0kB > present:3464992kB mlocked:88kB dirty:4388kB writeback:0kB > mapped:2364kB shmem:820kB slab_reclaimable:415320kB > slab_unreclaimable:7688kB kernel_stack:224kB pagetables:716kB > unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 > all_unreclaimable? no > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725779] lowmem_reserve[]: > 0 0 505 505 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725788] Node 0 Normal > free:12452kB min:8740kB low:10924kB high:13108kB active_anon:6480kB > inactive_anon:30296kB active_file:74400kB inactive_file:73652kB > unevictable:3708kB isolated(anon):0kB isolated(file):0kB > present:517120kB mlocked:3708kB dirty:1124kB writeback:0kB > mapped:12164kB shmem:6940kB slab_reclaimable:174624kB > slab_unreclaimable:19056kB kernel_stack:1208kB pagetables:1936kB > unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 > all_unreclaimable? no > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725818] lowmem_reserve[]: 0 0 0 0 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725826] Node 0 DMA: 2*4kB > 2*8kB 1*16kB 3*32kB 4*64kB 3*128kB 3*256kB 2*512kB 3*1024kB 3*2048kB > 1*4096kB = 15880kB > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725848] Node 0 DMA32: > 25550*4kB 5749*8kB 11*16kB 4*32kB 0*64kB 0*128kB 0*256kB 0*512kB > 0*1024kB 2*2048kB 0*4096kB = 152592kB > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725869] Node 0 Normal: > 2051*4kB 19*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB > 0*2048kB 1*4096kB = 12452kB > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725891] 721148 total > pagecache pages > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725895] 1311 pages in swap cache > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725901] Swap cache stats: > add 18691, delete 17380, find 3363212/3363399 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725906] Free swap = 7760204kB > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725910] Total swap = 7815584kB > Aug 7 15:34:05 sheepdog001 kernel: [3543736.785649] 1015792 pages RAM > Aug 7 15:34:05 sheepdog001 kernel: [3543736.785656] 32718 pages reserved > Aug 7 15:34:05 sheepdog001 kernel: [3543736.785664] 636786 pages shared > Aug 7 15:34:05 sheepdog001 kernel: [3543736.785671] 298954 pages non-shared > Aug 7 16:28:04 sheepdog001 rsyslogd-2177: imuxsock begins to drop > messages from pid 23853 due to rate-limiting Looks to me not a big problem, just allocation failed. > > [note: 2] > Aug 7 21:00:07 sheepdog004 corosync[4365]: [TOTEM ] Retransmit > List: 757f 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 758a 758b > 758c 758d 758e 758f 7590 7591 7592 > Aug 7 21:00:07 sheepdog004 corosync[4365]: [TOTEM ] Retransmit > List: 757f 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 758a 758b > 758c 758d 758e 758f 7590 7591 7592 7593 7594 7595 7596 7597 7598 7599 > 759a 759b 759c > ... Hello corosync guys, is this normal? Sheep daemon detected a network partition. > [note: 3] > sheepdog001 > ... > Aug 07 21:09:53 [main] recover_object_main(626) done:208768 > count:414381, oid:c8d13200032c4f > Aug 07 21:09:53 [main] recover_object_main(626) done:208769 > count:414381, oid:c8d13600033c77 > Aug 07 21:09:54 [main] cdrv_cpg_confchg(602) PANIC: Network partition > is detected It looks like a corosync problem. > Aug 07 21:09:54 [main] crash_handler(181) sheep exits unexpectedly (Aborted). > Aug 07 21:09:54 [main] sd_backtrace(834) sheep.c:183: crash_handler > Aug 07 21:09:54 [main] sd_backtrace(848) > /lib/x86_64-linux-gnu/libpthread.so.0(+0xf02f) [0x7ff842f2102f] > Aug 07 21:09:54 [main] sd_backtrace(848) > /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x34) [0x7ff84252d474] > Aug 07 21:09:54 [main] sd_backtrace(848) > /lib/x86_64-linux-gnu/libc.so.6(abort+0x17f) [0x7ff8425306ef] > Aug 07 21:09:54 [main] sd_backtrace(834) corosync.c:602: cdrv_cpg_confchg > Aug 07 21:09:54 [main] sd_backtrace(848) > /usr/lib/libcpg.so.4(cpg_dispatch+0x594) [0x7ff842a8cd74] > Aug 07 21:09:54 [main] sd_backtrace(834) corosync.c:744: corosync_handler > Aug 07 21:09:54 [main] sd_backtrace(834) event.c:209: do_event_loop > Aug 07 21:09:54 [main] sd_backtrace(834) sheep.c:795: main > Aug 07 21:09:54 [main] sd_backtrace(848) > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfc) > [0x7ff842519eac] > Aug 07 21:09:54 [main] sd_backtrace(848) sheep() [0x405498] > Aug 07 21:09:54 [main] __dump_stack_frames(744) cannot find gdb > Aug 07 21:09:54 [main] __sd_dump_variable(694) cannot find gdb > Aug 07 21:09:54 [main] crash_handler(487) sheep pid 23853 exited unexpectedly. > > sheepdog002 > ... > Aug 07 08:17:17 [gway 26470] wait_forward_request(176) poll timeout 1, > disks of some nodes or network is busy. Going to poll-wait again > Aug 07 08:17:17 [gway 26476] wait_forward_request(176) poll timeout 1, > disks of some nodes or network is busy. Going to poll-wait again > Aug 07 21:09:54 [main] cdrv_cpg_confchg(602) PANIC: Network partition > is detected > Aug 07 21:09:54 [main] crash_handler(181) sheep exits unexpectedly (Aborted). > Aug 07 21:09:54 [main] sd_backtrace(834) sheep.c:183: crash_handler > Aug 07 21:09:54 [main] sd_backtrace(848) > /lib/x86_64-linux-gnu/libpthread.so.0(+0xf02f) [0x7fd1e9afd02f] > Aug 07 21:09:54 [main] sd_backtrace(848) > /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x34) [0x7fd1e9109474] > Aug 07 21:09:54 [main] sd_backtrace(848) > /lib/x86_64-linux-gnu/libc.so.6(abort+0x17f) [0x7fd1e910c6ef] > Aug 07 21:09:54 [main] sd_backtrace(834) corosync.c:602: cdrv_cpg_confchg > Aug 07 21:09:54 [main] sd_backtrace(848) > /usr/lib/libcpg.so.4(cpg_dispatch+0x594) [0x7fd1e9668d74] > Aug 07 21:09:54 [main] sd_backtrace(834) corosync.c:744: corosync_handler > Aug 07 21:09:54 [main] sd_backtrace(834) event.c:209: do_event_loop > Aug 07 21:09:54 [main] sd_backtrace(834) sheep.c:795: main > Aug 07 21:09:54 [main] sd_backtrace(848) > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfc) > [0x7fd1e90f5eac] > Aug 07 21:09:54 [main] sd_backtrace(848) sheep() [0x405498] > Aug 07 21:09:54 [main] __dump_stack_frames(744) cannot find gdb > Aug 07 21:09:54 [main] __sd_dump_variable(694) cannot find gdb > Aug 07 21:09:54 [main] crash_handler(487) sheep pid 20833 exited unexpectedly. > > sheepdog003 sheepdog003 > seep.log is empty. > > [note:4] > sheepdog001 syslog > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725454] attr: page > allocation failure: order:4, mode:0xd0 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725468] Pid: 14171, comm: > attr Not tainted 3.2.0-4-amd64 #1 Debian 3.2.46-1 > Aug 7 15:34:05 sheepdog001 kernel: [3543736.725474] Call Trace: > ... > Aug 7 16:28:04 sheepdog001 sheep: enqueue: log area overrun, dropping message > Aug 7 16:28:04 sheepdog001 sheep: enqueue: log area overrun, dropping message > Aug 7 16:28:04 sheepdog001 sheep: enqueue: log area overrun, dropping message > ... > Aug 7 16:28:04 sheepdog001 sheep: enqueue: log area overrun, dropping message > Aug 7 16:28:04 sheepdog001 rsyslogd-2177: imuxsock begins to drop > messages from pid 23853 due to rate-limiting > Aug 7 21:00:07 sheepdog001 corosync[23846]: [TOTEM ] Retransmit > List: 757f 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 758a 758b > 758c 758d 758e 758f 7590 7591 7592 > Aug 7 21:00:07 sheepdog001 corosync[23846]: [TOTEM ] Retransmit > List: 757f 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 758a 758b > 758c 758d 758e 758f 7590 7591 7592 7593 7594 7595 7596 > 7597 7598 7599 759a 759b 759c Again the same corosync messages. Cced corosync list. Thanks Yuan _______________________________________________ discuss mailing list discuss@xxxxxxxxxxxx http://lists.corosync.org/mailman/listinfo/discuss