Corosync death after the "FAILED TO RECEIVE" message is fixed by commit
81ff0e8c94589bb7139d89e573a75473cfc5d173 in corosync 1.4.5. Please try to
install this version.

Regards,
  Honza

Liu Yuan wrote:
> On 05/07/2013 06:15 PM, Valerio Pachera wrote:
>> Hi, my production cluster has crashed.
>> I'm trying to understand the causes.
>> 3 nodes: sheepdog001 (2T), sheepdog002 (2T), sheepdog004 (500M).
>>
>> The cluster had been working fine until today, and we copy lots of data
>> onto it each night.
>> This morning I had to expand a vdi from 600G to 1T.
>> Then I ran a backup process using this vdi; the backup was reading from
>> and writing to the same vdi.
>> The guest was running on sheepdog004.
>>
>> From the logs I can see that sheepdog002 died first (8:44).
>> Rebuilding started and, later (10:38), sheepdog004 died too. The cluster
>> stopped.
>>
>> Right now I have two qemu processes on sheepdog004 that I can't kill -9.
>> Corosync and sheepdog processes are running only on sheepdog001.
>>
>> I'm going to force a reboot of sheepdog004 and reboot the other nodes
>> normally.
>> Then I'll run sheep in this order: sheepdog001, sheepdog004, sheepdog002.
>> Any suggestions?
>>
>> Here is more info:
>>
>> root@sheepdog001:~# collie vdi list
>>   Name           Id    Size    Used  Shared    Creation time   VDI id  Copies  Tag
>>   zimbra_backup   0  100 GB   99 GB  0.0 MB  2013-04-16 21:41    2e519       2
>>   systemrescue    0  350 MB  0.0 MB  0.0 MB  2013-05-07 08:44   c8be4d       2
>>   backup_data     0  1.0 TB  606 GB  0.0 MB  2013-04-16 21:45   c8d128       2
>>   crmdelta        0   50 GB  7.7 GB  0.0 MB  2013-04-16 21:32   e149bf       2
>>   backp           0   10 GB  3.8 GB  0.0 MB  2013-04-16 21:31   f313b6       2
>>
>> SHEEPDOG002
>> /var/log/messages
>> May  7 08:44:45 sheepdog002 sheep: logger pid 4179 stopped
>>
>> sheep.log
>> May 07 08:44:44 [main] corosync_handler(740) corosync driver received EPOLLHUP event, exiting.
>>
>> /var/log/syslog
>> ...
>> May  7 08:44:40 sheepdog002 corosync[2777]:   [TOTEM ] Retransmit List: 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
>> May  7 08:44:40 sheepdog002 corosync[2777]:   [TOTEM ] Retransmit List: 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
>> May  7 08:44:40 sheepdog002 corosync[2777]:   [TOTEM ] Retransmit List: 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
>> May  7 08:44:41 sheepdog002 corosync[2777]:   [TOTEM ] Retransmit List: 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
>> May  7 08:44:41 sheepdog002 corosync[2777]:   [TOTEM ] FAILED TO RECEIVE
>> May  7 08:44:45 sheepdog002 sheep: logger pid 4179 stopped
>>
>
> This looks like a Corosync issue. I have no idea what these logs mean;
> it is outside my knowledge. CC'ed the corosync devel list for help.
>
>> SHEEPDOG004
>> /var/log/syslog
>> May  7 08:35:33 sheepdog004 corosync[5314]:   [TOTEM ] Retransmit List: 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8 6d9 6da 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
>> May  7 08:35:34 sheepdog004 corosync[5314]:   [TOTEM ] Retransmit List: 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8 6d9 6da 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee
>> May  7 08:35:34 sheepdog004 corosync[5314]:   [TOTEM ] Retransmit List: 6e5 6e6 6e7 6e8 6e9 6ea 6eb 6ec 6ed 6ee 6d1 6d2 6d3 6d4 6d5 6d6 6d7 6d8 6d9 6da 6db 6dc 6dd 6de 6df 6e0 6e1 6e2 6e3 6e4
>> ...
>> May  7 10:38:59 sheepdog004 corosync[5314]:   [TOTEM ] Retransmit List: 9d 9e 9f a0 a1 a2 a3 a4 a5 a6
>> May  7 10:38:59 sheepdog004 corosync[5314]:   [TOTEM ] FAILED TO RECEIVE
>>
>> /var/log/messages
>> May  7 10:39:04 sheepdog004 sheep: logger pid 15809 stopped
>>
>> sheep.log
>> May 07 08:44:45 [rw 15814] recover_object_work(204) done:0 count:181797, oid:c8d12800006f80
>> May 07 08:44:45 [rw 15814] recover_object_work(204) done:1 count:181797, oid:c8d1280000b162
>> May 07 08:44:45 [rw 15814] recover_object_work(204) done:2 count:181797, oid:c8d1280001773b
>> May 07 08:44:45 [rw 15814] recover_object_work(204) done:3 count:181797, oid:c8d1280000b5ce
>> May 07 08:44:46 [rw 15814] recover_object_work(204) done:4 count:181797, oid:c8d1280000b709
>> May 07 08:44:46 [rw 15814] recover_object_work(204) done:5 count:181797, oid:2e51900004acf
>> ...
>> May 07 09:44:17 [rw 19417] recover_object_work(204) done:13869 count:181797, oid:c8d1280000b5ae
>> May 07 09:44:18 [rw 19417] recover_object_work(204) done:13870 count:181797, oid:c8d128000202ff
>> May 07 09:44:22 [gway 20481] wait_forward_request(167) poll timeout 1
>> May 07 09:44:22 [rw 19417] recover_object_work(204) done:13871 count:181797, oid:c8d12800022fdf
>> May 07 09:44:22 [gway 20399] wait_forward_request(167) poll timeout 1
>> May 07 09:44:22 [gway 20429] wait_forward_request(167) poll timeout 1
>> May 07 09:44:22 [gway 20414] wait_forward_request(167) poll timeout 1
>> May 07 09:44:22 [gway 20398] wait_forward_request(167) poll timeout 1
>> ...
>> May 07 09:44:22 [rw 19417] recover_object_work(204) done:13872 count:181797, oid:c8d1280000b355
>> May 07 09:44:22 [rw 19417] recover_object_work(204) done:13873 count:181797, oid:c8d1280000afa4
>> May 07 09:44:23 [rw 19417] recover_object_work(204) done:13874 count:181797, oid:c8d128000114ac
>> May 07 09:44:24 [rw 19417] recover_object_work(204) done:13875 count:181797, oid:c8d128000140e9
>> May 07 09:44:24 [rw 19417] recover_object_work(204) done:13876 count:181797, oid:c8d1280001f031
>> May 07 09:44:24 [rw 19417] recover_object_work(204) done:13877 count:181797, oid:c8d12800008d92
>> ...
>> May 07 10:39:03 [main] corosync_handler(740) corosync driver received EPOLLHUP event, exiting.
>>
>
> This means the corosync process was gone (killed?).
>
>>
>> SHEEPDOG001
>> /var/log/syslog
>> May  7 10:38:58 sheepdog001 corosync[2695]:   [TOTEM ] Retransmit List: 9d 9e 9f a0 a1 a2 a3 a4 a5 a6 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95 96 97 98 99 9a 9b 9c
>> May  7 10:38:58 sheepdog001 corosync[2695]:   [TOTEM ] Retransmit List: 93 94 95 96 97 98 99 9a 9b 9c 89 8a 8b 8c 8d 8e 8f 90 91 92 9d 9e 9f a0 a1 a2 a3 a4 a5 a6
>> May  7 10:38:59 sheepdog001 corosync[2695]:   [TOTEM ] Retransmit List: 9d 9e 9f a0 a1 a2 a3 a4 a5 a6 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95 96 97 98 99 9a 9b 9c
>> May  7 10:38:59 sheepdog001 corosync[2695]:   [TOTEM ] Retransmit List: 93 94 95 96 97 98 99 9a 9b 9c 89 8a 8b 8c 8d 8e 8f 90 91 92 9d 9e 9f a0 a1 a2 a3 a4 a5 a6
>> May  7 10:39:02 sheepdog001 corosync[2695]:   [TOTEM ] A processor failed, forming new configuration.
>> May  7 10:39:02 sheepdog001 corosync[2695]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> May  7 10:39:02 sheepdog001 corosync[2695]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.6.41) ; members(old:2 left:1)
>> May  7 10:39:02 sheepdog001 corosync[2695]:   [MAIN  ] Completed service synchronization, ready to provide service.
>> May  7 10:39:03 sheepdog001 corosync[2695]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> May  7 10:39:03 sheepdog001 corosync[2695]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.6.41) ; members(old:1 left:0)
>> May  7 10:39:03 sheepdog001 corosync[2695]:   [MAIN  ] Completed service synchronization, ready to provide service.
>>
>> sheep.log
>> ...
>> May 07 10:29:15 [rw 20643] recover_object_work(204) done:181794 count:181797, oid:c8d1280000117b
>> May 07 10:29:15 [rw 20668] recover_object_work(204) done:181795 count:181797, oid:c8d128000055fe
>> May 07 10:29:15 [rw 20643] recover_object_work(204) done:181796 count:181797, oid:c8d12800012667
>>
>
> I hope someone from the Corosync community can take a look at this issue;
> you might need to provide more information about Corosync, such as its
> version and your host platform.
>
> Thanks,
> Yuan
>
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss
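For anyone triaging a similar incident: before the fatal "FAILED TO RECEIVE",
each affected node logs a burst of "Retransmit List" lines, so a quick grep
over syslog shows which hosts were struggling and which actually hit the
fatal condition. A minimal sketch; the sample lines below are inlined purely
for illustration, in practice point the greps at each node's real
/var/log/syslog:

```shell
#!/bin/sh
# Summarize corosync TOTEM trouble per host from syslog.
# The heredoc sample stands in for the real /var/log/syslog.
cat > /tmp/corosync.syslog <<'EOF'
May  7 08:44:40 sheepdog002 corosync[2777]:   [TOTEM ] Retransmit List: 6db 6dc 6dd
May  7 08:44:41 sheepdog002 corosync[2777]:   [TOTEM ] Retransmit List: 6e5 6e6 6e7
May  7 08:44:41 sheepdog002 corosync[2777]:   [TOTEM ] FAILED TO RECEIVE
May  7 10:38:59 sheepdog004 corosync[5314]:   [TOTEM ] Retransmit List: 9d 9e 9f
EOF

# How many retransmit-storm lines per host (field 4 is the hostname):
grep 'Retransmit List' /tmp/corosync.syslog | awk '{print $4}' | sort | uniq -c

# Which hosts actually hit the fatal condition:
grep 'FAILED TO RECEIVE' /tmp/corosync.syslog | awk '{print $4}'
```

A sustained retransmit storm usually points at the network (packet loss,
multicast issues, an overloaded NIC) rather than at corosync itself, which is
worth ruling out even after upgrading to a version with the crash fix.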