On 12/14/2011 03:06 PM, Changliang Chen wrote:
> Hi, we have used glusterfs for two years. After upgrading to 3.2.5, we
> discovered that when one of the replicate nodes reboots and starts the
> glusterd daemon, gluster crashes because the other replicate node's CPU
> usage reaches 100%.
>
> Our gluster info:
>
> Type: Distributed-Replicate
> Status: Started
> Number of Bricks: 5 x 2 = 10
> Transport-type: tcp
> Options Reconfigured:
> performance.cache-size: 3GB
> performance.cache-max-file-size: 512KB
> network.frame-timeout: 30
> network.ping-timeout: 25
> cluster.min-free-disk: 10%
>
> Our hardware:
>
> Dell R710
> 6 x 600GB SAS
> 3 x 8GB RAM
>
> The error info:
>
> [2011-12-14 13:24:10.483812] E [rdma.c:4813:init] 0-rdma.management: Failed to initialize IB Device
> [2011-12-14 13:24:10.483828] E [rpc-transport.c:742:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
> [2011-12-14 13:24:10.483841] W [rpcsvc.c:1288:rpcsvc_transport_create] 0-rpc-service: cannot create listener, initing the transport failed
> [2011-12-14 13:24:11.967621] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-0
> [2011-12-14 13:24:11.967665] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-1
> [2011-12-14 13:24:11.967681] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-2
> [2011-12-14 13:24:11.967695] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-3
> [2011-12-14 13:24:11.967709] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-4
> [2011-12-14 13:24:11.967723] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-5
> [2011-12-14 13:24:11.967736] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-6
> [2011-12-14 13:24:11.967750] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-7
> [2011-12-14 13:24:11.967764] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-8
> [2011-12-14 13:24:11.967777] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-9
> [2011-12-14 13:24:12.465565] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.17:1013)
> [2011-12-14 13:24:12.465623] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.8:1013)
> [2011-12-14 13:24:12.465656] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.10:1013)
> [2011-12-14 13:24:12.465686] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.11:1013)
> [2011-12-14 13:24:12.465716] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.125:1013)
> [2011-12-14 13:24:12.633288] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.65:1006)
> [2011-12-14 13:24:13.138150] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.1:1013)
> [2011-12-14 13:24:13.284665] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.3:1013)
> [2011-12-14 13:24:15.790805] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.8:1013)
> [2011-12-14 13:24:16.113430] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.125:1013)
> [2011-12-14 13:24:16.259040] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.10:1013)
> [2011-12-14 13:24:16.392058] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.17:1013)
> [2011-12-14 13:24:16.429444] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.11:1013)
> [2011-12-14 13:26:05.787680] W [glusterfsd.c:727:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x37c8ed3c2d] (-->/lib64/libpthread.so.0 [0x37c96064a7] (-->/opt/glusterfs/3.2.5/sbin/glusterd(glusterfs_sigwaiter+0x17c) [0x40477c]))) 0-: received signum (15), shutting down
>
> --
>
> Regards,
>
> Cocl
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users

hi Changliang,
    Could you specify which process crashed: is it glusterd or glusterfs? Could you provide the stack trace that is present in its respective logfile? I don't see any stack trace in the logs you have provided.
Pranith
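[Editor's note: the logs quoted above end with a clean shutdown ("received signum (15)") rather than a crash backtrace, which is why a stack trace is being requested. As a rough sketch of how to look for one: GlusterFS logs normally live under /var/log/glusterfs, and a crash dump typically begins with a "signal received" line followed by stack frames. The log fragment below is fabricated for illustration only; it is not from the original report.]

```shell
# Fabricated sample log in the shape of a GlusterFS crash dump
# (real 3.2.x logs are found under /var/log/glusterfs by default).
cat > /tmp/sample-glusterd.log <<'EOF'
[2011-12-14 13:24:10.483812] E [rdma.c:4813:init] 0-rdma.management: Failed to initialize IB Device
pending frames:
patchset: v3.2.5
signal received: 11
time of crash: 2011-12-14 13:24:11
/lib64/libc.so.6[0x37c8e32900]
/opt/glusterfs/3.2.5/sbin/glusterd[0x40477c]
EOF

# Pull the crash marker and the stack frames that follow it:
grep -A 4 "signal received" /tmp/sample-glusterd.log
```

If no such section appears in the glusterd or glusterfs log, the process likely did not crash at all (for example, it was hung at 100% CPU or was stopped by a signal), which is a different failure mode to report.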