On 12/15/2011 04:32 PM, Changliang Chen wrote:
> Hi Pranith,
>
> Thanks for your reply.
> To preserve availability we did not strace the process; after
> shutting down the daemon, the cluster recovered.
> In our case:
> 10.1.1.64 (dfs-client-6): online node; when the other node (10.1.1.65)
> restarted, its CPU user usage reached 100% (glusterfsd process).
> 10.1.1.65 (dfs-client-7): offline node; when it restarted, the
> client NFS mount point became unavailable.
> The nfs.log suggests the issue was caused by the high CPU usage on
> client-6; it contains many errors like:
>
> [2011-12-14 13:25:53.30308] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(XATTROP(33)) xid = 0x89279937x sent = 2011-12-14 13:25:20.346007. timeout = 30
>
> On Wed, Dec 14, 2011 at 6:49 PM, Pranith Kumar K <pranithk at gluster.com> wrote:
>
>     On 12/14/2011 03:06 PM, Changliang Chen wrote:
>> Hi, we have used GlusterFS for two years. After upgrading to
>> 3.2.5, we found that when one replicate node reboots and starts
>> the glusterd daemon, the cluster crashes because the other
>> replicate node's CPU usage reaches 100%.
>>
>> Our gluster info:
>>
>> Type: Distributed-Replicate
>> Status: Started
>> Number of Bricks: 5 x 2 = 10
>> Transport-type: tcp
>> Options Reconfigured:
>> performance.cache-size: 3GB
>> performance.cache-max-file-size: 512KB
>> network.frame-timeout: 30
>> network.ping-timeout: 25
>> cluster.min-free-disk: 10%
>>
>> Our hardware:
>>
>> Dell R710
>> 6 x 600GB SAS
>> 3 x 8GB RAM
>>
>> The error info:
>>
>> [2011-12-14 13:24:10.483812] E [rdma.c:4813:init] 0-rdma.management: Failed to initialize IB Device
>> [2011-12-14 13:24:10.483828] E [rpc-transport.c:742:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
>> [2011-12-14 13:24:10.483841] W [rpcsvc.c:1288:rpcsvc_transport_create] 0-rpc-service: cannot create listener, initing the transport failed
>> [2011-12-14 13:24:11.967621] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-0
>> [2011-12-14 13:24:11.967665] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-1
>> [2011-12-14 13:24:11.967681] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-2
>> [2011-12-14 13:24:11.967695] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-3
>> [2011-12-14 13:24:11.967709] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-4
>> [2011-12-14 13:24:11.967723] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-5
>> [2011-12-14 13:24:11.967736] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-6
>> [2011-12-14 13:24:11.967750] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-7
>> [2011-12-14 13:24:11.967764] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-8
>> [2011-12-14 13:24:11.967777] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-9
>> [2011-12-14 13:24:12.465565] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.17:1013)
>> [2011-12-14 13:24:12.465623] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.8:1013)
>> [2011-12-14 13:24:12.465656] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.10:1013)
>> [2011-12-14 13:24:12.465686] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.11:1013)
>> [2011-12-14 13:24:12.465716] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.125:1013)
>> [2011-12-14 13:24:12.633288] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.65:1006)
>> [2011-12-14 13:24:13.138150] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.1:1013)
>> [2011-12-14 13:24:13.284665] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.3:1013)
>> [2011-12-14 13:24:15.790805] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.8:1013)
>> [2011-12-14 13:24:16.113430] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.125:1013)
>> [2011-12-14 13:24:16.259040] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.10:1013)
>> [2011-12-14 13:24:16.392058] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.17:1013)
>> [2011-12-14 13:24:16.429444] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.11:1013)
>> [2011-12-14 13:26:05.787680] W [glusterfsd.c:727:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x37c8ed3c2d] (-->/lib64/libpthread.so.0 [0x37c96064a7] (-->/opt/glusterfs/3.2.5/sbin/glusterd(glusterfs_sigwaiter+0x17c) [0x40477c]))) 0-: received signum (15), shutting down
>>
>> --
>> Regards,
>> Cocl
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
>
> hi Changliang,
>     Could you specify which process crashed: is it glusterd or
> glusterfs? Could you provide the stack trace that is present in its
> respective log file? I don't see any stack trace in the logs you
> have provided.
>
> Pranith
>
> --
> Regards,
> Cocl
> OM manager
> 19lou Operation & Maintenance Dept

Could you send us the logs of all the machines? We will check and get back to you.
Pranith
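[Editor's note on the call_bail message quoted above: it is GlusterFS's RPC frame timeout firing. The frame was sent at 13:25:20.346007 and bailed at 13:25:53, i.e. it had been pending longer than the volume's configured network.frame-timeout of 30 seconds. A minimal sketch, using only the two timestamps from that log line, showing that the elapsed time indeed exceeds the timeout:]

```python
from datetime import datetime

# Timestamps copied from the call_bail log line in the thread above.
# Note: the bail timestamp's fractional part appears as ".30308" in the
# log; its exact sub-second value is ambiguous, but either reading gives
# an elapsed time comfortably above the 30 s timeout.
fmt = "%Y-%m-%d %H:%M:%S.%f"
bailed_at = datetime.strptime("2011-12-14 13:25:53.30308", fmt)
sent_at = datetime.strptime("2011-12-14 13:25:20.346007", fmt)
frame_timeout = 30  # network.frame-timeout from the volume options

elapsed = (bailed_at - sent_at).total_seconds()
# The frame was pending longer than the timeout, so call_bail gave up on it.
print(f"pending {elapsed:.1f}s, timeout {frame_timeout}s, "
      f"bailed: {elapsed > frame_timeout}")
```

A frame-timeout of 30 s (versus the 1800 s default in this GlusterFS era) is aggressive: any operation stalled behind a 100%-CPU brick process for half a minute gets bailed, which matches the flood of call_bail errors seen in nfs.log.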