Hi pranithk,

The attachments provide three logs: the NFS log, the client-6 log, and the client-7 log.

On Thu, Dec 15, 2011 at 7:46 PM, Pranith Kumar K <pranithk at gluster.com> wrote:

> On 12/15/2011 04:32 PM, Changliang Chen wrote:
>
> Hi pranithk,
>
> Thanks for your reply.
> To preserve availability, we did not strace the process. After
> shutting down the daemon, the cluster recovered.
> In our case:
> 10.1.1.64 (dfs-client-6): the online node. When the other node (65)
> restarts, user CPU usage reaches 100% (glusterfsd process).
> 10.1.1.65 (dfs-client-7): the offline node. When it restarts, the client NFS
> mount point becomes unavailable.
>
> The nfs.log suggests that the issue is caused by the high CPU usage on
> client-6; there are many errors like:
>
> [2011-12-14 13:25:53.30308] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(XATTROP(33)) xid = 0x89279937x sent = 2011-12-14 13:25:20.346007. timeout = 30
>
> On Wed, Dec 14, 2011 at 6:49 PM, Pranith Kumar K <pranithk at gluster.com> wrote:
>
>> On 12/14/2011 03:06 PM, Changliang Chen wrote:
>>
>> Hi, we have used glusterfs for two years. After upgrading to 3.2.5, we
>> discovered that when one of the replica nodes reboots and starts the
>> glusterd daemon, gluster crashes because CPU usage on the other
>> replica node reaches 100%.
>>
>> Our gluster info:
>>
>> Type: Distributed-Replicate
>> Status: Started
>> Number of Bricks: 5 x 2 = 10
>> Transport-type: tcp
>> Options Reconfigured:
>> performance.cache-size: 3GB
>> performance.cache-max-file-size: 512KB
>> network.frame-timeout: 30
>> network.ping-timeout: 25
>> cluster.min-free-disk: 10%
>>
>> Our device:
>>
>> Dell R710
>> 6 x 600 GB SAS
>> 3 x 8 GB memory
>>
>> The error info:
>>
>> [2011-12-14 13:24:10.483812] E [rdma.c:4813:init] 0-rdma.management: Failed to initialize IB Device
>> [2011-12-14 13:24:10.483828] E [rpc-transport.c:742:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
>> [2011-12-14 13:24:10.483841] W [rpcsvc.c:1288:rpcsvc_transport_create] 0-rpc-service: cannot create listener, initing the transport failed
>> [2011-12-14 13:24:11.967621] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-0
>> [2011-12-14 13:24:11.967665] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-1
>> [2011-12-14 13:24:11.967681] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-2
>> [2011-12-14 13:24:11.967695] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-3
>> [2011-12-14 13:24:11.967709] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-4
>> [2011-12-14 13:24:11.967723] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-5
>> [2011-12-14 13:24:11.967736] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-6
>> [2011-12-14 13:24:11.967750] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-7
>> [2011-12-14 13:24:11.967764] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-8
>> [2011-12-14 13:24:11.967777] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-9
>> [2011-12-14 13:24:12.465565] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.17:1013)
>> [2011-12-14 13:24:12.465623] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.8:1013)
>> [2011-12-14 13:24:12.465656] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.10:1013)
>> [2011-12-14 13:24:12.465686] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.11:1013)
>> [2011-12-14 13:24:12.465716] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.125:1013)
>> [2011-12-14 13:24:12.633288] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.65:1006)
>> [2011-12-14 13:24:13.138150] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.1:1013)
>> [2011-12-14 13:24:13.284665] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.3:1013)
>> [2011-12-14 13:24:15.790805] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.8:1013)
>> [2011-12-14 13:24:16.113430] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.125:1013)
>> [2011-12-14 13:24:16.259040] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.10:1013)
>> [2011-12-14 13:24:16.392058] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.17:1013)
>> [2011-12-14 13:24:16.429444] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.11:1013)
>> [2011-12-14 13:26:05.787680] W [glusterfsd.c:727:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x37c8ed3c2d] (-->/lib64/libpthread.so.0 [0x37c96064a7] (-->/opt/glusterfs/3.2.5/sbin/glusterd(glusterfs_sigwaiter+0x17c) [0x40477c]))) 0-: received signum (15), shutting down
>>
>> --
>>
>> Regards,
>>
>> Cocl
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
>>
>> hi Changliang,
>>      Could you specify which process crashed? Is it glusterd or
>> glusterfs? Could you provide the stack trace that is present in its
>> respective logfile? I don't see any stack trace in the logs you have
>> provided.
>>
>> Pranith
>>
>
> --
>
> Regards,
>
> Cocl
> OM manager
> 19lou Operation & Maintenance Dept
>
> Could you send the logs of all the machines? We will check and get back to
> you.
>
> Pranith
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: etc-glusterfs-glusterd.vol.log_64
Type: application/octet-stream
Size: 16831 bytes
Desc: not available
URL: <http://gluster.org/pipermail/gluster-users/attachments/20111216/b0956254/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: etc-glusterfs-glusterd.vol.log_65
Type: application/octet-stream
Size: 30471 bytes
Desc: not available
URL: <http://gluster.org/pipermail/gluster-users/attachments/20111216/b0956254/attachment-0003.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nfs_log.rar
Type: application/rar
Size: 372034 bytes
Desc: not available
URL: <http://gluster.org/pipermail/gluster-users/attachments/20111216/b0956254/attachment-0001.rar>
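
For anyone sifting through logs like the ones quoted in this thread, a small Python sketch along these lines may help tally entries by severity and reporting function (for example, to see how often `call_bail` appears in nfs.log). The regular expression is an assumption inferred from the log lines shown here, not an official GlusterFS log grammar, and the names `LOG_RE`, `summarize` and `SAMPLE` are hypothetical.

```python
import re
from collections import Counter

# Assumed line shape, inferred from the entries quoted in this thread:
# [timestamp] LEVEL [file.c:line:function] domain: message
LOG_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+"
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]\s+"
    r"(?P<rest>.*)$"
)


def summarize(lines):
    """Count entries per (level, function); lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match:
            counts[(match.group("level"), match.group("func"))] += 1
    return counts


# A few entries copied from the thread, used only as sample input.
SAMPLE = [
    "[2011-12-14 13:24:10.483812] E [rdma.c:4813:init] 0-rdma.management: "
    "Failed to initialize IB Device",
    "[2011-12-14 13:24:11.967621] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] "
    "0-: Unknown key: brick-0",
    "[2011-12-14 13:24:11.967665] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] "
    "0-: Unknown key: brick-1",
    "[2011-12-14 13:25:53.30308] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: "
    "bailing out frame type(GlusterFS 3.1) op(XATTROP(33)) xid = 0x89279937x "
    "sent = 2011-12-14 13:25:20.346007. timeout = 30",
]

if __name__ == "__main__":
    # Print the most frequent (level, function) pairs first.
    for (level, func), n in summarize(SAMPLE).most_common():
        print(level, func, n)
```

To run it against a real file, pass an open file object instead of `SAMPLE`, e.g. `summarize(open("nfs.log"))`; the file name here is only an example.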