Hi Changliang,

Could you attach the logs of the servers (bricks)?

>> Because, to keep availability, we haven't run strace on the
>> process. After shutting down the daemon, the cluster recovered.
>> In our case,

Pranith was asking for a core dump file or a backtrace (of the
crashed process), not strace output.

thanks,
krish

>> 10.1.1.64 (dfs-client-6): the online node; when the other node (65)
>> restarts, its user CPU usage reaches 100% (the glusterfsd process).
>> 10.1.1.65 (dfs-client-7): the offline node; when it restarts, the
>> client's NFS mount point becomes unavailable.
>> The nfs.log suggests the issue is caused by the high CPU usage on
>> client-6; there are lots of errors like:
>>
>> [2011-12-14 13:25:53.30308] E [rpc-clnt.c:197:call_bail]
>> 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1)
>> op(XATTROP(33)) xid = 0x89279937x sent = 2011-12-14 13:25:20.346007.
>> timeout = 30
>>
>> On Wed, Dec 14, 2011 at 6:49 PM, Pranith Kumar K
>> <pranithk at gluster.com> wrote:
>>
>> On 12/14/2011 03:06 PM, Changliang Chen wrote:
>>> Hi, we have used glusterfs for two years. After upgrading to
>>> 3.2.5, we discovered that when one of the replicate nodes reboots
>>> and starts up the glusterd daemon, the cluster crashes because the
>>> other replicate node's CPU usage reaches 100%.
>>>
>>> Our gluster info:
>>>
>>> Type: Distributed-Replicate
>>> Status: Started
>>> Number of Bricks: 5 x 2 = 10
>>> Transport-type: tcp
>>> Options Reconfigured:
>>> performance.cache-size: 3GB
>>> performance.cache-max-file-size: 512KB
>>> network.frame-timeout: 30
>>> network.ping-timeout: 25
>>> cluster.min-free-disk: 10%
>>>
>>> Our devices:
>>>
>>> Dell R710
>>> 600G SAS * 6
>>> 3 * 8G mem
>>>
>>> The error info:
>>>
>>> [2011-12-14 13:24:10.483812] E [rdma.c:4813:init] 0-rdma.management: Failed to initialize IB Device
>>> [2011-12-14 13:24:10.483828] E [rpc-transport.c:742:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
>>> [2011-12-14 13:24:10.483841] W [rpcsvc.c:1288:rpcsvc_transport_create] 0-rpc-service: cannot create listener, initing the transport failed
>>> [2011-12-14 13:24:11.967621] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-0
>>> [2011-12-14 13:24:11.967665] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-1
>>> [2011-12-14 13:24:11.967681] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-2
>>> [2011-12-14 13:24:11.967695] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-3
>>> [2011-12-14 13:24:11.967709] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-4
>>> [2011-12-14 13:24:11.967723] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-5
>>> [2011-12-14 13:24:11.967736] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-6
>>> [2011-12-14 13:24:11.967750] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-7
>>> [2011-12-14 13:24:11.967764] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-8
>>> [2011-12-14 13:24:11.967777] E [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key: brick-9
>>> [2011-12-14 13:24:12.465565] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.17:1013)
>>> [2011-12-14 13:24:12.465623] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.8:1013)
>>> [2011-12-14 13:24:12.465656] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.10:1013)
>>> [2011-12-14 13:24:12.465686] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.11:1013)
>>> [2011-12-14 13:24:12.465716] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.125:1013)
>>> [2011-12-14 13:24:12.633288] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.65:1006)
>>> [2011-12-14 13:24:13.138150] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.1:1013)
>>> [2011-12-14 13:24:13.284665] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.3:1013)
>>> [2011-12-14 13:24:15.790805] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.8:1013)
>>> [2011-12-14 13:24:16.113430] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.125:1013)
>>> [2011-12-14 13:24:16.259040] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.10:1013)
>>> [2011-12-14 13:24:16.392058] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.17:1013)
>>> [2011-12-14 13:24:16.429444] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.1.11:1013)
>>> [2011-12-14 13:26:05.787680] W [glusterfsd.c:727:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x37c8ed3c2d] (-->/lib64/libpthread.so.0 [0x37c96064a7] (-->/opt/glusterfs/3.2.5/sbin/glusterd(glusterfs_sigwaiter+0x17c) [0x40477c]))) 0-: received signum (15), shutting down
>>>
>>> --
>>>
>>> Regards,
>>>
>>> Cocl
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users

>> hi Changliang,
>>     Could you specify which process crashed: was it glusterd or
>> glusterfs? Could you also provide the stack trace that is present
>> in its respective logfile? I don't see any stack trace in the logs
>> you have provided.
>>
>> Pranith

>> --
>> Regards,
>> Cocl
>> OM manager
>> 19lou Operation & Maintenance Dept

> Could you send the logs of all the machines? We will check and get
> back to you.
>
> Pranith
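For anyone following the thread: the backtrace that Krish and Pranith are asking for can typically be pulled from a core file with gdb along the following lines. This is a generic sketch, not a command from the thread; the binary path is taken from the log's /opt/glusterfs/3.2.5/sbin/glusterd, and the core-file location is an assumption that depends on the system's kernel.core_pattern setting.

```shell
#!/bin/sh
# Sketch: collect a full backtrace from a glusterd core dump.
# BINARY and the core location are assumptions -- adjust per system.
# (Cores must be enabled before the crash: ulimit -c unlimited.)
BINARY=/opt/glusterfs/3.2.5/sbin/glusterd      # path seen in the log above
CORE=$(ls /core.* 2>/dev/null | head -n 1)     # depends on kernel.core_pattern

if [ -n "$CORE" ] && command -v gdb >/dev/null 2>&1; then
    # "thread apply all bt full" dumps every thread's stack with locals
    gdb "$BINARY" "$CORE" -batch -ex "thread apply all bt full" \
        > glusterd-backtrace.txt
    echo "backtrace written to glusterd-backtrace.txt"
else
    echo "no core file or gdb found (check ulimit -c and kernel.core_pattern)"
fi
```

The resulting glusterd-backtrace.txt is the kind of artifact to attach to a reply, alongside the brick logs already requested.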