Also, what configuration are you using? Is it replicate, distribute, stripe, or a single client and server? Can you attach the entire client and server log files?

----- Original Message -----
From: "Raghavendra G" <raghavendra at gluster.com>
To: "Lana Deere" <lana.deere at gmail.com>
Cc: gluster-users at gluster.org
Sent: Monday, December 6, 2010 9:32:36 AM
Subject: Re: 3.1.1 crashing under moderate load

correction: "clients are mounted on how many physical nodes are there in your test setup?" should have been: "clients are mounted on how many physical nodes in your test setup?"

----- Original Message -----
From: "Raghavendra G" <raghavendra at gluster.com>
To: "Lana Deere" <lana.deere at gmail.com>
Cc: gluster-users at gluster.org
Sent: Monday, December 6, 2010 9:30:57 AM
Subject: Re: 3.1.1 crashing under moderate load

Hi Lana,

I need some clarifications about the test setup:

* Are you seeing the problem when there are more than 25 clients? If so, are these clients on different physical nodes, or do several clients share the same node? In other words, the clients are mounted on how many physical nodes in your test setup? Also, are you running the command on each of these clients simultaneously?

* Or is it that there are more than 25 concurrent invocations of the script? If so, how many clients are present in your test setup, and on how many physical nodes are these clients mounted?

regards,

----- Original Message -----
From: "Lana Deere" <lana.deere at gmail.com>
To: gluster-users at gluster.org
Sent: Saturday, December 4, 2010 12:13:30 AM
Subject: 3.1.1 crashing under moderate load

I'm running GlusterFS 3.1.1 with CentOS 5.5 servers, CentOS 5.4 clients, RDMA transport, and native/FUSE access.

I have a directory which is shared on the gluster volume. In fact, it is a clone of /lib from one of the clients, shared so all the clients can see it.
I have a script which does

    find lib -type f -print0 | xargs -0 sum | md5sum

If I run this on my clients one at a time, they all yield the same md5sum:

    for h in <<hosts>>; do ssh $h script; done

If I run this on my clients concurrently, up to roughly 25 at a time, they still yield the same md5sum:

    for h in <<hosts>>; do ssh $h script & done

Beyond that, the gluster share often, but not always, fails. The errors vary:

- sometimes I get "sum: xxx.so not found"
- sometimes I get the wrong checksum without any error message
- sometimes the job simply hangs until I kill it

Some of the server logs show messages like these from the time of the failures (other servers show nothing from around that time):

[2010-12-03 10:03:06.34328] E [rdma.c:4442:rdma_event_handler] rpc-transport/rdma: rdma.RaidData-server: pollin received on tcp socket (peer: 10.54.255.240:1022) after handshake is complete
[2010-12-03 10:03:06.34363] E [rpcsvc.c:1548:rpcsvc_submit_generic] rpc-service: failed to submit message (XID: 0x55e82, Program: GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport (rdma.RaidData-server)
[2010-12-03 10:03:06.34377] E [server.c:137:server_submit_reply] : Reply submission failed
[2010-12-03 10:03:06.34464] E [rpcsvc.c:1548:rpcsvc_submit_generic] rpc-service: failed to submit message (XID: 0x55e83, Program: GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport (rdma.RaidData-server)
[2010-12-03 10:03:06.34520] E [server.c:137:server_submit_reply] : Reply submission failed

On a client which had a failure I see messages like:

[2010-12-03 10:03:06.21290] E [rdma.c:4442:rdma_event_handler] rpc-transport/rdma: RaidData-client-1: pollin received on tcp socket (peer: 10.54.50.101:24009) after handshake is complete
[2010-12-03 10:03:06.21776] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x3814a0ef1e]
(-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(READ(12)) called at 2010-12-03 10:03:06.20492
[2010-12-03 10:03:06.21821] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(READ(12)) called at 2010-12-03 10:03:06.20529
[2010-12-03 10:03:06.26827] I [client-handshake.c:993:select_server_supported_programs] RaidData-client-1: Using Program GlusterFS-3.1.0, Num (1298437), Version (310)
[2010-12-03 10:03:06.27029] I [client-handshake.c:829:client_setvolume_cbk] RaidData-client-1: Connected to 10.54.50.101:24009, attached to remote volume '/data'.
[2010-12-03 10:03:06.27067] I [client-handshake.c:698:client_post_handshake] RaidData-client-1: 2 fds open - Delaying child_up until they are re-opened

Anyone else seen anything like this and/or have suggestions about options I can set to work around this?

.. Lana (lana.deere at gmail.com)

_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
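[Editor's note] Lana's checksum pipeline can be exercised locally as a self-contained sketch. Everything below is illustrative: the temporary directory and its files merely stand in for the gluster-mounted clone of /lib, and a `sort -z` is added (not present in the original one-liner) so repeated runs compare file lists in a deterministic order.

```shell
#!/bin/sh
# Sketch of the per-host checksum test from the thread.
# DIR stands in for the gluster-mounted copy of /lib; here it is a
# throwaway local directory so the sketch runs anywhere.
DIR=$(mktemp -d)
echo "alpha" > "$DIR/a.so"
echo "beta"  > "$DIR/b.so"

# Same pipeline as in the report: checksum every regular file, then hash
# the combined listing so each run reduces to a single comparable value.
# sort -z makes the file order deterministic between runs.
sum1=$(find "$DIR" -type f -print0 | sort -z | xargs -0 sum | md5sum | awk '{print $1}')
sum2=$(find "$DIR" -type f -print0 | sort -z | xargs -0 sum | md5sum | awk '{print $1}')

# Two runs over an unchanged tree must agree; in the reported failure,
# concurrent runs across >25 clients sometimes disagreed or errored out.
if [ "$sum1" = "$sum2" ]; then
    echo "checksums agree"
else
    echo "MISMATCH"
fi
rm -rf "$DIR"
```

On the real setup this script would be pushed to each client and launched via the `for h in <<hosts>>; do ssh $h script & done` loop from the report, with the resulting md5sums compared across hosts rather than across two local runs.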