3.1.1 crashing under moderate load

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



correction:
"clients are mounted on how many physical nodes are there in your test setup?"

should have been:
"clients are mounted on how many physical nodes in your test setup?"

----- Original Message -----
From: "Raghavendra G" <raghavendra at gluster.com>
To: "Lana Deere" <lana.deere at gmail.com>
Cc: gluster-users at gluster.org
Sent: Monday, December 6, 2010 9:30:57 AM
Subject: Re: 3.1.1 crashing under moderate load

Hi Lana,

I need some clarifications about test setup:

* Are you seeing problem when there are more than 25 clients? If this is the case, are these clients on different physical nodes or is it that more than one client shares same node? In other words, clients are mounted on how many physical nodes are there in your test setup? Also, are you running the command on each of these clients simultaneously?

* Or is it that there are more than 25 concurrent concurrent invocations of the script? If this is the case, how many clients are present in your test setup and on how many physical nodes these clients are mounted?

regards,
----- Original Message -----
From: "Lana Deere" <lana.deere at gmail.com>
To: gluster-users at gluster.org
Sent: Saturday, December 4, 2010 12:13:30 AM
Subject: 3.1.1 crashing under moderate load

I'm running GlusterFS 3.1.1, CentOS5.5 servers, CentOS5.4 clients, RDMA
transport, native/fuse access.

I have a directory which is shared on the gluster.  In fact, it is a clone
of /lib from one of the clients, shared so all can see it.

I have a script which does
    find lib -type f -print0 | xargs -0 sum | md5sum

If I run this on my clients one at a time, they all yield the same md5sum:
    for h in <<hosts>>; do ssh $host script; done

If I run this on my clients concurrently, up to roughly 25 at a time they
still yield the same md5sum.
    for h in <<hosts>>; do ssh $host script& done

Beyond that the gluster share often, but not always, fails.  The errors vary.
    - sometimes I get "sum: xxx.so not found"
    - sometimes I get the wrong checksum without any error message
    - sometimes the job simply hangs until I kill it


Some of the server logs show messages like these from the time of the
failures (other servers show nothing from around that time):

[2010-12-03 10:03:06.34328] E [rdma.c:4442:rdma_event_handler]
rpc-transport/rdma: rdma.RaidData-server: pollin received on tcp
socket (peer: 10.54.255.240:1022) after handshake is complete
[2010-12-03 10:03:06.34363] E [rpcsvc.c:1548:rpcsvc_submit_generic]
rpc-service: failed to submit message (XID: 0x55e82, Program:
GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
(rdma.RaidData-server)
[2010-12-03 10:03:06.34377] E [server.c:137:server_submit_reply] :
Reply submission failed
[2010-12-03 10:03:06.34464] E [rpcsvc.c:1548:rpcsvc_submit_generic]
rpc-service: failed to submit message (XID: 0x55e83, Program:
GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
(rdma.RaidData-server)
[2010-12-03 10:03:06.34520] E [server.c:137:server_submit_reply] :
Reply submission failed


On a client which had a failure I see messages like:

[2010-12-03 10:03:06.21290] E [rdma.c:4442:rdma_event_handler]
rpc-transport/rdma: RaidData-client-1: pollin received on tcp socket
(peer: 10.54.50.101:24009) after handshake is complete
[2010-12-03 10:03:06.21776] E [rpc-clnt.c:338:saved_frames_unwind]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e)
[0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
[0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1)
op(READ(12)) called at 2010-12-03 10:03:06.20492
[2010-12-03 10:03:06.21821] E [rpc-clnt.c:338:saved_frames_unwind]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e)
[0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
[0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1)
op(READ(12)) called at 2010-12-03 10:03:06.20529
[2010-12-03 10:03:06.26827] I
[client-handshake.c:993:select_server_supported_programs]
RaidData-client-1: Using Program GlusterFS-3.1.0, Num (1298437),
Version (310)
[2010-12-03 10:03:06.27029] I
[client-handshake.c:829:client_setvolume_cbk] RaidData-client-1:
Connected to 10.54.50.101:24009, attached to remote volume '/data'.
[2010-12-03 10:03:06.27067] I
[client-handshake.c:698:client_post_handshake] RaidData-client-1: 2
fds open - Delaying child_up until they are re-opened


Anyone else seen anything like this and/or have suggestions about options I can
set to work around this?


.. Lana (lana.deere at gmail.com)
_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


[Index of Archives]     [Gluster Development]     [Linux Filesytems Development]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux