Re: Problems with RadosGW bench

Gregory Farnum <greg@xxxxxxxxxxx> · Fri, 11 Oct 2013 08:49:46 -0700



Without more details it sounds like you're just overloading the
cluster. How are the clients generating their load — is there any
throttling?
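
For reference, one common way for a load-generating client to throttle itself is a token bucket. The sketch below is only an illustration: the TokenBucket class, its parameters, and the rate used are hypothetical and are not part of radosgw or of any S3 client library.

```python
import time


class TokenBucket:
    """Client-side limiter: allows at most `rate` operations per second on average."""

    def __init__(self, rate, capacity, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(rate)          # tokens refilled per second
        self.capacity = float(capacity)  # largest burst allowed
        self.tokens = float(capacity)    # start with a full bucket
        self.clock = clock               # injectable so the limiter can be tested
        self.sleep = sleep
        self.last = clock()

    def acquire(self):
        """Block until one token is available, then consume it."""
        while True:
            now = self.clock()
            # Refill according to the time elapsed since the last check.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return
            # Wait just long enough for the missing fraction of a token.
            self.sleep((1.0 - self.tokens) / self.rate)
```

A client would call acquire() before each PUT, e.g. bucket = TokenBucket(rate=4, capacity=1) and then bucket.acquire() ahead of every request. The rate of 4 per second is an arbitrary example value and would need to be tuned to what the cluster can sustain.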
4 gateways can probably process on the order of 15k ops/second; each
of those PUT ops is going to require 3 writes to the disks on the
backend (times whatever the replication value is), so the OSDs can
probably handle 72*120/(2*3)=1440 PUTs/s; meanwhile you have 300
clients all trying to do 800k puts (=240 million puts, or about 2 days
of write time), and I'm guessing they're sending them out as fast as
they can generate them.
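
The back-of-the-envelope estimate above can be reproduced with a few lines of Python. This is a rough sketch: reading the 120 as IOPS per disk and the 2 as the replication factor is an assumption about what the formula's factors mean, and the constant names are made up for illustration.

```python
OSDS = 72                  # disks in the cluster (from the original post)
IOPS_PER_DISK = 120        # assumption: rough figure for one spinning disk
REPLICATION = 2            # assumption: the "2" in the formula
WRITES_PER_PUT = 3         # backend writes per PUT, as stated above
CLIENTS = 300
PUTS_PER_CLIENT = 800_000

# Sustainable PUT rate once replication and per-PUT writes are accounted for.
cluster_puts_per_sec = OSDS * IOPS_PER_DISK / (REPLICATION * WRITES_PER_PUT)

# Total work the benchmark is asking for, and how long it takes at that rate.
total_puts = CLIENTS * PUTS_PER_CLIENT
days = total_puts / cluster_puts_per_sec / 86_400

print(f"{cluster_puts_per_sec:.0f} PUTs/s, {total_puts:,} PUTs, {days:.2f} days")
```

With these numbers, 300 x 800,000 comes to 240 million PUTs, which at 1440 PUTs/s is a little under 2 days of continuous writing.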
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Fri, Oct 11, 2013 at 5:56 AM, Alexis GÜNST HORN
<alexis.gunsthorn@xxxxxxxxxxxx> wrote:
> Hello to all,
>
> Here is my context:
>
> - Ceph cluster composed of 72 OSDs (ie 72 disks).
> - 4 radosgw gateways
> - Round-robin DNS for load balancing across gateways
>
> My goal is to test / bench the S3 API.
>
> Here is my scenario, with 300 clients on 300 different hosts:
>
> 1) each client uploads about 800,000 files. One bucket per client, all
> on the same account
> 2) each client does a recursive "ls" to get the whole bucket listing
> 3) each client randomly copies one object to another
> 4) each client randomly moves one object to another
> 5) each client randomly deletes an object
>
>
> Here is the result:
> 1 => OK
> 2 => OK
> but steps 3, 4 and 5 sometimes succeed and sometimes fail.
>
> In fact, I get a lot of 500 errors on PUT requests.
>
> Here is the Apache log:
> [Fri Oct 11 12:46:36 2013] [error] [client xxx.xxx.xxx.xxx] FastCGI:
> comm with server "/var/www/s3gw.fcgi" aborted: idle timeout (30 sec)
> [Fri Oct 11 12:46:36 2013] [error] [client xxx.xxx.xxx.xxx] FastCGI:
> incomplete headers (0 bytes) received from server "/var/www/s3gw.fcgi"
>
> And in the radosgw logs, I have some of these lines:
> radosgw: 2013-10-07 17:12:20.843522 7f61462ad700  1 heartbeat_map
> is_healthy 'RGWProcess::m_tp thread 0x7f60e4fa7700' had timed out
> after 600
>
> and
>
> radosgw: 2013-10-07 17:12:14.027608 7f61007d3700  1 heartbeat_map
> reset_timeout 'RGWProcess::m_tp thread 0x7f61007d3700' had timed out
> after 600
>
> But, the Ceph cluster is still OK (HEALTH_OK).
>
> Here are my options for radosgw:
>
> [client.radosgw.gateway]
> host = xxxx
> keyring = /etc/ceph/keyring.radosgw.gateway
> rgw socket path = /tmp/radosgw.sock
> rgw enable ops log = false
> rgw print continue = false
> rgw enable usage log = true
> debug rgw = 0
> rgw usage log tick interval = 30
> rgw usage log flush threshold = 1024
> rgw usage max shards = 32
> rgw usage max user shards = 1
> rgw dns name = xxxx
> rgw thread pool size = 150
> rgw gc max objs = 64
>
> Do you have any idea what could explain these 500 errors?
>
> Thanks a lot for your help
> Alexis
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com