Without more details, it sounds like you're just overloading the cluster.
How are the clients generating their load? Is there any throttling?
4 gateways can probably process on the order of 15k ops/second. Each of
those PUT ops is going to require 3 writes to the disks on the backend
(times whatever the replication value is), so the OSDs can probably handle
72*120/(2*3) = 1440 PUTs/s. Meanwhile you have 300 clients all trying to do
800k PUTs each (= 240 million PUTs, or about 2 days of write time), and I'm
guessing they're sending them out as fast as they can generate them.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Fri, Oct 11, 2013 at 5:56 AM, Alexis GÜNST HORN <alexis.gunsthorn@xxxxxxxxxxxx> wrote:
> Hello to all,
>
> Here is my context:
>
> - Ceph cluster composed of 72 OSDs (i.e. 72 disks).
> - 4 radosgw gateways
> - Round-robin DNS for load balancing across gateways
>
> My goal is to test / bench the S3 API.
>
> Here is my scenario, with 300 clients from 300 different hosts:
>
> 1) each client uploading about 800,000 files. One bucket / client, on
> the same account
> 2) each client making recursive "ls" to get the whole list of the bucket
> 3) each client randomly copying one object to another
> 4) each client randomly moving one object to another
> 5) each client randomly deleting an object
>
>
> Here is the result:
> 1 => OK
> 2 => OK
> but 3, 4 and 5 sometimes succeed and sometimes fail.
>
> In fact, I get a lot of 500 errors on PUT requests.
>
> Here is the Apache log:
> [Fri Oct 11 12:46:36 2013] [error] [client xxx.xxx.xxx.xxx] FastCGI:
> comm with server "/var/www/s3gw.fcgi" aborted: idle timeout (30 sec)
> [Fri Oct 11 12:46:36 2013] [error] [client xxx.xxx.xxx.xxx] FastCGI:
> incomplete headers (0 bytes) received from server "/var/www/s3gw.fcgi"
>
> And in the radosgw logs, I have some of these lines:
> radosgw: 2013-10-07 17:12:20.843522 7f61462ad700 1 heartbeat_map
> is_healthy 'RGWProcess::m_tp thread 0x7f60e4fa7700' had timed out
> after 600
>
> and
>
> radosgw: 2013-10-07 17:12:14.027608 7f61007d3700 1 heartbeat_map
> reset_timeout 'RGWProcess::m_tp thread 0x7f61007d3700' had timed out
> after 600
>
> But the Ceph cluster is still OK (HEALTH_OK).
>
> Here are my options for radosgw:
>
> [client.radosgw.gateway]
> host = xxxx
> keyring = /etc/ceph/keyring.radosgw.gateway
> rgw socket path = /tmp/radosgw.sock
> rgw enable ops log = false
> rgw print continue = false
> rgw enable usage log = true
> debug rgw = 0
> rgw usage log tick interval = 30
> rgw usage log flush threshold = 1024
> rgw usage max shards = 32
> rgw usage max user shards = 1
> rgw dns name = xxxx
> rgw thread pool size = 150
> rgw gc max objs = 64
>
> Do you have any idea what could explain these 500 errors?
>
> Thanks a lot for your help
> Alexis
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
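
The back-of-envelope estimate in the reply can be reproduced with a few lines of Python. This is only a sketch under assumptions the thread does not spell out: the 120 in 72*120/(2*3) is read as per-disk ops/second, the 2 as the replication factor, and the 3 as backend writes per PUT. The function and parameter names are made up for illustration. The client side is 300 clients times 800,000 PUTs, which is 240 million PUTs in total.

```python
# Rough capacity estimate using the figures quoted in the thread.
# Assumptions (not confirmed by the thread): 120 is per-disk ops/second,
# 2 is the replication factor, 3 is the number of backend writes per PUT.

SECONDS_PER_DAY = 86_400


def sustainable_puts_per_sec(osds, ops_per_disk, replication, writes_per_put):
    """Aggregate disk ops divided by the disk writes each PUT costs."""
    return osds * ops_per_disk / (replication * writes_per_put)


def days_to_drain(total_puts, puts_per_sec):
    """Time needed to absorb the whole workload at the sustainable rate."""
    return total_puts / puts_per_sec / SECONDS_PER_DAY


rate = sustainable_puts_per_sec(osds=72, ops_per_disk=120,
                                replication=2, writes_per_put=3)
puts = 300 * 800_000  # 300 clients, about 800,000 uploads each
days = days_to_drain(puts, rate)

print(f"sustainable rate: {rate:.0f} PUTs/s")
print(f"total workload:   {puts:,} PUTs")
print(f"time to drain:    {days:.2f} days")
```

With these inputs the sustainable rate comes out at 1440 PUTs/s and the drain time at just under 2 days, which matches the figures in the reply. Any client fleet pushing faster than that rate will queue requests at the gateways until they time out.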