Bad drive caused radosgw to time out with HTTP 500s

"Jeppesen, Nelson" <Nelson.Jeppesen@xxxxxxxxxx> · Wed, 20 Mar 2013 15:42:54 -0700

Hello Ceph-Users,

I was testing our rados gateway, and after a few hours rgw started returning HTTP 500 responses for certain uploads. I did some digging and found that an HDD had died. The OSD was marked out, but not before a short rgw outage. Start to finish, the outage lasted 60 to 120 seconds.

I have a few questions:

1) FastCGI timed out after 30 seconds. If I raise the timeout to 120 seconds, will that protect me from future HDD failures?
	Example from the Apache error.log:

	[error] [client 10.194.255.14] FastCGI: incomplete headers (0 bytes) received from server "/var/www/s3gw.fcgi"
	[error] [client 10.194.255.1] FastCGI: comm with server "/var/www/s3gw.fcgi" aborted: idle timeout (30 sec)
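
	For reference, the change I have in mind would look roughly like the line below. This is only a sketch: it assumes mod_fastcgi with radosgw set up as an external server, and the socket path is a placeholder rather than my actual configuration.

	FastCgiExternalServer /var/www/s3gw.fcgi -socket /tmp/radosgw.sock -idle-timeout 120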

2) Why did it take so long for Ceph to recover? 

3) Is there anything I can do to improve HDD failure resiliency?

Thank you. 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com