RadosGW/Ceph Broken Pipe(?) error / Degraded PGs

Hi, 

while testing a setup with Ceph and RadosGW for storing document files, we
encountered some problems.
To see how the cluster behaves, we upload files to RadosGW (tested with both S3
and Swift) using a small Perl script, putting 100k+ files of varying sizes
(1 KB - 1 MB) into the cluster, in parallel with 100-1000 threads.
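For reference, a minimal sketch of the kind of upload loop we run (not the exact
script; endpoint, credentials, and bucket name are placeholders), here via
Net::Amazon::S3:

#!/usr/bin/perl
use strict;
use warnings;
use Net::Amazon::S3;

# Placeholder RadosGW endpoint and credentials.
my $s3 = Net::Amazon::S3->new({
    aws_access_key_id     => 'ACCESS_KEY',
    aws_secret_access_key => 'SECRET_KEY',
    host                  => 'rgw.example.com',  # RadosGW endpoint, not s3.amazonaws.com
    secure                => 0,
    retry                 => 1,
});

my $bucket = $s3->add_bucket({ bucket => 'loadtest' })
    or die $s3->err . ': ' . $s3->errstr;

# PUT objects of random size between 1 KB and 1 MB.
for my $i (1 .. 100_000) {
    my $size = 1024 + int rand(1024 * 1024 - 1024);
    my $data = 'x' x $size;
    $bucket->add_key("obj-$i", $data, { content_type => 'application/octet-stream' })
        or warn "PUT obj-$i failed: " . $s3->err . ': ' . $s3->errstr . "\n";
}

The real test runs 100-1000 of these loops in parallel (forked workers).
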
After some time, ceph status repeatedly shows the following error messages:

2013-06-12 10:02:18.583317 7f0fa57ee700  0 -- 192.168.3.170:0/46368 >>
192.168.3.170:6789/0 pipe(0x7f0fa000ff30 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
2013-06-12 10:02:24.584061 7f0fabf40700  0 -- 192.168.3.170:0/46368 >>
192.168.3.170:6789/0 pipe(0x7f0fa0006a50 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
...

This sometimes leaves the cluster in a state where we can no longer write to it
or delete objects. After restarting the ceph services, we get this:

HEALTH_WARN 475 pgs degraded; 475 pgs stuck unclean; recovery 77899/316778
degraded (24.591%)

After a few minutes up to one hour, the health is back to OK and cluster
operations work again without problems.

This problem occurred on a cluster we already run virtual machines on; the
VMs were never affected. We see the same issues on a test cluster.

Test cluster setup:
Two servers running Debian Squeeze with kernel 3.6.7-amd64 and Ceph Bobtail,
each with two OSDs and one mon.
Two Debian VMs running RadosGW.
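
Roughly, that layout corresponds to a ceph.conf like the following (hostnames,
the second mon address, and the RadosGW client section are placeholders, not
our exact config; only 192.168.3.170 appears in the logs above):

[global]
    auth cluster required = cephx
    auth service required = cephx
    auth client required = cephx

[mon.a]
    host = ceph1                       ; placeholder hostname
    mon addr = 192.168.3.170:6789
[mon.b]
    host = ceph2                       ; placeholder hostname
    mon addr = 192.168.3.171:6789      ; placeholder address

[osd.0]
    host = ceph1
[osd.1]
    host = ceph1
[osd.2]
    host = ceph2
[osd.3]
    host = ceph2

[client.radosgw.gateway]
    host = rgw1                        ; placeholder hostname
    rgw socket path = /var/run/ceph/radosgw.sock
    keyring = /etc/ceph/keyring.radosgw.gateway
    log file = /var/log/ceph/radosgw.log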

Any idea what could be causing these problems?

BR Paul
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



