On 02/21/2015 12:12 PM, Jeffrey McDonald wrote:
Hi,
We have a Ceph Giant installation with a radosgw interface. There are
198 OSDs on seven OSD servers, and we're seeing OSD failures on the
system when users try to write files via the S3 interface. We're more
likely to see the failures if the files are larger than 1 GB and if the
files go to a newly created bucket. We have seen failures for older
buckets, but those seem to happen less frequently. I can reliably crash
an OSD by writing a 3.6 GB file to a newly created bucket.
Three weeks ago we upgraded from Firefly to Giant to get better
performance. Under Firefly it was impossible to break the system; we
have only had these issues since moving to Giant. We've gone through
tests with iptables, sysctl parameters, and different versions of s3cmd
(along with different Python versions), and there is no indication that
any of these matter for the failures.
Hi Jeff,
Did increasing the heartbeat grace period on the OSDs and the monitors
help at all? Is there any other system logging on the OSDs that shows
interesting behavior (excessive major page faults, high CPU, etc.)? Can
you reproduce it with RADOS bench and/or RBD instead of with RGW?
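For the grace period, what I have in mind is something like this in
ceph.conf under [global] so both the mons and the OSDs pick it up (60s
is just an arbitrary bump over the 20s default):

    [global]
        osd heartbeat grace = 60

And to take RGW out of the picture, a plain RADOS bench write against a
scratch pool is usually enough, e.g. (the pool name is just a
placeholder):

    # 5 minutes of 4MB writes with 16 concurrent ops
    rados bench -p testpool 300 write -b 4194304 -t 16

If that also knocks OSDs over, the problem is below RGW.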
From the logs we saw earlier, it looks like multiple peers are reporting
no heartbeat from the OSD(s) after 20s. I think that has to be either a
network/firewall issue or something making the OSD heartbeats extremely
laggy. That's probably where I'd focus efforts.
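A couple of quick things I'd check along those lines (adjust the host
names and interfaces to your setup; the OSDs use ports in the
6800-7300 range by default):

    # make sure nothing is dropping OSD/heartbeat traffic between OSD hosts
    iptables -L -n | grep -i drop
    # watch per-OSD commit/apply latency for outliers while a big S3 upload runs
    ceph osd perf
    # if you use jumbo frames, verify the MTU end-to-end between OSD hosts
    ping -M do -s 8972 <other-osd-host>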
For posterity, another user saw something similar when transitioning
from Firefly to Giant, but I'm not sure it was ever resolved:
http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-November/044727.html
The last message in the thread indicates that it may be related to
deep-scrub.
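If you want to test the deep-scrub theory, you could temporarily flag
deep scrubs off cluster-wide and see whether the OSDs stay up during one
of the big uploads (remember to unset it afterwards):

    ceph osd set nodeep-scrub      # suspend deep scrubs cluster-wide
    # ... reproduce the failing 3.6 GB upload ...
    ceph osd unset nodeep-scrub    # re-enable when done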
Mark
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com