Re: Global, Synchronous Blocked Requests

On Fri, Nov 27, 2015 at 9:52 PM, Daniel Maraio <dmaraio@xxxxxxxxxx> wrote:
Hello,

  Can you provide some further details? What is the size of your objects, and how many objects do you have in your buckets? Are you using bucket index sharding, and are you sharding your objects over multiple buckets? Is the cluster doing any scrubbing during these periods? It sounds like you may be having trouble with your rgw bucket index. In our cluster, much smaller than yours mind you, it was necessary to put the rgw bucket index onto its own set of OSDs to isolate it from the rest of the cluster IO. We are still using single-object bucket indexes but plan to move to sharded bucket indexes eventually.
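
Isolating the index onto its own OSDs is interesting. If I follow you, that is essentially a separate CRUSH root plus a rule that the index pool gets pointed at -- something like the sketch below (names are made up, and the dedicated hosts/OSDs would have to be moved under the new root first):

    # separate CRUSH root and a rule that draws only from it (hypothetical names)
    ceph osd crush add-bucket rgw-index root
    ceph osd crush move index-host1 root=rgw-index     # repeat per dedicated host
    ceph osd crush rule create-simple rgw-index-rule rgw-index host

    # point the bucket index pool at the new rule
    ceph osd pool set .rgw.buckets.index crush_ruleset <ruleset id>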

In order (and I apologize if I'm conflating S3 buckets with Ceph buckets here):

 - Since this is an S3 cluster, object sizes range from a few bytes to tens of GB; most objects are around a megabyte or two.
 - We currently have 41.4M objects in the cluster.  Some buckets have a few objects, some have several million.
 - Yes, we are using bucket index sharding.
 - Objects are sharded (7,2 erasure coding), and the CRUSH map is set up such that each PG contains an OSD from each physical server (apologies if I'm misunderstanding "bucket" here) -- see the sketch after this list.
 - Scrubbing runs on the default schedule, so there has been no more or less scrubbing during this incident than before, when things were working well.  Scrub operations kick off periodically throughout the day and complete within a few minutes.
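
For what it's worth, the layout in the fourth bullet amounts to roughly the following (profile and pool names are made up, and the PG counts are placeholders -- the point is k=7, m=2 with host as the failure domain, so the nine chunks of each PG land on nine different servers):

    ceph osd erasure-code-profile set ec-7-2 k=7 m=2 ruleset-failure-domain=host
    ceph osd pool create my-ec-pool <pg_num> <pgp_num> erasure ec-7-2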

That reminds me -- we also disabled scrubbing for several hours, and we noticed no decrease in the rate of slow requests.
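
For reference, the usual way to toggle scrubbing cluster-wide is the pair of flags below (a sketch of the standard approach, not necessarily the exact commands we ran):

    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # ...observe for a few hours, then re-enable:
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub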

 

  You should determine which OSDs your bucket indexes are located on and see if a pattern emerges with the OSDs that have slow requests during these periods. You can use the command ' ceph pg ls-by-pool .rgw.buckets.index ' to show which PGs/OSDs the bucket index resides on.

- Daniel

When I run this (with a little bash-fu), I see 509 of my 648 OSDs marked as an acting primary for the 1024 PGs in that pool.  It'll take some digging to see what relationship exists between those OSDs and the ones marked as slow.
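
In case it helps anyone searching the archives later, the bash-fu was roughly the following (a sketch -- the JSON field names can differ between releases, so adjust the jq filter to match your version):

    # count how many index PGs each OSD is acting primary for
    ceph pg ls-by-pool .rgw.buckets.index --format=json \
        | jq -r '.[].acting_primary' | sort -n | uniq -c | sort -rn | head

    # OSDs currently reporting blocked/slow requests, for comparison
    ceph health detail | grep -Ei 'slow|blocked'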

I appreciate the quick response. 

Brian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
