Re: Rados bench result when increasing OSDs

Mark Nelson <mark.nelson@xxxxxxxxxxx> · Thu, 24 Oct 2013 08:47:50 -0500

On 10/24/2013 08:31 AM, Guang Yang wrote:
> Hi Mark, Greg and Kyle,
> Sorry for the late response, and thanks for the pointers on where to
> look.
> 
> We have exactly the same setup for the OSDs and pool replicas (I even
> created the same number of PGs in the small cluster); however, I can
> still reproduce this consistently.
> 
> This is the command I run:
> $ rados bench -p perf_40k_PG -b 5000 -t 3 --show-time 10 write
> 
> With 24 OSDs:
> Average Latency: 0.00494123
> Max latency:     0.511864
> Min latency:     0.002198
> 
> With 330 OSDs:
> Average Latency: 0.00913806
> Max latency:     0.021967
> Min latency:     0.005456
> 
> For the CRUSH rule, we are using the default one. The small cluster has
> 3 OSD hosts (11 + 11 + 2 OSDs); the large cluster has 30 OSD hosts
> (11 OSDs each).
> 
> I have a couple of questions:
>   1. Could the latency be due to our having only a three-layer
> hierarchy (root -> host -> OSD)? We are using the Straw bucket type (the
> default), which is O(N), so the computation grows as the number of hosts
> increases. I suspect not, as the computation is on the order of
> microseconds per my understanding.

I suspect this is very unlikely as well.
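
For a rough sense of why: a straw-style bucket does one hash per item and
keeps the largest draw, so going from 3 hosts to 30 adds only a few dozen
hashes per placement. Here is a simplified Python sketch of that idea. It is
not the actual CRUSH code; the hash function, the key format, and the
straw_select name are stand-ins chosen purely for illustration.

```python
import hashlib

def straw_select(key, items):
    """Pick one item from a straw-style bucket: each item draws a straw, longest wins.

    `items` maps an item name to a scaling factor, a stand-in for the
    per-item straw lengths that CRUSH derives from weights.
    """
    best_item, best_draw = None, -1.0
    for name, scale in items.items():  # one hash per item: O(N) in bucket size
        digest = hashlib.sha256(f"{key}:{name}".encode()).digest()
        draw = int.from_bytes(digest[:2], "big") * scale  # 16-bit draw times scale
        if draw > best_draw:
            best_item, best_draw = name, draw
    return best_item

# Hypothetical host buckets matching the two cluster sizes in this thread.
small = {f"host{i}": 1.0 for i in range(3)}
large = {f"host{i}": 1.0 for i in range(30)}
print(straw_select("pg.1a", small), straw_select("pg.1a", large))
```

The loop body is a single hash and a comparison, so the per-placement cost
grows linearly with the number of items in the bucket but stays small at
these sizes.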

> 
>   2. Is it possible that, because we have more OSDs, the cluster needs to
> maintain far more connections between OSDs, which potentially slows
> things down?

One thing here that might be very interesting is this:

After you run your tests, if you do something like:

find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} \
    dump_historic_ops \; > foo

on each OSD server, you will get a dump of the 10 slowest operations
over the last 10 minutes for each OSD on each server, and it will tell
you where in each OSD operations were backing up.  You can search
through these files by grepping for "duration" first, looking for the
long ones, and then going back and searching through the file for those
long durations and looking at the associated latencies.
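
If grepping by hand gets tedious, the first step can be scripted. The sketch
below assumes each op in the dump carries a numeric "duration" field, as the
grep suggestion implies; the slowest_durations name and the sample text are
made up for illustration, and any other field that happens to be named
"duration" would be picked up too.

```python
import re

# Matches text such as  "duration": 0.123456  (also tolerates a quoted number).
DURATION_RE = re.compile(r'"duration":\s*"?([0-9]+(?:\.[0-9]+)?)')

def slowest_durations(text, top=10):
    """Return the `top` largest duration values found in the dump text, descending."""
    durations = [float(m.group(1)) for m in DURATION_RE.finditer(text)]
    return sorted(durations, reverse=True)[:top]

# Hypothetical excerpt standing in for the contents of the "foo" file.
sample = '''
{ "description": "osd_op(hypothetical-1)", "duration": 0.004512 }
{ "description": "osd_op(hypothetical-2)", "duration": 1.250000 }
{ "description": "osd_op(hypothetical-3)", "duration": 0.310000 }
'''
print(slowest_durations(sample, top=2))  # prints [1.25, 0.31]
```

To use it on a real dump, read the file and pass its contents in, for example
slowest_durations(open("foo").read()), then search the file for the values
it reports.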

Something I have been investigating recently is time spent waiting for
osdmap propagation.  It's something I haven't had time to dig into
meaningfully, but if we were to see that this was more significant on
your larger cluster vs your smaller one, that would be very interesting
news.

> 
>   3. Is there anything else I might be missing?
> 
> Thanks all for the constant help.
> 
> Guang
> 
> 
> On 2013-10-22, at 10:22 PM, Guang Yang <yguang11@xxxxxxxxx 
> <mailto:yguang11@xxxxxxxxx>> wrote:
> 
>> Hi Kyle and Greg,
>> I will get back to you with more details tomorrow, thanks for the 
>> response.
>>
>> Thanks,
>> Guang
>> On 2013-10-22, at 9:37 AM, Kyle Bader <kyle.bader@xxxxxxxxx 
>> <mailto:kyle.bader@xxxxxxxxx>> wrote:
>>
>>> Besides what Mark and Greg said, it could be due to additional hops 
>>> through network devices. What network devices are you using, what is 
>>> the network topology, and does your CRUSH map reflect the network 
>>> topology?
>>>
>>> On Oct 21, 2013 9:43 AM, "Gregory Farnum" <greg@xxxxxxxxxxx 
>>> <mailto:greg@xxxxxxxxxxx>> wrote:
>>>
>>>     On Mon, Oct 21, 2013 at 7:13 AM, Guang Yang <yguang11@xxxxxxxxx
>>>     <mailto:yguang11@xxxxxxxxx>> wrote:
>>>     > Dear ceph-users,
>>>     > Recently I deployed a Ceph cluster with RadosGW, going from a small
>>>     one (24 OSDs) to a much bigger one (330 OSDs).
>>>     >
>>>     > When using rados bench to test the small cluster (24 OSDs), the
>>>     average latency was around 3ms (object size 5K), while for the
>>>     larger one (330 OSDs) the average latency was around 7ms (object
>>>     size 5K), roughly twice that of the small cluster.
>>>     >
>>>     > The OSDs in the two clusters have the same configuration: SAS
>>>     disks, with two partitions per disk, one for the journal and the
>>>     other for metadata.
>>>     >
>>>     > For PG numbers, the small cluster was tested with a pool having
>>>     100 PGs, and for the large cluster the pool has 43333 PGs (as I
>>>     plan to scale the cluster further, I chose a much larger PG count).
>>>     >
>>>     > Does my test result make sense? That is, when the number of PGs
>>>     and OSDs increases, might the latency get worse?
>>>
>>>     Besides what Mark said, can you describe your test in a little more
>>>     detail? Writing/reading, length of time, number of objects, etc.
>>>     -Greg
>>>     Software Engineer #42 @ http://inktank.com <http://inktank.com/>
>>>     | http://ceph.com <http://ceph.com/>
>>>     _______________________________________________
>>>     ceph-users mailing list
>>>     ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
>>>     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
> 
