Thanks Mark. I cannot connect to my hosts, so I will do the check and get back to you tomorrow.

Thanks,
Guang

On 2013-10-24, at 9:47 PM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:

> On 10/24/2013 08:31 AM, Guang Yang wrote:
>> Hi Mark, Greg and Kyle,
>> Sorry to respond this late, and thanks for providing the directions for
>> me to look at.
>>
>> We have exactly the same setup for the OSDs and pool replicas (and I even
>> tried to create the same number of PGs within the small cluster); however,
>> I can still reproduce this constantly.
>>
>> This is the command I run:
>> $ rados bench -p perf_40k_PG -b 5000 -t 3 --show-time 10 write
>>
>> With 24 OSDs:
>> Average Latency: 0.00494123
>> Max latency: 0.511864
>> Min latency: 0.002198
>>
>> With 330 OSDs:
>> Average Latency: 0.00913806
>> Max latency: 0.021967
>> Min latency: 0.005456
>>
>> In terms of the CRUSH rule, we are using the default one. The small
>> cluster has 3 OSD hosts (11 + 11 + 2), and the large cluster has 30 OSD
>> hosts (11 * 30).
>>
>> I have a couple of questions:
>> 1. Is it possible that the latency is due to our having only a three-layer
>> hierarchy (root -> host -> OSD)? We are using the straw bucket type (the
>> default), which has O(N) selection speed, so as the number of hosts
>> increases, the computation increases as well. I suspect not, as the
>> computation is on the order of microseconds per my understanding.
>
> I suspect this is very unlikely as well.
>
>>
>> 2. Is it possible that, because we have more OSDs, the cluster needs to
>> maintain far more connections between OSDs, which potentially slows
>> things down?
>
> One thing here that might be very interesting is this:
>
> After you run your tests, if you do something like:
>
> find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {}
> dump_historic_ops \; > foo
>
> on each OSD server, you will get a dump of the 10 slowest operations
> over the last 10 minutes for each OSD on each server, and it will tell
> you where in each OSD operations were backing up. You can sort of search
> through these files by grepping for "duration" first, looking for the
> long ones, and then going back and searching through the file for those
> long durations and looking at the associated latencies.
>
> Something I have been investigating recently is time spent waiting for
> osdmap propagation. It's something I haven't had time to dig into
> meaningfully, but if we were to see that this was more significant on
> your larger cluster vs. your smaller one, that would be very interesting
> news.
>
>>
>> 3. Anything else I might have missed?
>>
>> Thanks all for the constant help.
>>
>> Guang
>>
>>
>> On 2013-10-22, at 10:22 PM, Guang Yang <yguang11@xxxxxxxxx> wrote:
>>
>>> Hi Kyle and Greg,
>>> I will get back to you with more details tomorrow, thanks for the
>>> response.
>>>
>>> Thanks,
>>> Guang
>>> On 2013-10-22, at 9:37 AM, Kyle Bader <kyle.bader@xxxxxxxxx> wrote:
>>>
>>>> Besides what Mark and Greg said, it could be due to additional hops
>>>> through network devices. What network devices are you using, what is
>>>> the network topology, and does your CRUSH map reflect the network
>>>> topology?
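
For the CRUSH map question: the hierarchy can be double-checked with something
like the following (a minimal sketch, assuming the standard ceph and crushtool
CLIs on an admin node; the file names are just placeholders):

$ ceph osd tree
$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -d crushmap.bin -o crushmap.txt

"ceph osd tree" should show the root -> host -> osd hierarchy, and the
decompiled crushmap.txt the bucket types and rules, which can then be compared
against the actual racks and switches.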
>>>>
>>>> On Oct 21, 2013 9:43 AM, "Gregory Farnum" <greg@xxxxxxxxxxx> wrote:
>>>>
>>>> On Mon, Oct 21, 2013 at 7:13 AM, Guang Yang <yguang11@xxxxxxxxx> wrote:
>>>>> Dear ceph-users,
>>>>> Recently I deployed a ceph cluster with RadosGW, from a small one
>>>>> (24 OSDs) to a much bigger one (330 OSDs).
>>>>>
>>>>> When using rados bench to test the small cluster (24 OSDs), it showed
>>>>> the average latency was around 3ms (object size is 5K), while for the
>>>>> larger one (330 OSDs), the average latency was around 7ms (object size
>>>>> 5K), twice that of the small cluster.
>>>>>
>>>>> The OSDs within the two clusters have the same configuration: SAS
>>>>> disks, with two partitions per disk, one for the journal and the other
>>>>> for metadata.
>>>>>
>>>>> For PG numbers, the small cluster was tested with a pool having 100
>>>>> PGs, and for the large cluster the pool has 43333 PGs (as I will
>>>>> further scale the cluster, I chose a much larger PG count).
>>>>>
>>>>> Does my test result make sense? That is, is it expected that the
>>>>> latency goes up as the PG and OSD numbers increase?
>>>>
>>>> Besides what Mark said, can you describe your test in a little more
>>>> detail? Writing/reading, length of time, number of objects, etc.
>>>> -Greg
>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
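
Following up on Mark's dump_historic_ops suggestion, one way to sift the
collected output could be something like this (a rough, untested sketch; it
assumes the per-OSD dumps were concatenated into the file foo as in Mark's
find command, and that each op record carries a "duration" field):

$ grep '"duration"' foo | tr -d '", ' | cut -d: -f2 | sort -rn | head -20

The largest values can then be grepped for in foo again to see which OSD they
came from and at which step in the op the time was spent.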