Thanks Mark. I cannot connect to my hosts, so I will do the check and get back to you tomorrow.

Thanks,
Guang

On 2013-10-24, at 9:47 PM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:

> On 10/24/2013 08:31 AM, Guang Yang wrote:
>> Hi Mark, Greg and Kyle,
>> Sorry to respond this late, and thanks for providing the directions for
>> me to look at.
>>
>> We have exactly the same setup for the OSDs and pool replicas (and I even
>> tried to create the same number of PGs within the small cluster); however,
>> I can still reproduce this constantly.
>>
>> This is the command I run:
>> $ rados bench -p perf_40k_PG -b 5000 -t 3 --show-time 10 write
>>
>> With 24 OSDs:
>> Average Latency: 0.00494123
>> Max latency: 0.511864
>> Min latency: 0.002198
>>
>> With 330 OSDs:
>> Average Latency: 0.00913806
>> Max latency: 0.021967
>> Min latency: 0.005456
>>
>> In terms of the CRUSH rule, we are using the default one. The small
>> cluster has 3 OSD hosts (11 + 11 + 2), and the large cluster has 30 OSD
>> hosts (11 * 30).
>>
>> I have a couple of questions:
>> 1. Is it possible that the latency is due to our having only a three-layer
>> hierarchy (root -> host -> OSD)? We are using the straw bucket type (the
>> default), which has O(N) selection speed, so as the number of hosts
>> increases, the computation increases as well. I suspect not, as the
>> computation is on the order of microseconds per my understanding.
>
> I suspect this is very unlikely as well.
>
>>
>> 2. Is it possible that, because we have more OSDs, the cluster needs to
>> maintain far more connections between OSDs, which potentially slows
>> things down?
>
> One thing here that might be very interesting is this:
>
> After you run your tests, if you do something like:
>
> find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {}
> dump_historic_ops \; > foo
>
> on each OSD server, you will get a dump of the 10 slowest operations
> over the last 10 minutes for each OSD on each server, and it will tell
> you where in each OSD operations were backing up. You can sort of search
> through these files by grepping for "duration" first, looking for the
> long ones, and then going back and searching through the file for those
> long durations and looking at the associated latencies.
>
> Something I have been investigating recently is time spent waiting for
> osdmap propagation. It's something I haven't had time to dig into
> meaningfully, but if we were to see that this was more significant on
> your larger cluster vs. your smaller one, that would be very interesting
> news.
>
>>
>> 3. Anything else I might have missed?
>>
>> Thanks all for the constant help.
>>
>> Guang
>>
>>
>> On 2013-10-22, at 10:22 PM, Guang Yang <yguang11@xxxxxxxxx> wrote:
>>
>>> Hi Kyle and Greg,
>>> I will get back to you with more details tomorrow, thanks for the
>>> response.
>>>
>>> Thanks,
>>> Guang
>>> On 2013-10-22, at 9:37 AM, Kyle Bader <kyle.bader@xxxxxxxxx> wrote:
>>>
>>>> Besides what Mark and Greg said, it could be due to additional hops
>>>> through network devices. What network devices are you using, what is
>>>> the network topology, and does your CRUSH map reflect the network
>>>> topology?
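
For the CRUSH map question: the hierarchy can be double-checked with something
like the following (a minimal sketch, assuming the standard ceph and crushtool
CLIs on an admin node; the file names are just placeholders):

$ ceph osd tree
$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -d crushmap.bin -o crushmap.txt

"ceph osd tree" should show the root -> host -> osd hierarchy, and the
decompiled crushmap.txt the bucket types and rules, which can then be compared
against the actual racks and switches.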
>>>>
>>>> On Oct 21, 2013 9:43 AM, "Gregory Farnum" <greg@xxxxxxxxxxx> wrote:
>>>>
>>>> On Mon, Oct 21, 2013 at 7:13 AM, Guang Yang <yguang11@xxxxxxxxx> wrote:
>>>>> Dear ceph-users,
>>>>> Recently I deployed a ceph cluster with RadosGW, from a small one
>>>>> (24 OSDs) to a much bigger one (330 OSDs).
>>>>>
>>>>> When using rados bench to test the small cluster (24 OSDs), it showed
>>>>> the average latency was around 3ms (object size is 5K), while for the
>>>>> larger one (330 OSDs), the average latency was around 7ms (object size
>>>>> 5K), twice that of the small cluster.
>>>>>
>>>>> The OSDs within the two clusters have the same configuration: SAS
>>>>> disks, with two partitions per disk, one for the journal and the other
>>>>> for metadata.
>>>>>
>>>>> For PG numbers, the small cluster was tested with a pool having 100
>>>>> PGs, and for the large cluster the pool has 43333 PGs (as I will
>>>>> further scale the cluster, I chose a much larger PG count).
>>>>>
>>>>> Does my test result make sense? That is, is it expected that the
>>>>> latency goes up as the PG and OSD numbers increase?
>>>>
>>>> Besides what Mark said, can you describe your test in a little more
>>>> detail? Writing/reading, length of time, number of objects, etc.
>>>> -Greg
>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
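
Following up on Mark's dump_historic_ops suggestion, one way to sift the
collected output could be something like this (a rough, untested sketch; it
assumes the per-OSD dumps were concatenated into the file foo as in Mark's
find command, and that each op record carries a "duration" field):

$ grep '"duration"' foo | tr -d '", ' | cut -d: -f2 | sort -rn | head -20

The largest values can then be grepped for in foo again to see which OSD they
came from and at which step in the op the time was spent.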