Re: Unexpectedly slow write performance (RBD cinder volumes)

On Thu, Aug 22, 2013 at 2:23 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
> Hey Greg,
>
> I ran into a similar problem and we're in the process of tracking it
> down here on the list.  Try downgrading your OSD binaries to 0.61.8
> Cuttlefish and re-test.  If it's significantly faster on RBD, you're
> probably hitting the same problem I have with Dumpling.
>
> PS: Only downgrade your OSDs.  Cuttlefish monitors don't seem to want
> to start with a database that has been touched by a Dumpling monitor,
> and they don't talk to them, either.
>
> PPS: I've also had OSDs fail to start with an assert while processing
> the journal during these upgrade/downgrade tests, mostly when coming
> down from Dumpling to Cuttlefish.  If you encounter that, delete your
> journal and re-create it with `ceph-osd -i <OSD-ID> --mkjournal`.  Your
> data store will be OK, as far as I can tell.

Careful: deleting the journal potentially throws away updates to your
data store! If that is a concern, flush the journal with the dumpling
binary before downgrading.
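
For reference, a minimal sketch of that sequence (assuming a sysvinit-style
deployment; <OSD-ID> is a placeholder for your OSD id):

service ceph stop osd.<OSD-ID>          # stop the OSD first
ceph-osd -i <OSD-ID> --flush-journal    # replay pending journal entries into the data store, using the dumpling binary
# ...then downgrade the OSD binaries; only re-create the journal if it is damaged:
ceph-osd -i <OSD-ID> --mkjournal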

>
>
>    Regards,
>
>      Oliver
>
> On Thu, 2013-08-22 at 10:55 -0700, Greg Poirier wrote:
>> I have been benchmarking our Ceph installation for the last week or
>> so, and I've come across an issue that I'm having some difficulty
>> with.
>>
>>
>> Ceph bench reports reasonable write throughput at the OSD level:
>>
>>
>> ceph tell osd.0 bench
>> { "bytes_written": 1073741824,
>>   "blocksize": 4194304,
>>   "bytes_per_sec": "47288267.000000"}
>>
>>
>> Running this across all OSDs produces 50-55 MB/s on average, which is
>> fine with us.  We were expecting around 100 MB/s / 2, since the journal
>> and OSD share a disk (separate partitions).
>>
>>
>> What I wasn't expecting was the following:
>>
>>
>> I tested 1, 2, 4, 8, 16, 24, and 32 VMs writing simultaneously
>> against 33 OSDs.  Aggregate write throughput peaked under 400 MB/s:
>>
>>
>> VMs   MB/s
>>  1    196.0
>>  2    285.9
>>  4    351.9
>>  8    386.5
>> 16    363.9
>> 24    353.6
>> 32    349.0
>>
>>
>>
>> I was hoping to see something closer to (# of OSDs) * (average ceph
>> bench value), i.e. roughly 1.2 GB/s peak aggregate write throughput.
>>
>>
>> We're seeing excellent read, randread performance, but writes are a
>> bit of a bother.
>>
>>
>> Does anyone have any suggestions?
You don't appear to have accounted for the 2x replication (where all
writes go to two OSDs) in these calculations.  I assume your pools have
size 2 (or 3?) for these tests.  Size 3 would explain the performance
difference entirely; 2x replication still leaves it a bit low, but it
narrows the gap to ~350/600 instead of ~350/1200. :)
You mentioned that your average osd bench throughput was ~50 MB/s;
what's the range?  Have you run any rados bench tests?  What is your PG
count across the cluster?
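
For reference, a quick way to gather those numbers (a sketch; <pool> is a
placeholder for the pool backing your RBD volumes):

rados bench -p <pool> 60 write     # 60-second aggregate write benchmark from one client
ceph osd pool get <pool> pg_num    # PG count for a single pool
ceph osd dump | grep pg_num        # pg_num for every pool in the cluster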
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




