Hey Greg,

Thanks for the tip!  I was assuming a clean shutdown of the OSD should
flush the journal for you and have the OSD try to exit with its
data-store in a clean state?  Otherwise, I would first have to stop
updates to that particular OSD, then flush the journal, then stop it?
(A sketch of the sequence I have in mind is at the end of this mail.)

Regards,

   Oliver

On do, 2013-08-22 at 14:34 -0700, Gregory Farnum wrote:
> On Thu, Aug 22, 2013 at 2:23 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
> > Hey Greg,
> >
> > I encountered a similar problem and we're just in the process of
> > tracking it down here on the list.  Try downgrading your OSD-binaries
> > to 0.61.8 Cuttlefish and re-test.  If it's significantly faster on
> > RBD, you're probably experiencing the same problem I have with
> > Dumpling.
> >
> > PS: Only downgrade your OSDs.  Cuttlefish-monitors don't seem to want
> > to start with a database that has been touched by a Dumpling-monitor
> > and don't talk to them, either.
> >
> > PPS: I've also had OSDs no longer start with an assert while
> > processing the journal during these upgrade/downgrade-tests, mostly
> > when coming down from Dumpling to Cuttlefish.  If you encounter
> > those, delete your journal and re-create it with `ceph-osd -i
> > <OSD-ID> --mkjournal'.  Your data-store will be OK, as far as I can
> > tell.
>
> Careful — deleting the journal is potentially throwing away updates to
> your data store! If this is a problem you should flush the journal
> with the dumpling binary before downgrading.
>
> >
> > Regards,
> >
> >    Oliver
> >
> > On do, 2013-08-22 at 10:55 -0700, Greg Poirier wrote:
> >> I have been benchmarking our Ceph installation for the last week or
> >> so, and I've come across an issue that I'm having some difficulty
> >> with.
> >>
> >> Ceph bench reports reasonable write throughput at the OSD level:
> >>
> >> ceph tell osd.0 bench
> >> { "bytes_written": 1073741824,
> >>   "blocksize": 4194304,
> >>   "bytes_per_sec": "47288267.000000"}
> >>
> >> Running this across all OSDs produces on average 50-55 MB/s, which
> >> is fine with us.  We were expecting around 100 MB/s / 2 (journal
> >> and OSD on same disk, separate partitions).
> >>
> >> What I wasn't expecting was the following:
> >>
> >> I tested 1, 2, 4, 8, 16, 24, and 32 VMs simultaneously writing
> >> against 33 OSDs.  Aggregate write throughput peaked under 400 MB/s:
> >>
> >>  1   196.013671875
> >>  2   285.8759765625
> >>  4   351.9169921875
> >>  8   386.455078125
> >> 16   363.8583984375
> >> 24   353.6298828125
> >> 32   348.9697265625
> >>
> >> I was hoping to see something closer to # OSDs * average value for
> >> ceph bench (approximately 1.2 GB/s peak aggregate write
> >> throughput).
> >>
> >> We're seeing excellent read and randread performance, but writes
> >> are a bit of a bother.
> >>
> >> Does anyone have any suggestions?
>
> You don't appear to have accounted for the 2x replication (where all
> writes go to two OSDs) in these calculations.  I assume your pools
> have size 2 (or 3?) for these tests.  3 would explain the performance
> difference entirely; 2x replication leaves it still a bit low, but
> takes the difference down to ~350/600 instead of ~350/1200. :)
> You mentioned that your average osd bench throughput was ~50MB/s;
> what's the range?  Have you run any rados bench tests?  What is your
> PG count across the cluster?
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
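P.S.: For reference, here's the stop/flush/downgrade sequence I had in
mind.  A rough, untested sketch, assuming sysvinit-style init scripts
and an OSD with id 0; adjust for your own setup:

  # Keep the cluster from rebalancing while the OSD is down.
  ceph osd set noout

  # Stop the OSD cleanly, so no new updates reach it.
  service ceph stop osd.0

  # Flush the journal into the data-store, using the Dumpling
  # ceph-osd binary, *before* swapping in the Cuttlefish binaries.
  ceph-osd -i 0 --flush-journal

  # Downgrade the binaries, then bring the OSD back up.
  service ceph start osd.0
  ceph osd unset noout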
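P.P.S.: Greg P., in case it helps with the questions Gregory asks at
the end, something like the below should gather those numbers.  Also an
untested sketch; it assumes your OSD ids run 0-32, and <scratch-pool>
is a placeholder for a pool you can safely write benchmark objects
into:

  # Bench every OSD individually, to see the spread rather than
  # just the average.
  for i in $(seq 0 32); do ceph tell osd.$i bench; done

  # Cluster-level write benchmark: 60 seconds, 16 concurrent writers.
  rados -p <scratch-pool> bench 60 write -t 16

  # PG counts across the cluster, per pool.
  ceph osd dump | grep pg_num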