Hey Greg,

Thanks for the tip!  I was assuming a clean shutdown of the OSD should
flush the journal for you and have the OSD try to exit with its
data-store in a clean state?  Otherwise, I would first have to stop
updates to that particular OSD, then flush the journal, then stop it?
(A sketch of the sequence I have in mind is at the end of this mail.)

Regards,

   Oliver

On do, 2013-08-22 at 14:34 -0700, Gregory Farnum wrote:
> On Thu, Aug 22, 2013 at 2:23 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
> > Hey Greg,
> >
> > I encountered a similar problem and we're just in the process of
> > tracking it down here on the list.  Try downgrading your OSD-binaries
> > to 0.61.8 Cuttlefish and re-test.  If it's significantly faster on
> > RBD, you're probably experiencing the same problem I have with
> > Dumpling.
> >
> > PS: Only downgrade your OSDs.  Cuttlefish-monitors don't seem to want
> > to start with a database that has been touched by a Dumpling-monitor
> > and don't talk to them, either.
> >
> > PPS: I've also had OSDs no longer start with an assert while
> > processing the journal during these upgrade/downgrade-tests, mostly
> > when coming down from Dumpling to Cuttlefish.  If you encounter
> > those, delete your journal and re-create it with `ceph-osd -i
> > <OSD-ID> --mkjournal'.  Your data-store will be OK, as far as I can
> > tell.
>
> Careful — deleting the journal is potentially throwing away updates to
> your data store! If this is a problem you should flush the journal
> with the dumpling binary before downgrading.
>
> >
> > Regards,
> >
> >    Oliver
> >
> > On do, 2013-08-22 at 10:55 -0700, Greg Poirier wrote:
> >> I have been benchmarking our Ceph installation for the last week or
> >> so, and I've come across an issue that I'm having some difficulty
> >> with.
> >>
> >> Ceph bench reports reasonable write throughput at the OSD level:
> >>
> >> ceph tell osd.0 bench
> >> { "bytes_written": 1073741824,
> >>   "blocksize": 4194304,
> >>   "bytes_per_sec": "47288267.000000"}
> >>
> >> Running this across all OSDs produces on average 50-55 MB/s, which
> >> is fine with us.  We were expecting around 100 MB/s / 2 (journal
> >> and OSD on same disk, separate partitions).
> >>
> >> What I wasn't expecting was the following:
> >>
> >> I tested 1, 2, 4, 8, 16, 24, and 32 VMs simultaneously writing
> >> against 33 OSDs.  Aggregate write throughput peaked under 400 MB/s:
> >>
> >>  1   196.013671875
> >>  2   285.8759765625
> >>  4   351.9169921875
> >>  8   386.455078125
> >> 16   363.8583984375
> >> 24   353.6298828125
> >> 32   348.9697265625
> >>
> >> I was hoping to see something closer to # OSDs * average value for
> >> ceph bench (approximately 1.2 GB/s peak aggregate write
> >> throughput).
> >>
> >> We're seeing excellent read and randread performance, but writes
> >> are a bit of a bother.
> >>
> >> Does anyone have any suggestions?
>
> You don't appear to have accounted for the 2x replication (where all
> writes go to two OSDs) in these calculations.  I assume your pools
> have size 2 (or 3?) for these tests.  3 would explain the performance
> difference entirely; 2x replication leaves it still a bit low, but
> takes the difference down to ~350/600 instead of ~350/1200. :)
> You mentioned that your average osd bench throughput was ~50MB/s;
> what's the range?  Have you run any rados bench tests?  What is your
> PG count across the cluster?
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
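P.S.: For reference, here's the stop/flush/downgrade sequence I had in
mind.  A rough, untested sketch, assuming sysvinit-style init scripts
and an OSD with id 0; adjust for your own setup:

  # Keep the cluster from rebalancing while the OSD is down.
  ceph osd set noout

  # Stop the OSD cleanly, so no new updates reach it.
  service ceph stop osd.0

  # Flush the journal into the data-store, using the Dumpling
  # ceph-osd binary, *before* swapping in the Cuttlefish binaries.
  ceph-osd -i 0 --flush-journal

  # Downgrade the binaries, then bring the OSD back up.
  service ceph start osd.0
  ceph osd unset noout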
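P.P.S.: Greg P., in case it helps with the questions Gregory asks at
the end, something like the below should gather those numbers.  Also an
untested sketch; it assumes your OSD ids run 0-32, and <scratch-pool>
is a placeholder for a pool you can safely write benchmark objects
into:

  # Bench every OSD individually, to see the spread rather than
  # just the average.
  for i in $(seq 0 32); do ceph tell osd.$i bench; done

  # Cluster-level write benchmark: 60 seconds, 16 concurrent writers.
  rados -p <scratch-pool> bench 60 write -t 16

  # PG counts across the cluster, per pool.
  ceph osd dump | grep pg_num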