On Fri, Sep 20, 2013 at 4:47 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> On Fri, Sep 20, 2013 at 5:27 AM, Andreas Joachim Peters
> <Andreas.Joachim.Peters@xxxxxxx> wrote:
>> Hi,
>>
>> We ran some benchmarks of object read/write latencies on the CERN Ceph
>> installation.
>>
>> The cluster has 44 nodes and ~1k disks, all on 10GbE, and the pool
>> configuration has 3 copies. Client & server are both 0.67.
>>
>> The latencies we observe (using tiny objects ... 5 bytes) on the idle
>> pool:
>
> Does that mean you have non-idle pools in the same cluster? Unless
> you've got physical separation, the fact that the pool is idle doesn't
> mean much unless the cluster is as well. (Though if you're getting 60
> object writes/hard drive/second I think it probably is idle.)

The cluster is pretty idle -- we'll separate it out with some new
hardware to exclude other usage from this test.

>
>> write full object (sync)   ~65-80 ms
>> append to object           ~60-75 ms
>> set xattr on object        ~65-80 ms
>> lock object                ~65-80 ms
>> stat object                ~1 ms
>
> Anecdotally those write times look a little high to me, but my
> expectations are probably set for 2x replication and I'm not sure how
> much difference that makes (I would expect not much, but maybe there's
> something happening I haven't considered).
>
>> We seem to saturate the pool at ~20k object writes/s (= 60k/s
>> internally with 3 copies).
>>
>> Is there an easy explanation for ~80 ms with essentially no payload,
>> and possible tuning to reduce it? I measured ~33 ms for an append of a
>> few bytes plus fsync on one of these disks, which probably explains
>> part of the latency.
>
> Ah, and that's also higher than I would normally expect for a disk
> access, so that's probably why the above numbers seem a little large.
> Separately, what's your journal config? Does each spindle have a
> partition? This math all works out to about what I'd expect if so.

The journals are on the same spinning disk & partition as the OSD data.
Each spinning disk is one OSD.

Cheers,
Dan

>
>> Then I tried the async API to see if there is a difference between
>> wait_for_complete and wait_for_safe ... shouldn't wait_for_complete
>> return much sooner? I always get comparable results ...
>
> You're presumably on xfs? With non-btrfs filesystems the OSDs have to
> use write-ahead journaling, so they always commit the op to disk
> before applying it to the local FS.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
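
For reference, a minimal sketch of how such a tiny-object latency test
could look with the librados C API. This is an illustration under stated
assumptions, not the benchmark code from the thread: the pool name
"testpool", the object name and the 5-byte payload are placeholders.
Build with "gcc -o tiny_lat tiny_lat.c -lrados".

/* Time a few synchronous tiny-object operations against one pool. */
#include <rados/librados.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* elapsed milliseconds since a monotonic timestamp */
static double ms_since(const struct timespec *t0)
{
    struct timespec t1;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0->tv_sec) * 1e3 + (t1.tv_nsec - t0->tv_nsec) / 1e6;
}

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    struct timespec t0;
    const char *payload = "12345";          /* ~5-byte payload, as in the test */

    if (rados_create(&cluster, NULL) < 0) return 1;
    rados_conf_read_file(cluster, NULL);    /* default /etc/ceph/ceph.conf */
    if (rados_connect(cluster) < 0) return 1;
    if (rados_ioctx_create(cluster, "testpool", &io) < 0) return 1;  /* placeholder pool */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    rados_write_full(io, "obj.0", payload, strlen(payload));
    printf("write_full: %.1f ms\n", ms_since(&t0));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    rados_append(io, "obj.0", payload, strlen(payload));
    printf("append:     %.1f ms\n", ms_since(&t0));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    rados_setxattr(io, "obj.0", "user.test", payload, strlen(payload));
    printf("setxattr:   %.1f ms\n", ms_since(&t0));

    uint64_t size;
    time_t mtime;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    rados_stat(io, "obj.0", &size, &mtime);
    printf("stat:       %.1f ms\n", ms_since(&t0));

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}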
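
And a similar sketch for the ack-vs-commit question raised above: issue
one asynchronous write_full and compare when rados_aio_wait_for_complete()
(acked, applied in memory on all replicas) and rados_aio_wait_for_safe()
(committed to the journals) return. With write-ahead journaling on xfs
both should return at essentially the same time, which would match the
observation in the thread. Pool and object names are again placeholders.

#include <rados/librados.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

static double ms_since(const struct timespec *t0)
{
    struct timespec t1;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0->tv_sec) * 1e3 + (t1.tv_nsec - t0->tv_nsec) / 1e6;
}

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    rados_completion_t c;
    struct timespec t0;
    const char *payload = "12345";

    if (rados_create(&cluster, NULL) < 0) return 1;
    rados_conf_read_file(cluster, NULL);
    if (rados_connect(cluster) < 0) return 1;
    if (rados_ioctx_create(cluster, "testpool", &io) < 0) return 1;  /* placeholder pool */

    rados_aio_create_completion(NULL, NULL, NULL, &c);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    rados_aio_write_full(io, "obj.async", c, payload, strlen(payload));

    rados_aio_wait_for_complete(c);   /* ack: applied in memory on all replicas */
    printf("complete after %.1f ms\n", ms_since(&t0));

    rados_aio_wait_for_safe(c);       /* commit: journaled on all replicas */
    printf("safe     after %.1f ms\n", ms_since(&t0));

    rados_aio_release(c);
    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}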