Re: newstore performance update

Sage Weil <sweil@xxxxxxxxxx> · Mon, 4 May 2015 11:08:26 -0700 (PDT)

On Mon, 4 May 2015, Mark Nelson wrote:
> On 05/01/2015 07:33 PM, Sage Weil wrote:
> > Ok, I think I figured out what was going on.  The db->submit_transaction()
> > call (from _txc_finish_io) was blocking when there was a
> > submit_transaction_sync() in progress.  This was making me hit a ceiling
> > of about 80 iops on my slow disk.  When I moved that into _kv_sync_thread
> > (just prior to the submit_transaction_sync() call) it jumps up to 300+
> > iops.
> > 
> > I pushed that to wip-newstore.
> > 
> > Further, if I drop the O_DSYNC, it goes up another 50% or so.  It'll take
> > a bit more coding to effectively batch the (implicit) fdatasync from the
> > O_DSYNC up, though, and capture some of that.  Next!
> > 
> > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> Ran through a bunch of tests on 0c728ccc over the weekend:
> 
> http://nhm.ceph.com/newstore/5d96fe6f_vs_0c728ccc.pdf
> 
> The good news is that sequential writes on spinning disks are looking
> significantly better!  We went from 40x slower than filestore for small
> sequential IO to only about 30-40% slower, and we become faster than
> filestore at 64KB+ IO sizes.
> 
> 128KB-2MB sequential writes with data on spinning disk and rocksdb on SSD
> regressed.  Newstore is no longer really any faster than filestore for those
> IO sizes.  We saw something similar for random IO, where the
> spinning-disk-only results improved and the spinning disk + rocksdb on SSD
> results regressed.
> 
> With everything on SSD, we saw small sequential writes improve and nearly all
> random writes regress.  Not sure how much these regressions are due to
> 0c728ccc vs other commits yet.

That's surprising!  I pushed a commit that makes this tunable:

 newstore sync submit transaction = false (default)

Can you see if setting that to true (effectively reverting my last change) 
fixes the ssd regression?
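
For reference, a minimal sketch of what that override might look like in
ceph.conf.  The option name is the one given above; putting it under the
[osd] section is an assumption here, not something confirmed in this thread.

```ini
[osd]
        # assumed placement; true should restore the pre-0c728ccc behavior
        newstore sync submit transaction = true
```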

It may also be that this is a simple locking issue that we can fix in 
rocksdb.  Again, the behavior I saw was that the db->submit_transaction() 
call would block until the sync commit (from kv_sync_thread) finished.  
I would expect rocksdb to be more careful about that, so maybe there is 
something else funny/subtle going on.
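
To make the change being discussed concrete, here is a rough, self-contained
sketch of the pattern: the I/O completion path only queues a transaction, and
a single sync thread drains the queue, submits the queued transactions, and
then issues one sync commit for the whole batch.  This is a toy model, not the
newstore code and not the rocksdb API.  ToyDB, KVSyncer, queue(), and the
"sync-marker" transaction are hypothetical names invented for illustration.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the key/value store (NOT the rocksdb API).
struct ToyDB {
  std::vector<std::string> applied;  // every transaction applied, in order
  int sync_commits = 0;              // number of (expensive) sync commits

  void submit_transaction(const std::string& t) { applied.push_back(t); }
  void submit_transaction_sync(const std::string& t) {
    applied.push_back(t);
    ++sync_commits;
  }
};

// Hypothetical analogue of a kv sync thread: the only thread touching the db.
class KVSyncer {
 public:
  explicit KVSyncer(ToyDB& db) : db_(db), thread_([this] { run(); }) {}
  ~KVSyncer() { stop(); }

  // Called from the I/O completion path: enqueue only, never block on the db.
  void queue(std::string txn) {
    {
      std::lock_guard<std::mutex> l(lock_);
      pending_.push_back(std::move(txn));
    }
    cond_.notify_one();
  }

  // Drain whatever is still queued, then join the sync thread.
  void stop() {
    {
      std::lock_guard<std::mutex> l(lock_);
      stopping_ = true;
    }
    cond_.notify_one();
    if (thread_.joinable()) thread_.join();
  }

 private:
  void run() {
    std::unique_lock<std::mutex> l(lock_);
    while (true) {
      cond_.wait(l, [this] { return stopping_ || !pending_.empty(); });
      if (pending_.empty()) return;  // stopping and fully drained
      std::deque<std::string> batch;
      batch.swap(pending_);          // take the whole batch under the lock
      l.unlock();
      // Submit the queued transactions just prior to the sync commit.
      for (const auto& t : batch) db_.submit_transaction(t);
      db_.submit_transaction_sync("sync-marker");  // one sync per batch
      l.lock();
    }
  }

  ToyDB& db_;
  std::mutex lock_;
  std::condition_variable cond_;
  std::deque<std::string> pending_;
  bool stopping_ = false;
  std::thread thread_;  // declared last so it starts after the other members
};
```

Callers would construct a KVSyncer around a ToyDB, call queue() from as many
threads as they like, and call stop() before reading the db.  The point of the
sketch is only that queue() never waits on a sync commit; it says nothing
about why the SSD numbers regressed.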

sage