RE: newstore performance update


 




> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
> owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> Sent: Wednesday, April 29, 2015 9:20 PM
> To: Chen, Xiaoxi
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: newstore performance update
> 
> 
> 
> On 04/29/2015 03:33 AM, Chen, Xiaoxi wrote:
> > Hi Mark,
> > 	Really good test :)  I only played a bit on SSD; the parallel WAL threads
> really help, but we still have a long way to go, especially in the all-SSD case.
> > I tried this
> https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
> by hacking rocksdb, but the performance difference was negligible.
> >
> > The rocksdb digest speed should be the problem, I believe. I planned
> to prove this by skipping all db transactions, but failed after hitting another
> deadlock bug in newstore.
> 
> I think Sage has worked through all of the deadlock bugs I was seeing, short of
> possibly something going on in the overlay code.  That probably shouldn't
> matter on SSD though, as it's probably best to leave overlay off.
> 
> >
> > Below are a few more comments.
> >> Sage has been furiously working away at fixing bugs in newstore and
> >> improving performance.  Specifically we've been focused on write
> >> performance as newstore was lagging filestore by quite a bit
> >> previously.  A lot of work has gone into implementing libaio behind
> >> the scenes and as a result performance on spinning disks with SSD WAL
> >> (and SSD backed rocksdb) has improved pretty dramatically. It's now
> often beating filestore:
> >>
> >
> > SSD DB is still better than SSD WAL with request sizes > 128KB, which indicates
> some WALs are actually written to Level0... Hmm, could we add
> newstore_wal_max_ops/bytes to cap the total WAL size (how much data
> is in the WAL but not yet applied to the backend FS)?  I suspect this would improve
> performance by preventing some IO with high WA cost and latency.
> 
> Seems like it could work, but I wish we didn't have to add a workaround.
>   It'd be nice if we could just tell rocksdb not to propagate that data.
>   I don't remember, can we use column families for this?
> 
No, column families will not help in this case; we want to use column families to enforce a different layout and policy for each kind of data.
For example, WAL items go with a large write buffer that optimizes for writes (at the cost of read amplification), and no block cache (read cache) should be there.  But onodes should go with a large block cache and fewer level0 files, which reduces read amplification.  With column families we can support this usage; a rough sketch is below.
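
To make that concrete, here is a minimal sketch (purely illustrative, not what newstore does today) of how per-column-family tuning could look with the RocksDB C++ API; the option names are real RocksDB options, but the values and the "wal_items"/"onodes" family names are just examples:

#include <rocksdb/cache.h>
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

#include <cassert>
#include <vector>

int main() {
  rocksdb::Options db_opts;
  db_opts.create_if_missing = true;
  db_opts.create_missing_column_families = true;

  // WAL-style items: big write buffer, no block cache
  // (write-optimized, accepting read amplification).
  rocksdb::ColumnFamilyOptions wal_cf;
  wal_cf.write_buffer_size = 256 * 1024 * 1024;
  rocksdb::BlockBasedTableOptions wal_table;
  wal_table.no_block_cache = true;
  wal_cf.table_factory.reset(rocksdb::NewBlockBasedTableFactory(wal_table));

  // Onodes: large block cache and an earlier L0 compaction trigger
  // (read-optimized, lower read amplification).
  rocksdb::ColumnFamilyOptions onode_cf;
  rocksdb::BlockBasedTableOptions onode_table;
  onode_table.block_cache = rocksdb::NewLRUCache(512 * 1024 * 1024);
  onode_cf.table_factory.reset(rocksdb::NewBlockBasedTableFactory(onode_table));
  onode_cf.level0_file_num_compaction_trigger = 2;

  std::vector<rocksdb::ColumnFamilyDescriptor> cfs = {
    {rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()},
    {"wal_items", wal_cf},
    {"onodes",    onode_cf},
  };

  std::vector<rocksdb::ColumnFamilyHandle*> handles;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s =
      rocksdb::DB::Open(db_opts, "/tmp/cf-demo", cfs, &handles, &db);
  assert(s.ok());

  // Writes then name the target family explicitly, e.g.:
  //   db->Put(rocksdb::WriteOptions(), handles[1], "wal_key", "wal_data");

  for (auto* h : handles) delete h;
  delete db;
  return 0;
}
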
> >
> >> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >>
> >> On the other hand, sequential writes are slower than random writes
> >> when the OSD, DB, and WAL are all on the same device be it a spinning
> disk or SSD.
> >
> > I think sequential writes being slower than random is by design in newstore,
> because for every object we can only have one WAL; that means no
> concurrent IO if req_size * QD < 4MB (with 4MB objects, a full queue of
> sequential requests then falls within a single object and serializes on its one WAL).
> Not sure what QD you had in the test? I suspect 64, since there is a boost in
> seq write performance with req sizes > 64KB (64KB * 64 = 4MB).
> 
> You nailed it, 64.
> 
> >
> > In this case, the IO pattern will be: 1 write to the DB WAL -> sync -> 1 write to the FS ->
> sync.  We do everything synchronously, which is essentially expensive.
> 
> Will you be on the performance call this morning?  Perhaps we can talk about
> it more there?

Will be there, see you then.
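
Coming back to the WAL cap idea above, roughly what I have in mind is something like the sketch below. This is only a sketch: newstore_wal_max_ops/newstore_wal_max_bytes do not exist yet, and the class and names here are made up for illustration. The point is just to block new WAL submissions while too much data sits in the KV WAL without having been applied to the backend FS.

#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical throttle on outstanding WAL work: ops queued in the KV WAL
// that have not yet been applied to the backing filesystem.
class WalThrottle {
 public:
  WalThrottle(uint64_t max_ops, uint64_t max_bytes)
      : max_ops_(max_ops), max_bytes_(max_bytes) {}

  // Call before queueing a WAL item; blocks until we are under both caps.
  void get(uint64_t bytes) {
    std::unique_lock<std::mutex> l(lock_);
    cond_.wait(l, [&] {
      return cur_ops_ < max_ops_ && cur_bytes_ + bytes <= max_bytes_;
    });
    ++cur_ops_;
    cur_bytes_ += bytes;
  }

  // Call once the WAL item has been applied to the backend FS.
  void put(uint64_t bytes) {
    std::lock_guard<std::mutex> l(lock_);
    --cur_ops_;
    cur_bytes_ -= bytes;
    cond_.notify_all();
  }

 private:
  const uint64_t max_ops_, max_bytes_;
  uint64_t cur_ops_ = 0, cur_bytes_ = 0;
  std::mutex lock_;
  std::condition_variable cond_;
};

Note that a single op larger than max_bytes would block forever with this exact check, so a real version would have to admit oversized ops anyway; this only shows the shape of the cap.
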
> 
> >
> >
> 				Xiaoxi.
> >> -----Original Message-----
> >> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
> >> owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> >> Sent: Wednesday, April 29, 2015 7:25 AM
> >> To: ceph-devel
> >> Subject: newstore performance update
> >>
> >> Hi Guys,
> >>
> >> Sage has been furiously working away at fixing bugs in newstore and
> >> improving performance.  Specifically we've been focused on write
> >> performance as newstore was lagging filestore by quite a bit
> >> previously.  A lot of work has gone into implementing libaio behind
> >> the scenes and as a result performance on spinning disks with SSD WAL
> >> (and SSD backed rocksdb) has improved pretty dramatically. It's now
> often beating filestore:
> >>
> >
> >> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >>
> >> On the other hand, sequential writes are slower than random writes
> >> when the OSD, DB, and WAL are all on the same device be it a spinning
> disk or SSD.
> >
> >> In this situation newstore does better with random writes and
> >> sometimes beats filestore (such as in the everything-on-spinning disk
> >> tests, and when IO sizes are small in the everything-on-ssd tests).
> >>
> >> Newstore is changing daily so keep in mind that these results are
> >> almost assuredly going to change.  An interesting area of
> >> investigation will be why sequential writes are slower than random
> >> writes, and whether or not we are being limited by rocksdb ingest speed
> and how.
> >
> >>
> >> I've also uploaded a quick perf call-graph I grabbed during the
> >> "all-SSD" 32KB sequential write test to see if rocksdb was starving
> >> one of the cores, but found something that looks quite a bit different:
> >>
> >> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >>
> >> Mark




