Re: newstore performance update

On 04/29/2015 03:33 AM, Chen, Xiaoxi wrote:
Hi Mark,
	Really good test :) I have only played a bit on SSD; the parallel WAL threads really help, but we still have a long way to go, especially in the all-SSD case.
I tried this (https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515) by hacking rocksdb, but the performance difference was negligible.
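
(Not rocksdb's or newstore's actual code, just a minimal sketch of what "parallel WAL threads" boils down to: several workers draining a queue of WAL-apply ops instead of a single thread doing them back to back. All names here are made up for illustration.)

#include <thread>
#include <mutex>
#include <condition_variable>
#include <deque>
#include <functional>
#include <vector>

class WalApplyPool {
  std::deque<std::function<void()>> q;
  std::mutex m;
  std::condition_variable cv;
  std::vector<std::thread> workers;
  bool stopping = false;
public:
  explicit WalApplyPool(int nthreads) {
    for (int i = 0; i < nthreads; ++i)
      workers.emplace_back([this] {
        for (;;) {
          std::function<void()> op;
          {
            std::unique_lock<std::mutex> l(m);
            cv.wait(l, [this] { return stopping || !q.empty(); });
            if (q.empty())
              return;                 // stopping and fully drained
            op = std::move(q.front());
            q.pop_front();
          }
          op();                       // e.g. apply one WAL record to the backend FS
        }
      });
  }
  void queue(std::function<void()> op) {
    {
      std::lock_guard<std::mutex> l(m);
      q.push_back(std::move(op));
    }
    cv.notify_one();
  }
  ~WalApplyPool() {
    {
      std::lock_guard<std::mutex> l(m);
      stopping = true;
    }
    cv.notify_all();
    for (auto &t : workers)
      t.join();
  }
};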

I believe the rocksdb ingest speed is the problem. I had planned to prove this by skipping all DB transactions, but failed after hitting another deadlock bug in newstore.

I think Sage has worked through all of the deadlock bugs I was seeing, short of possibly something going on with the overlay code. That probably shouldn't matter on SSD though, as it's probably best to leave overlay off.


Below are a bit more comments.
Sage has been furiously working away at fixing bugs in newstore and
improving performance.  Specifically, we've been focused on write
performance, as newstore was lagging filestore by quite a bit previously.  A
lot of work has gone into implementing libaio behind the scenes and as a
result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
has improved pretty dramatically. It's now often beating filestore:


SSD DB is still better than SSD WAL with request size > 128KB; this indicates some WALs are actually being written to Level0... Hmm, could we add newstore_wal_max_ops/bytes to cap the total WAL size (how much data is in the WAL but not yet applied to the backend FS)?  I suspect this would improve performance by preventing some IO with high WA (write amplification) cost and latency.

Seems like it could work, but I wish we didn't have to add a workaround. It'd be nice if we could just tell rocksdb not to propagate that data. I don't remember, can we use column families for this?
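
For concreteness, a minimal sketch of the cap being proposed (the names mirror the suggested newstore_wal_max_ops/bytes options; the rest is hypothetical, not newstore code): new WAL submissions block once too much un-applied WAL is outstanding, and waiters are released as records get applied to the backend FS.

#include <cstdint>
#include <mutex>
#include <condition_variable>

class WalThrottle {
  std::mutex m;
  std::condition_variable cv;
  uint64_t ops = 0, bytes = 0;
  const uint64_t max_ops;    // e.g. newstore_wal_max_ops
  const uint64_t max_bytes;  // e.g. newstore_wal_max_bytes
public:
  WalThrottle(uint64_t mo, uint64_t mb) : max_ops(mo), max_bytes(mb) {}

  // called before queueing a WAL record; blocks while we are over the cap
  void start_op(uint64_t len) {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [this] { return ops < max_ops && bytes < max_bytes; });
    ++ops;
    bytes += len;
  }

  // called once the record has been applied to the backend FS and trimmed
  void finish_op(uint64_t len) {
    std::lock_guard<std::mutex> l(m);
    --ops;
    bytes -= len;
    cv.notify_all();
  }
};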


http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf

On the other hand, sequential writes are slower than random writes when
the OSD, DB, and WAL are all on the same device, be it a spinning disk or SSD.

I think sequential writes being slower than random writes is by design in newstore, because for every object we can only have one WAL; that means no concurrent IO if req_size * QD < 4MB. How many QD did you have in the test? I suspect 64, since there is a boost in seq write performance with request sizes > 64KB (64KB * 64 = 4MB).

You nailed it, 64.


In this case, the IO pattern will be: 1 write to DB WAL -> sync -> 1 write to FS -> sync. We do everything synchronously, which is inherently expensive.
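
(Spelled out as a sketch, not newstore's actual code, just the shape of the path described above; the fds and offsets are hypothetical and fdatasync stands in for whichever sync the store actually issues.)

#include <unistd.h>
#include <sys/types.h>
#include <cstddef>

// One write request against one object: the DB WAL write, its sync, the FS
// write, and its sync happen back to back. With a single outstanding WAL per
// object, sequential requests that land in the same 4MB object all serialize
// behind this path, which is why req_size * QD < 4MB leaves no concurrency.
void write_one_request(int db_wal_fd, off_t wal_off,
                       int fs_fd, off_t obj_off,
                       const char *buf, size_t len) {
  pwrite(db_wal_fd, buf, len, wal_off);   // 1 write to DB WAL
  fdatasync(db_wal_fd);                   // sync
  pwrite(fs_fd, buf, len, obj_off);       // 1 write to FS
  fdatasync(fs_fd);                       // sync
}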

Will you be on the performance call this morning? Perhaps we can talk about it more there?


													Xiaoxi.
-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
Sent: Wednesday, April 29, 2015 7:25 AM
To: ceph-devel
Subject: newstore performance update

Hi Guys,

Sage has been furiously working away at fixing bugs in newstore and
improving performance.  Specifically, we've been focused on write
performance, as newstore was lagging filestore by quite a bit previously.  A
lot of work has gone into implementing libaio behind the scenes and as a
result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
has improved pretty dramatically. It's now often beating filestore:


http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
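
To unpack "implementing libaio behind the scenes" a bit: data writes are submitted through the kernel AIO interface rather than blocking in write(), and the completions are reaped later. Below is a minimal, standalone sketch of that pattern (not newstore's actual code; the file name and sizes are made up; build with -laio):

#include <libaio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
  io_context_t ctx = 0;
  if (io_setup(128, &ctx) < 0) { perror("io_setup"); return 1; }

  // hypothetical target file; newstore would be writing its own data files
  int fd = open("aio_demo.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
  if (fd < 0) { perror("open"); return 1; }

  // O_DIRECT wants an aligned buffer
  void *buf = nullptr;
  if (posix_memalign(&buf, 4096, 65536)) return 1;
  memset(buf, 'x', 65536);

  struct iocb cb;
  struct iocb *cbs[1] = { &cb };
  io_prep_pwrite(&cb, fd, buf, 65536, 0);

  // submit and return immediately; the caller can go prepare the next txn
  if (io_submit(ctx, 1, cbs) != 1) { perror("io_submit"); return 1; }

  // reap the completion later (typically from a dedicated thread)
  struct io_event ev;
  io_getevents(ctx, 1, 1, &ev, nullptr);

  close(fd);
  io_destroy(ctx);
  free(buf);
  return 0;
}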

On the other hand, sequential writes are slower than random writes when
the OSD, DB, and WAL are all on the same device, be it a spinning disk or SSD.

In this situation, newstore does better with random writes and sometimes
beats filestore (such as in the everything-on-spinning-disk tests, and when IO
sizes are small in the everything-on-SSD tests).

Newstore is changing daily, so keep in mind that these results are almost
assuredly going to change.  An interesting area of investigation will be why
sequential writes are slower than random writes, and whether (and how) we are
being limited by rocksdb ingest speed.


I've also uploaded a quick perf call-graph that I grabbed during the "all-SSD" 32KB
sequential write test.  I wanted to see if rocksdb was starving one of the cores, but
found something that looks quite a bit different:

http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf

Mark
