Hi Mark,

Really good test :) I only played a bit on SSD. The parallel WAL threads really help, but we still have a long way to go, especially in the all-SSD case. I tried this https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515 by hacking rocksdb, but the performance difference was negligible. I believe the rocksdb digest speed is the problem; I had planned to prove this by skipping all DB transactions, but failed after hitting another deadlock bug in newstore.

A few more comments below.

> Sage has been furiously working away at fixing bugs in newstore and
> improving performance. Specifically we've been focused on write
> performance as newstore was lagging filestore by quite a bit previously. A
> lot of work has gone into implementing libaio behind the scenes and as a
> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
> has improved pretty dramatically. It's now often beating filestore:

SSD DB is still better than SSD WAL for request sizes above 128KB, which indicates some WAL writes are actually landing in Level 0... Hmm, could we add newstore_wal_max_ops/bytes options to cap the total WAL size (how much data is in the WAL but not yet applied to the backend FS)? I suspect this would improve performance by preventing some IO with a high write-amplification cost and latency. (A rough sketch of such a throttle is at the very end of this mail.)

> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>
> On the other hand, sequential writes are slower than random writes when
> the OSD, DB, and WAL are all on the same device, be it a spinning disk or SSD.

I think sequential writes being slower than random writes is by design in newstore: for every object we can have only one WAL apply in flight, which means no concurrent IO whenever req_size * QD < 4MB. What QD did you use in the test? I suspect 64, since there is a boost in sequential write performance once the request size exceeds 64KB (64KB * 64 = 4MB). Below that point, the IO pattern per object is: 1 write to the DB WAL -> sync -> 1 write to the FS -> sync. We do everything synchronously, which is essentially expensive.
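To make that arithmetic concrete, here is a tiny back-of-the-envelope model (my own sketch; the 4MB object size and the QD=64 are assumptions, not measurements) of how many objects a sequential workload can keep busy when each object allows only one in-flight WAL apply:

#include <cstdio>

int main() {
  const long object_size = 4L << 20;   // 4MB RADOS object (assumed)
  const long qd = 64;                  // assumed client queue depth
  for (long req = 4L << 10; req <= 256L << 10; req <<= 1) {
    long inflight = qd * req;          // bytes in flight at this req size
    // sequential IO lands on consecutive offsets, so it spans this many
    // objects; each object serializes behind its single WAL apply
    long busy_objects = (inflight + object_size - 1) / object_size;
    std::printf("req=%4ldKB  inflight=%6ldKB  concurrent objects=%ld\n",
                req >> 10, inflight >> 10, busy_objects);
  }
  return 0;
}

Up to 64KB requests the whole queue fits inside one 4MB object, so sequential IO is effectively QD=1, while random writes at the same QD hit ~64 distinct objects and never queue behind a single object's WAL.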
Xiaoxi.

> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
> owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> Sent: Wednesday, April 29, 2015 7:25 AM
> To: ceph-devel
> Subject: newstore performance update
>
> Hi Guys,
>
> Sage has been furiously working away at fixing bugs in newstore and
> improving performance. Specifically we've been focused on write
> performance as newstore was lagging filestore by quite a bit previously. A
> lot of work has gone into implementing libaio behind the scenes and as a
> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
> has improved pretty dramatically. It's now often beating filestore:
>
> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>
> On the other hand, sequential writes are slower than random writes when
> the OSD, DB, and WAL are all on the same device, be it a spinning disk or SSD.
> In this situation newstore does better with random writes and sometimes
> beats filestore (such as in the everything-on-spinning-disk tests, and when IO
> sizes are small in the everything-on-ssd tests).
>
> Newstore is changing daily, so keep in mind that these results are almost
> assuredly going to change. An interesting area of investigation will be why
> sequential writes are slower than random writes, and whether or not we are
> being limited by rocksdb ingest speed, and how.
>
> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD" 32KB
> sequential write test to see if rocksdb was starving one of the cores, but
> found something that looks quite a bit different:
>
> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>
> Mark
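P.S. To make the WAL capping idea above concrete, here is a minimal sketch of the kind of throttle I have in mind. The option names newstore_wal_max_ops/bytes and the class itself are purely illustrative, not actual newstore code:

#include <condition_variable>
#include <cstdint>
#include <mutex>

class WALThrottle {
  std::mutex m;
  std::condition_variable cv;
  uint64_t ops = 0, bytes = 0;        // WAL backlog not yet applied to the FS
  const uint64_t max_ops, max_bytes;  // newstore_wal_max_ops/bytes (proposed)

public:
  WALThrottle(uint64_t mo, uint64_t mb) : max_ops(mo), max_bytes(mb) {}

  // call before queueing a WAL entry; blocks while the backlog is at the cap
  void get(uint64_t len) {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [&] { return ops < max_ops && bytes < max_bytes; });
    ++ops;
    bytes += len;
  }

  // call after the entry has been applied to the backend FS and trimmed
  void put(uint64_t len) {
    std::lock_guard<std::mutex> l(m);
    --ops;
    bytes -= len;
    cv.notify_all();
  }
};

The hope is that submitters block in get() before the un-applied backlog grows large enough for rocksdb to start pushing WAL entries into L0, which is exactly the high-write-amplification IO I suspect above.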