Re: newstore performance update

On 04/30/2015 09:02 AM, Chen, Xiaoxi wrote:
I am not sure I really understand the OSD code, but from the OSD log, in the sequential small write case there is only one in-flight op at a time…

And Mark, did you pre-allocate the RBD image before running the sequential test? I believe you did, so both seq and random are in WAL mode.

Yes, the RBD image is pre-allocated. Maybe Sage can chime in regarding the one inflight op.

Mark


---- Mark Nelson wrote ----


On 04/29/2015 11:38 AM, Sage Weil wrote:
On Wed, 29 Apr 2015, Chen, Xiaoxi wrote:
Hi Mark,
      Really good test :) I only played a bit on SSD; the parallel WAL
threads really help, but we still have a long way to go, especially in
the all-SSD case. I tried this
https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
by hacking rocksdb, but the performance difference is negligible.

It gave me a 25% bump when rocksdb is on a spinning disk, so I went ahead
and committed the change to the branch.  Probably not noticeable on the
SSD, though it can't hurt.

The rocksdb digest speed should be the problem, I believe. I was planning
to prove this by skipping all db transactions, but failed after hitting another
deadlock bug in newstore.

Will look at that next!


Below are a few more comments.
Sage has been furiously working away at fixing bugs in newstore and
improving performance.  Specifically we've been focused on write
performance as newstore was lagging filestore by quite a bit previously.  A
lot of work has gone into implementing libaio behind the scenes and as a
result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
has improved pretty dramatically. It's now often beating filestore:


SSD DB is still better than SSD WAL with request sizes > 128KB; this indicates some WALs are actually written to Level0... Hmm, could we add newstore_wal_max_ops/bytes to cap the total WAL size (how much data is in the WAL but not yet applied to the backend FS)? I suspect this would improve performance by preventing some IO with high WA cost and latency.

http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf

On the other hand, sequential writes are slower than random writes when
the OSD, DB, and WAL are all on the same device, be it a spinning disk or SSD.

I think sequential writes being slower than random is by design in newstore,
because for every object we can only have one WAL, which means no
concurrent IO if req_size * QD < 4MB. Not sure what QD you
have in the test? I suspect 64, since there is a boost in seq write
performance with req size > 64KB (64KB * 64 = 4MB).
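
To make the arithmetic above concrete, here is a minimal sketch, assuming
4MB objects (the figure used above) and a purely sequential stream of
equally sized requests. The function name and the model are illustrative
only, not newstore code: it counts how many distinct objects a window of QD
in-flight requests spans, which would bound the concurrency if each object
allows only one WAL op at a time.

```cpp
#include <cassert>
#include <cstdint>

// Assumed object size: 4MB, matching the figure quoted in the thread.
constexpr std::uint64_t kObjectSize = 4ull << 20;

// Illustrative model only: number of distinct objects touched by a window of
// `queue_depth` back-to-back sequential requests of `req_size` bytes,
// starting at byte `offset` of the image.
std::uint64_t objects_in_flight(std::uint64_t offset,
                                std::uint64_t req_size,
                                std::uint64_t queue_depth) {
    assert(req_size > 0 && queue_depth > 0);
    const std::uint64_t first = offset / kObjectSize;
    const std::uint64_t last =
        (offset + req_size * queue_depth - 1) / kObjectSize;
    return last - first + 1;
}
```

With a queue depth of 64 (a guess, not a confirmed test parameter),
objects_in_flight(0, 4096, 64) and objects_in_flight(0, 65536, 64) both
return 1, while objects_in_flight(0, 131072, 64) returns 2, so under this
model concurrency across objects only appears once req_size * QD exceeds 4MB.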

In this case, the IO pattern will be: 1 write to DB WAL -> sync -> 1 write to
FS -> sync. We do everything synchronously, which is inherently
expensive.

The number of syncs is the same for appends vs wal... in both cases we
fdatasync the file and the db commit, but with WAL the fs sync comes after
the commit point instead of before (and we don't double-write the data).
Appends should still be pipelined (many in flight for the same object)...
and the db syncs will be batched in both cases (submit_transaction for
each io, and a single thread doing the submit_transaction_sync in a loop).

If that's not the case then it's an accident?

sage
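
The batching of db syncs described above (a submit_transaction per io, and
one thread looping on the sync) can be sketched roughly as follows. This is
a single-threaded toy model under stated assumptions: BatchedCommitter,
submit() and sync_once() are hypothetical names, the sync is only counted
rather than issued, and none of this is the actual newstore or rocksdb code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the batching idea; not newstore code.
class BatchedCommitter {
public:
    // Analogue of a per-io submit: queue the transaction, no sync yet.
    void submit(std::string txn) { pending_.push_back(std::move(txn)); }

    // Analogue of one iteration of the sync loop: everything queued so far
    // becomes durable with a single (simulated) sync. Returns how many
    // transactions this sync covered; does nothing if the queue is empty.
    std::size_t sync_once() {
        if (pending_.empty()) return 0;
        const std::size_t n = pending_.size();
        committed_ += n;
        pending_.clear();
        ++syncs_;  // one sync pays for the whole batch
        return n;
    }

    std::size_t committed() const { return committed_; }
    std::size_t syncs() const { return syncs_; }

private:
    std::vector<std::string> pending_;
    std::size_t committed_ = 0;
    std::size_t syncs_ = 0;
};
```

If three ios call submit() before the loop runs, one sync_once() commits all
three at the cost of a single sync. That amortization only happens when
several ios are actually in flight; with one op at a time, every sync covers
exactly one transaction.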

So I ran some more tests last night on 2c914df7 to see if any of the new
changes made much difference for spinning disk small sequential writes,
and the short answer is no.  Since overlay now works again I also ran
tests with overlay enabled, and this may have helped marginally (and had
mixed results for random writes, may need to tweak the default).

After this I got to thinking about how the WAL-on-SSD results were so
much better that I wanted to confirm that this issue is WAL related.  I
tried setting DisableWAL. This resulted in about a 90x increase in
sequential write performance, but only a 2x increase in random write
performance.  What's more, if you look at the last graph on the pdf
linked below, you can see that sequential 4k writes with WAL enabled are
significantly slower than 4K random writes, but sequential 4K writes
with WAL disabled are significantly faster.

http://nhm.ceph.com/newstore/Newstore_DisableWAL.pdf

So I guess now I wonder what is happening that is different in each
case.  I'll probably sit down and start looking through the blktrace
data and try to get more statistics out of rocksdb for each case.  It
would be useful if we could tie the rocksdb stats call into an asok command:

DB::GetProperty("rocksdb.stats", &stats)

Mark




