Hi Ning,
The metadata issues are well known; see the recent threads and PRs
regarding reducing the onode size. I think this issue is two-pronged.
We need to do everything we can to reduce the size of the onode (and the
recent PRs should help significantly). Beyond that, rocksdb itself
tends to have high write (and read) amp. We can probably improve things
to a certain extent with tuning, but an alternate option is to use
SanDisk's zetascale kvstore. They are planning on submitting a PR for
their kvstore code soonish. Zetascale itself is already available and
is open source:
https://github.com/SanDisk-Open-Source/zetascale
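On the tuning side, most of the relevant rocksdb knobs can be passed
through the bluestore_rocksdb_options string in ceph.conf. A rough sketch
of the kind of thing I mean (the values are purely illustrative starting
points, not recommendations):

  [osd]
  bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,write_buffer_size=268435456,max_background_compactions=4

Bigger write buffers and more background compaction threads mostly trade
memory and CPU for lower write amp, so it will take real testing to see
how far that gets us.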
Ramesh has also recently submitted a PR for an in-memory store that we
can test against to see how rocksdb and zetascale are doing. Perhaps
some day we could do something similar for nvdimm storage.
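If you want to run the same comparison once Ramesh's PR lands, it should
just be a matter of switching the kv backend on a test OSD, e.g. via the
bluestore_kvbackend option (I'm guessing at the value here; check the PR
for the actual backend name):

  [osd]
  bluestore_kvbackend = memdb

That should give a rough upper bound on the metadata path with the kv
store effectively taken out of the picture.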
Mark
On 07/02/2016 03:18 PM, Ning Yao wrote:
Hi, Sage
I have tried bluestore_min_alloc_size = 4096 so that overwrites are
always reallocated into new extents. That avoids the double write in
theory, but with a high-speed device like NVMe there is still a
performance issue with metadata updates, and the bottleneck is apparently
in rocksdb. I think the compaction and data organization in rocksdb
matter a lot. There may be a lot of work to do with rocksdb and bluefs,
such as trying different compaction strategies and using fewer levels in
rocksdb. So are there any guides on this, and what are the future
directions for the current bluestore performance issues?
Regards
Ning Yao
2016-06-27 20:31 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
On Mon, 27 Jun 2016, myoungwon oh wrote:
Hi, I have questions about bluestore (the 4K random write case).
So far, we have used NVRAM (PCIe) as the journal and SSD (SATA) as the
data disk (filestore), so we get a performance gain from the NVRAM
journal.
However, the current bluestore design seems to write the (4K-aligned)
data to the data disk first, and then write the metadata to the rocksdb
WAL. This design removes the “double write” of the objectstore, but in
our case the NVRAM cannot be fully utilized.
So, my questions are:
1. Can bluestore write to the WAL first, as filestore does?
You can do it indirectly with bluestore_min_alloc_size=65536, which will
send anything smaller than this value through the WAL path. Please let
us know what effect this has on your latency/performance!
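For example, something along these lines in ceph.conf on the OSD hosts
(only the 65536 value is the point here; the rest is whatever you already
have):

  [osd]
  bluestore_min_alloc_size = 65536

Set it before creating the test OSDs so it applies from the start.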
2. If not, is using bcache or flashcache to put the NVRAM on top of the
SSDs the right answer?
This is also possible, but I expect we'd like to make this work out of the
box if we can!
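If you do try the bcache route in the meantime, the setup is roughly the
following (device names are just placeholders for your NVRAM and SSD):

  # create a cache set on the NVRAM device and a backing device on the SSD
  make-bcache -C /dev/nvram0 -B /dev/sdb
  # writeback mode so small writes land on the NVRAM first
  echo writeback > /sys/block/bcache0/bcache/cache_mode

and then point the OSD at /dev/bcache0 instead of the raw SSD.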
sage