Re: Questions for NVRAM+SATA SSDs with Bluestore

Hi Ning,

The metadata issues are well known; see the recent threads and PRs regarding reducing the onode size. I think this issue is two-pronged. We need to do everything we can to reduce the size of the onode (and the recent PRs should help significantly). Beyond that, rocksdb itself tends to have high write (and read) amplification. We can probably improve things to a certain extent with tuning, but an alternate option is to use SanDisk's zetascale kvstore. They are planning on submitting a PR for their kvstore code soonish. Zetascale itself is already available and is open source:

https://github.com/SanDisk-Open-Source/zetascale

Ramesh has also recently submitted a PR for an in-memory store that we can test against to see how rocksdb and zetascale are doing. Perhaps some day we could do something similar for nvdimm storage.
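
For reference, a minimal ceph.conf sketch of switching the bluestore key/value backend, e.g. to benchmark against the in-memory store once that PR lands ("rocksdb" is the current default; the "memdb" value is an assumption about how the new backend will be registered):

    [osd]
    # default key/value backend
    bluestore_kvbackend = rocksdb
    # hypothetical value for the in-memory store from the recent PR;
    # useful only for benchmarking, since the data is not persisted
    #bluestore_kvbackend = memdb

Swapping the backend this way would let the same workload be replayed against rocksdb, the in-memory store, and (eventually) zetascale to see how much of the metadata overhead comes from the kvstore itself.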

Mark

On 07/02/2016 03:18 PM, Ning Yao wrote:
Hi, Sage
I have tried bluestore_min_alloc_size = 4096 so that updated writes are
always reallocated into new extents. That avoids the double write in
theory, but with a high-speed device like NVMe there is still a
performance problem with metadata updates, and the bottleneck is clearly
in rocksdb. I think compaction and data organization in rocksdb have a
large impact. There may be a lot of work to do on rocksdb and bluefs,
such as using different compaction strategies or fewer levels in
rocksdb. Are there any guidelines for this, and what are the future
directions for the current bluestore performance issues?
Regards
Ning Yao
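
For illustration, rocksdb behavior under bluestore can be experimented with through the bluestore_rocksdb_options string in ceph.conf; the values below are examples to show the mechanism, not tuned recommendations:

    [osd]
    # illustrative only: no compression, fewer levels, larger write
    # buffers, universal compaction
    bluestore_rocksdb_options = compression=kNoCompression,num_levels=4,write_buffer_size=268435456,max_write_buffer_number=4,compaction_style=kCompactionStyleUniversal

The string is handed to rocksdb's option parser, so compaction style, level count, and buffer sizing can all be tried without rebuilding ceph.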


2016-06-27 20:31 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
On Mon, 27 Jun 2016, myoungwon oh wrote:
Hi, I have questions about bluestore (the 4K random write case).

So far, we have used NVRAM (PCIe) as the journal and SATA SSDs as the
data disks (filestore), so we have gotten a performance gain from the
NVRAM journal. However, in the current Bluestore design, data (4K
aligned) appears to be written to the data disk first, and only then is
the metadata written to the rocksdb WAL. This design can remove the
"double write" in the objectstore, but in our case the NVRAM cannot be
fully utilized.

So, my questions are:

1. Can bluestore write the WAL first, as filestore does?

You can do it indirectly with bluestore_min_alloc_size=65536, which will
send anything smaller than this value through the wal path.  Please let
us know what effect this has on your latency/performance!
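
In ceph.conf terms, that suggestion would look something like this (a sketch; the value just mirrors the one above):

    [osd]
    # send writes smaller than 64K through the rocksdb wal path
    bluestore_min_alloc_size = 65536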

2. If not, is using bcache or flashcache to put NVRAM on top of the SSDs
the right answer?

This is also possible, but I expect we'd like to make this work out of the
box if we can!
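
As a sketch of the bcache route (device names are assumptions: an NVRAM partition as the cache device, a SATA SSD as the backing device, with the OSD then deployed on the resulting /dev/bcache0):

    # create the cache (NVRAM) and backing (SATA SSD) devices
    make-bcache -C /dev/nvme0n1p1
    make-bcache -B /dev/sdb
    # attach the backing device to the cache set and use writeback mode
    echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach
    echo writeback > /sys/block/bcache0/bcache/cache_mode

flashcache could be set up analogously with its flashcache_create utility.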

sage



