FWIW, one thing that a KVDB can provide is transaction support, which is
important because we need to update several pieces of metadata (the onode,
the allocator map, and the WAL for small writes) transactionally.
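For illustration, here is a minimal sketch of the kind of atomic multi-key
commit a KV backend such as RocksDB provides; the keys, values, and store
path below are placeholders, not the real BlueStore schema:

#include <cassert>
#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/kv_txn_demo", &db);
  assert(s.ok());

  // Group the related metadata updates into one batch: either all of them
  // become durable or none of them do.
  rocksdb::WriteBatch batch;
  batch.Put("O::object_123::onode", "serialized onode");      // object metadata
  batch.Put("B::alloc_bitmap::0x10000", "allocator state");   // allocation map
  batch.Put("L::wal_seq_42", "deferred small-write payload"); // WAL entry

  rocksdb::WriteOptions wopts;
  wopts.sync = true;              // single commit point for the whole batch
  s = db->Write(wopts, &batch);
  assert(s.ok());

  delete db;
  return 0;
}

Without that batch semantic, a crash between the onode update and the
allocator update could leave the on-disk metadata inconsistent, which is
something a non-KV metadata layout would have to solve some other way.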
2017-07-13 0:25 GMT+08:00 攀刘 <liupan1111@xxxxxxxxx>:
> Hi Sage,
>
> Indeed, I have an idea that I have been holding onto for a long time.
>
> Do we really need a heavy k/v database to store metadata, especially
> for fast disks? Introducing a third-party database also makes
> maintenance harder (maybe because of my limited database knowledge).
>
> Let's suppose:
> 1) The max PG number in one OSD is limited (in my experience, 100~200
> PGs per OSD gives the best performance).
> 2) The max number of objects in one PG is limited, because of disk space.
>
> Then, how about this: pre-allocate metadata locations in a metadata partition.
>
> Partition an SSD into two or three partitions (same as bluestore), but
> instead of using a kv database, store metadata directly in one disk
> partition (call it the metadata partition). Inside this metadata
> partition, we store several data structures:
> 1) One hash table of PGs: the key is the PG id, and the value is another
> hash table (key is the object index within this PG, value is the object
> metadata and the object's location in the data partition).
> 2) A free object location list.
>
> And other extra things...
>
> The max number of PGs belonging to one OSD can be limited by options, so I
> believe the metadata partition should not be big. We could load all metadata
> into RAM if RAM is really big, or part of it controlled by an LRU, or just
> read, modify, and write back to disk when needed.
>
> Do you think this idea is reasonable? At least, I believe this kind of
> new storage engine would be much faster.
>
> Thanks
> Pan
>
> 2017-07-12 21:55 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
>> On Wed, 12 Jul 2017, 攀刘 wrote:
>>> Hi Sage,
>>>
>>> Yes, I totally understand that bluestore does much more than a raw
>>> disk, but the current overhead is a little too big for our usage. I
>>> will compare bluestore with XFS (which also does metadata tracking,
>>> allocation, and so on) to see whether XFS has a similar impact.
>>>
>>> I would like to provide a flamegraph later, but from the perf counters
>>> we can already see that most of the time is spent in "kv_lat".
>>
>> That's rocksdb. And yeah, I think it's pretty clear that either rocksdb
>> needs some serious work to really keep up with nvme (or optane) or (more
>> likely) we need an alternate kv backend that is targeting high speed
>> flash. I suspect the latter makes the most sense, and I believe there are
>> various efforts at Intel looking at alternatives, but no winner just yet.
>>
>> Looking a bit further out, I think a new kv library that natively targets
>> persistent memory (e.g., something built on pmem.io) will be the right
>> solution. Although at that point, it's probably a question of whether we
>> have pmem for metadata and 3D NAND for data, or pure pmem; in the latter
>> case a complete replacement for bluestore would make more sense.
>>
>>> For FTL, yes, it is a good idea. After we get the flame graph, we
>>> could discuss which parts could be improved by the FTL, firmware, or
>>> even open channel.
>>
>> Yep!
>> sage
>>
>>> 2017-07-12 20:02 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
>>> > On Wed, 12 Jul 2017, 攀刘 wrote:
>>> >> Hi Cephers,
>>> >>
>>> >> I did some experiments today to compare the latency of one
>>> >> P3500 (2T NVMe SSD) against bluestore (fio + libfio_objectstore.so):
>>> >>
>>> >> For iodepth = 1, the random write latency of bluestore is 276.91 us,
>>> >> compared with 14.71 us for the raw SSD: a big overhead.
>>> >>
>>> >> I also tested iodepth = 16; still a big overhead (143 us -> 642 us).
>>> >>
>>> >> What is your opinion?
>>> >
>>> > There is a lot of work that bluestore is doing over the raw device as it
>>> > is implementing all of the metadata tracking, checksumming, allocation,
>>> > and so on. There's definitely lots of room for improvement, but I'm
>>> > not sure you can expect to see latencies in the 10s of us. That said, it
>>> > would be interesting to see an updated flamegraph to see where the time is
>>> > being spent and where we can slim this down. On a new nvme it's possible
>>> > we can do away with some of the complexity of, say, the allocator, since
>>> > the FTL is performing a lot of the same work anyway.
>>> >
>>> > sage
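For illustration, here is a rough C++ sketch of the in-memory view of the
pre-allocated metadata partition Pan describes above. Every type, field, and
function name below is hypothetical; none of this is existing Ceph code:

#include <cstdint>
#include <unordered_map>
#include <vector>

// Per-object metadata kept in the metadata partition.
struct ObjectMeta {
  uint64_t data_offset = 0;   // object's location in the data partition
  uint32_t data_length = 0;
  uint64_t version = 0;
};

// One hash table per PG: object index -> object metadata.
struct PGTable {
  std::unordered_map<uint32_t, ObjectMeta> objects;
};

// The whole metadata partition: a hash table of PGs plus a free-slot list.
struct MetadataPartition {
  std::unordered_map<uint64_t, PGTable> pgs;      // key: PG id
  std::vector<uint64_t> free_object_locations;    // pre-allocated free slots

  // Reserve a slot for a new object in the given PG; returns false when the
  // data partition has no free slots left.
  bool create_object(uint64_t pg_id, uint32_t obj_idx, uint32_t length) {
    if (free_object_locations.empty())
      return false;
    ObjectMeta m;
    m.data_offset = free_object_locations.back();
    m.data_length = length;
    free_object_locations.pop_back();
    pgs[pg_id].objects[obj_idx] = m;
    return true;   // the modified tables still have to be written back to SSD
  }
};

Because the number of PGs per OSD and objects per PG are both bounded, the
tables can be sized up front; the open question such a design would still
have to answer is how the write-back of these tables is made crash
consistent, which is the part that a KV database's transactions cover today.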