Hi,

I am concerned about the impact of rocksdb on the whole bluestore IO
path, so I ran some tests with the bluestore fio plugin, with
numjobs=64 and iodepth=32. I got the following data from the log. It
seems that for every txc, most of the time is spent in the kv_queued
and kv_committing states:

state                     time span (us)
state_prepare_lat                    386
state_aio_wait_lat                   430
state_io_done_lat                      0
state_kv_queued_lat                 7926
state_kv_commiting_lat             30653
state_kv_done_lat                      4

    "state_kv_queued_lat": {
        "avgcount": 349076566,
        "sum": 1214245.959889817,
        "avgtime": 0.003478451
    },
    "state_kv_commiting_lat": {
        "avgcount": 174538283,
        "sum": 5612849.022306266,
        "avgtime": 0.032158268
    },

At the same time, each kv submit batches about 49 txcs
(174538283 / 3509556 = 49) and takes only 1024us on average, which is
much less than the 30653us commiting latency:

    "kv_lat": {
        "avgcount": 3509556,
        "sum": 3594.365142193,
        "avgtime": 0.001024165
    },

The time between state_kv_queued_lat and state_kv_commiting_lat is
accounted here:

https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L8349
https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L8366
https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L7741

I am still investigating why so much time is spent in
kv_commiting_lat, but from the data above I doubt rocksdb itself is
the problem. Please correct me if I misunderstood anything.
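To illustrate my reading of those lines, here is a much-simplified
sketch (not the actual BlueStore code; names, locking, and the
deferred-write handling are condensed away) of the kv sync thread
batching:

    // Condensed sketch of the _kv_sync_thread batching, as I read it.
    #include <condition_variable>
    #include <deque>
    #include <mutex>

    struct Txc {};  // stand-in for BlueStore::TransContext

    struct KvSyncThread {
      std::mutex lock;
      std::condition_variable cond;
      std::deque<Txc*> kv_queue;  // txcs in STATE_KV_QUEUED
      bool stop = false;

      void queue(Txc* txc) {
        std::lock_guard<std::mutex> l(lock);
        // state_kv_queued_lat accrues while the txc waits here for
        // the sync thread to pick it up.
        kv_queue.push_back(txc);
        cond.notify_one();
      }

      void run() {
        std::unique_lock<std::mutex> l(lock);
        while (!stop) {
          if (kv_queue.empty()) { cond.wait(l); continue; }
          // Swap the whole queue out: every txc queued while the
          // previous batch was committing is now committed together,
          // which matches the ~49 txcs per submit above.
          std::deque<Txc*> kv_committing;
          kv_committing.swap(kv_queue);
          l.unlock();
          // state_kv_commiting_lat covers this whole region, not just
          // the rocksdb calls: submit_transaction() per txc, one
          // synchronous commit for the batch, then waking the
          // completions. kv_lat only measures the submit calls, which
          // may be why 1024us per batch can coexist with 30653us of
          // commiting latency per txc.
          l.lock();
        }
      }
    };

If this reading is right, state_kv_commiting_lat spans the whole batch
commit and completion path, not just the rocksdb submit, which could
explain why it is so much larger than kv_lat.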
Lisa

On Thu, Jul 13, 2017 at 12:34 AM, Xiaoxi Chen <superdebuger@xxxxxxxxx> wrote:
> FWIW, one thing that a KVDB can provide is transaction support, which
> is important as we need to update several pieces of metadata (onode,
> allocator map, and the WAL for small writes) transactionally.
>
>
> 2017-07-13 0:25 GMT+08:00 攀刘 <liupan1111@xxxxxxxxx>:
>> Hi Sage,
>>
>> Indeed, I have an idea that I have held for a long time.
>>
>> Do we really need a heavy k/v database to store metadata, especially
>> for fast disks? Introducing a third-party database also makes
>> maintenance difficult (maybe because of my limited database
>> knowledge)...
>>
>> Let's suppose:
>> 1) The max number of PGs in one OSD is limited (in my experience,
>> 100~200 PGs per OSD gives the best performance).
>> 2) The max number of objects in one PG is limited, because of disk
>> space.
>>
>> Then, how about this: pre-allocate metadata locations in a metadata
>> partition.
>>
>> Partition an SSD into two or three partitions (same as bluestore),
>> and instead of using a kv database, store metadata directly in one
>> disk partition (call it the metadata partition). Inside this
>> metadata partition, we store several data structures:
>> 1) One hash table of PGs: the key is the PG id, the value is another
>> hash table (key is the object index within this PG, value is the
>> object metadata and the object's location in the data partition).
>> 2) A free object location list.
>>
>> And other extra things...
>>
>> The max number of PGs in one OSD can be limited by options, so I
>> believe the metadata partition should not be big. We could load all
>> metadata into RAM if RAM is really big, or part of it controlled by
>> an LRU, or just read, modify, and write back to disk when needed.
>>
>> Do you think this idea is reasonable? At least, I believe this kind
>> of new storage engine would be much faster.
>>
>> Thanks
>> Pan
>>
>> 2017-07-12 21:55 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
>>> On Wed, 12 Jul 2017, 攀刘 wrote:
>>>> Hi Sage,
>>>>
>>>> Yes, I totally understand that bluestore does much more than a raw
>>>> disk, but the current overhead is a little too big for our usage.
>>>> I will compare bluestore with XFS (which also has metadata
>>>> tracking, allocation, and so on) to see whether XFS shows a
>>>> similar impact.
>>>>
>>>> I will provide a flamegraph later, but from the perfcounters we
>>>> can already see that most of the time is spent in "kv_lat".
>>>
>>> That's rocksdb. And yeah, I think it's pretty clear that either
>>> rocksdb needs some serious work to really keep up with nvme (or
>>> optane), or (more likely) we need an alternate kv backend that
>>> targets high-speed flash. I suspect the latter makes the most
>>> sense, and I believe there are various efforts at Intel looking at
>>> alternatives, but no winner just yet.
>>>
>>> Looking a bit further out, I think a new kv library that natively
>>> targets persistent memory (e.g., something built on pmem.io) will
>>> be the right solution. Although at that point, it's probably a
>>> question of whether we have pmem for metadata and 3D NAND for data,
>>> or pure pmem; in the latter case a complete replacement for
>>> bluestore would make more sense.
>>>
>>>> For the FTL, yes, it is a good idea; after we get the flamegraph
>>>> we could discuss which parts could be improved by the FTL, the
>>>> firmware, or even open channel.
>>>
>>> Yep!
>>> sage
>>>
>>>>
>>>> 2017-07-12 20:02 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
>>>> > On Wed, 12 Jul 2017, 攀刘 wrote:
>>>> >> Hi Cephers,
>>>> >>
>>>> >> I did an experiment today to compare the latency of one P3500
>>>> >> (2T nvme SSD) against bluestore (fio + libfio_objectstore.so):
>>>> >>
>>>> >> For iodepth = 1, the random write latency of bluestore is
>>>> >> 276.91us, compared with 14.71us on the raw SSD -- a big
>>>> >> overhead.
>>>> >>
>>>> >> I also tested iodepth = 16; still a big overhead
>>>> >> (143us -> 642us).
>>>> >>
>>>> >> What is your opinion?
>>>> >
>>>> > There is a lot of work that bluestore is doing over the raw
>>>> > device as it is implementing all of the metadata tracking,
>>>> > checksumming, allocation, and so on. There's definitely lots of
>>>> > room for improvement, but I'm not sure you can expect to see
>>>> > latencies in the 10s of us. That said, it would be interesting
>>>> > to see an updated flamegraph to see where the time is being
>>>> > spent and where we can slim this down. On a new nvme it's
>>>> > possible we can do away with some of the complexity of, say, the
>>>> > allocator, since the FTL is performing a lot of the same work
>>>> > anyway.
>>>> >
>>>> > sage

--
Best wishes
Lisa
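P.S. To make sure I understand Pan's proposal above, here is a rough
sketch of the layout as I read it (hypothetical names, in-memory view
only, not working Ceph code):

    // Rough sketch of a preallocated metadata partition: a hash table
    // keyed by PG id, each entry holding a hash table of per-object
    // metadata, plus a free-location list for the data partition.
    #include <cstdint>
    #include <list>
    #include <unordered_map>

    struct ObjectMeta {
      uint64_t data_offset;  // object's location in the data partition
      uint64_t length;       // hypothetical extra metadata field
    };

    struct MetadataPartition {
      // PG id -> (object index within the PG -> metadata)
      std::unordered_map<uint64_t,
          std::unordered_map<uint64_t, ObjectMeta>> pgs;
      // Preallocated, currently unused locations in the data partition.
      std::list<uint64_t> free_locations;
    };

Since the max number of PGs per OSD and of objects per PG are both
bounded, the whole structure has a bounded on-disk footprint, so it
could be read (or part of it, under an LRU) at startup instead of
going through a kv database. The hard part, as Xiaoxi notes, would be
making updates to it transactional.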