Re: latency compare between 2t NVME SSD P3500 and bluestore

On Fri, Jul 14, 2017 at 10:49 AM, Ma, Jianpeng <jianpeng.ma@xxxxxxxxx> wrote:
> "state_kv_commiting_lat - kv_lat" means the latency of the "_kv_finalize_thread" thread.
> Is this correct?
Not exactly. state_kv_commiting_lat is measured per txc, while kv_lat is
measured per _kv_sync_thread call, which handles the kv updates of all the
txcs in kv_queue_unsubmitted.

>
> Jianpeng
>
> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of xiaoyan li
> Sent: Friday, July 14, 2017 9:47 AM
> To: Xiaoxi Chen <superdebuger@xxxxxxxxx>
> Cc: 攀刘 <liupan1111@xxxxxxxxx>; Sage Weil <sage@xxxxxxxxxxxx>; Ceph Development <ceph-devel@xxxxxxxxxxxxxxx>; p.zhou@xxxxxxxxxxxxxxx; 20702390@xxxxxx
> Subject: Re: latency compare between 2t NVME SSD P3500 and bluestore
>
> Hi,
> I am concerned about the impact of rocksdb on the whole bluestore IO path, so I ran some tests with the bluestore fio plugin.
> For example, I got the following data from the log when running the bluestore fio test with numjobs=64 and iodepth=32. For every txc, most of the time is spent in the kv_queued and kv_commiting states.
> state                   time span (us)
> state_prepare_lat       386
> state_aio_wait_lat      430
> state_io_done_lat       0
> state_kv_queued_lat     7926
> state_kv_commiting_lat  30653
> state_kv_done_lat       4
>
>         "state_kv_queued_lat": {
>             "avgcount": 349076566,
>             "sum": 1214245.959889817,
>             "avgtime": 0.003478451
>         },
>         "state_kv_commiting_lat": {
>             "avgcount": 174538283,
>             "sum": 5612849.022306266,
>             "avgtime": 0.032158268
>         },
>
>
> At the same time, each kv sync submits about 50 txcs on average (174538283 / 3509556 ≈ 49.7) and takes only 1024 us, which is much less than the 30653 us commiting latency.
>         "kv_lat": {
>             "avgcount": 3509556,
>             "sum": 3594.365142193,
>             "avgtime": 0.001024165
>         },
>
> The code covering the span between state_kv_queued_lat and state_kv_commiting_lat:
> https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L8349
> https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L8366
> https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L7741
>
> I am still investigating why kv_commiting_lat takes so long, but from the above data I suspect the problem lies in rocksdb.
> Please correct me if I misunderstood anything.
>
> Lisa
>
>
> On Thu, Jul 13, 2017 at 12:34 AM, Xiaoxi Chen <superdebuger@xxxxxxxxx> wrote:
>> FWIW, one thing that a KV DB can provide is transaction support, which is
>> important since we need to update several pieces of metadata (onode,
>> allocator map, and WAL for small writes) transactionally.
>>
>>
>>
>> 2017-07-13 0:25 GMT+08:00 攀刘 <liupan1111@xxxxxxxxx>:
>>> Hi Sage,
>>>
>>> Indeed, I have had an idea for a long time.
>>>
>>> Do we really need a heavy k/v database to store metadata, especially
>>> for fast disks? Introducing a third-party database also makes
>>> maintenance harder (maybe because of my limited database
>>> knowledge)...
>>>
>>> Let's suppose:
>>> 1) The max PG number in one OSD is limited (in my experience, 100~200
>>> PGs per OSD gives the best performance).
>>> 2) The max number of objects in one PG is limited, because of disk space.
>>>
>>> Then, how about this: pre-allocate metadata locations in the metadata
>>> partition.
>>>
>>> Partition the SSD into two or three partitions (same as bluestore), but
>>> instead of using a kv database, store metadata directly in one disk
>>> partition (call it the metadata partition). Inside this metadata
>>> partition, we store several data structures:
>>> 1) A hash table of PGs: the key is the PG id, the value is another hash
>>> table (key: object index within the PG, value: object metadata and the
>>> object's location in the data partition).
>>> 2) A free-object-location list.
>>>
>>> And other extra things...
>>>
>>> The max number of PGs belonging to one OSD can be limited by options,
>>> so I believe the metadata partition should not need to be big. We could
>>> load all the metadata into RAM if RAM is big enough, keep part of it in
>>> RAM under LRU control, or just read, modify, and write back to disk when needed.
>>>
>>> Do you think this idea is reasonable? I believe this kind of
>>> new storage engine would be much faster.
>>>
>>> Thanks
>>> Pan
>>>
>>> 2017-07-12 21:55 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
>>>> On Wed, 12 Jul 2017, 攀刘 wrote:
>>>>> Hi Sage,
>>>>>
>>>>> Yes, I totally understand that bluestore does much more than a raw
>>>>> disk, but the current overhead is a little too big for our usage. I
>>>>> will compare bluestore with XFS (which also does metadata tracking,
>>>>> allocation, and so on) to see whether XFS has a similar impact.
>>>>>
>>>>> I will provide a flamegraph later, but from the perf counters we can
>>>>> see that most of the time was spent in "kv_lat".
>>>>
>>>> That's rocksdb.  And yeah, I think it's pretty clear that either
>>>> rocksdb needs some serious work to really keep up with nvme (or
>>>> optane) or (more likely) we need an alternate kv backend that is
>>>> targeting high speed flash.  I suspect the latter makes the most
>>>> sense, and I believe there are various efforts at Intel looking at
>>>> alternatives, but no winner just yet.
>>>>
>>>> Looking a bit further out, I think a new kv library that natively
>>>> targets persistent memory (e.g., something built on pmem.io) will be
>>>> the right solution.  Although at that point, it's probably a
>>>> question of whether we have pmem for metadata and 3D NAND for data
>>>> or pure pmem; in the latter case a complete replacement for bluestore would make more sense.
>>>>
>>>>> For FTL, yes, it is a good idea, after we get the flame graph, we
>>>>> could discuss which part could be improved by FTL, firmware, even
>>>>> open channel.
>>>>
>>>> Yep!
>>>> sage
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> 2017-07-12 20:02 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
>>>>> > On Wed, 12 Jul 2017, 攀刘 wrote:
>>>>> >> Hi Cephers,
>>>>> >>
>>>>> >> I did some experiments today to compare the latency of one
>>>>> >> P3500 (2T NVMe SSD) against bluestore (fio + libfio_objectstore.so):
>>>>> >>
>>>>> >> For iodepth = 1, the random write latency of bluestore is
>>>>> >> 276.91 us, compared with 14.71 us for the raw SSD: a big overhead.
>>>>> >>
>>>>> >> I also tested iodepth = 16; still, there is a big overhead
>>>>> >> (143 us -> 642 us).
>>>>> >>
>>>>> >> What is your opinion?
>>>>> >
>>>>> > There is a lot of work that bluestore is doing over the raw
>>>>> > device as it is implementing all of the metadata tracking,
>>>>> > checksumming, allocation, and so on.  There's definitely lots of
>>>>> > room for improvement, but I'm not sure you can expect to see
>>>>> > latencies in the 10s of us.  That said, it would be interesting
>>>>> > to see an updated flamegraph to see where the time is being spent
>>>>> > and where we can slim this down.  On a new nvme it's possible we
>>>>> > can do away with some of the complexity of, say, the allocator, since the FTL is performing a lot of the same work anyway.
>>>>> >
>>>>> > sage
>>>>>
>>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo
>>> info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Best wishes
> Lisa



-- 
Best wishes
Lisa



