In the state_kv_commit stage, db->submit_transaction() is called and all of the
RocksDB key-insertion logic runs there; as the gdbprof output shows, that is
where the key comparisons and lookups happen. But because submit_transaction()
is issued with sync=false, the changes only reach the RocksDB WAL buffer in
memory and are not yet persisted to disk.

The *submit* you refer to just submits an empty transaction with sync=true,
which flushes all of the previously buffered WAL to disk.

So kv_commit is clearly CPU intensive, while kv_submit is (sequential) IO
intensive. Depending on the CPU/disk speed ratio, one may see different
profiling results. My earlier test on an HDD showed the opposite: kv_lat was
quite long.
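To make the two stages concrete, here is a minimal standalone sketch of that
pattern against the raw RocksDB C++ API. This is not the actual BlueStore
code; the key names, the 49-transaction batch size, and the final sync marker
are illustrative assumptions:

    #include <rocksdb/db.h>
    #include <rocksdb/write_batch.h>
    #include <cassert>
    #include <string>

    int main() {
      rocksdb::DB* db = nullptr;
      rocksdb::Options opts;
      opts.create_if_missing = true;
      rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/kv_sync_demo", &db);
      assert(s.ok());

      // "kv_commit": submit each transaction with sync=false.  RocksDB does
      // the key comparisons and memtable inserts here and appends to the
      // in-memory WAL buffer, but nothing is guaranteed to be on disk yet.
      rocksdb::WriteOptions async_opts;
      async_opts.sync = false;
      for (int i = 0; i < 49; ++i) {   // ~49 txcs per batch, as in the numbers below
        rocksdb::WriteBatch batch;
        batch.Put("onode-" + std::to_string(i), "metadata...");
        s = db->Write(async_opts, &batch);
        assert(s.ok());
      }

      // "kv submit": one more (nearly empty) write with sync=true, which
      // forces everything buffered in the WAL so far out to disk.
      rocksdb::WriteOptions sync_opts;
      sync_opts.sync = true;
      rocksdb::WriteBatch flush_batch;
      flush_batch.Put("sync-marker", "");  // dummy entry so the sync write is not a no-op
      s = db->Write(sync_opts, &flush_batch);
      assert(s.ok());

      delete db;
      return 0;
    }

The first loop is where gdbprof sees the key comparison and lookup work
(kv_commit, CPU bound); the final sync write is the sequential WAL flush that
kv_lat measures (IO bound).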
2017-07-14 16:29 GMT+08:00 xiaoyan li <wisher2003@xxxxxxxxx>:
> Here is the output of gdbprof: @Mark please have a look.
> I copied _kv_sync_thread and _kv_finalize_thread here.
> http://paste.openstack.org/show/615362/
>
> Lisa
>
> On Fri, Jul 14, 2017 at 9:54 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>> Hi Li,
>>
>> You may want to try my wallclock profiler to see where time is being spent
>> during your test. It is located here:
>>
>> https://github.com/markhpc/gdbprof
>>
>> You can run it like:
>>
>> sudo gdb -ex 'set pagination off' -ex 'attach <pid>' -ex 'source
>> /home/ubuntu/src/markhpc/gdbprof/gdbprof.py' -ex 'profile begin' -ex 'quit'
>>
>> Mark
>>
>> On 07/13/2017 08:47 PM, xiaoyan li wrote:
>>>
>>> Hi,
>>> I am concerned about the rocksdb impact on the whole bluestore IO path.
>>> I did some tests with the bluestore fio plugin.
>>> For example, I got the following data from the log when I ran a bluestore
>>> fio test with numjobs=64 and iodepth=32. It seems that for every txc,
>>> most of the time is spent in the queued and committing states.
>>>
>>> state                     time span (us)
>>> state_prepare_lat         386
>>> state_aio_wait_lat        430
>>> state_io_done_lat         0
>>> state_kv_queued_lat       7926
>>> state_kv_commiting_lat    30653
>>> state_kv_done_lat         4
>>>
>>> "state_kv_queued_lat": {
>>>     "avgcount": 349076566,
>>>     "sum": 1214245.959889817,
>>>     "avgtime": 0.003478451
>>> },
>>> "state_kv_commiting_lat": {
>>>     "avgcount": 174538283,
>>>     "sum": 5612849.022306266,
>>>     "avgtime": 0.032158268
>>> },
>>>
>>> At the same time, each submit covers 174538283/3509556 = 49 txcs and
>>> takes only 1024us, which is much less than the 30653us commiting_lat.
>>>
>>> "kv_lat": {
>>>     "avgcount": 3509556,
>>>     "sum": 3594.365142193,
>>>     "avgtime": 0.001024165
>>> },
>>>
>>> The time between state_kv_queued_lat and state_kv_commiting_lat:
>>>
>>> https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L8349
>>> https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L8366
>>> https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L7741
>>>
>>> I am still investigating why kv_commiting_lat takes so long, but from
>>> the data above I doubt it is a problem in rocksdb.
>>> Please correct me if I misunderstood anything.
>>>
>>> Lisa
>>>
>>> On Thu, Jul 13, 2017 at 12:34 AM, Xiaoxi Chen <superdebuger@xxxxxxxxx> wrote:
>>>>
>>>> FWIW, one thing that a KVDB can provide is transaction support, which is
>>>> important as we need to update several pieces of metadata (onode,
>>>> allocator map, and WAL for small writes) transactionally.
>>>>
>>>> 2017-07-13 0:25 GMT+08:00 攀刘 <liupan1111@xxxxxxxxx>:
>>>>>
>>>>> Hi Sage,
>>>>>
>>>>> Indeed, I have an idea that I have held for a long time.
>>>>>
>>>>> Do we really need a heavy k/v database to store metadata, especially
>>>>> for fast disks? Introducing a third-party database also makes
>>>>> maintenance more difficult (maybe because of my limited database
>>>>> knowledge)...
>>>>>
>>>>> Let's suppose:
>>>>> 1) The max PG number on one OSD is limited (in my experience, 100~200
>>>>> PGs per OSD gives the best performance).
>>>>> 2) The max number of objects in one PG is limited, because of disk
>>>>> space.
>>>>>
>>>>> Then, how about this: pre-allocate metadata locations in a metadata
>>>>> partition.
>>>>>
>>>>> Split an SSD into two or three partitions (same as bluestore), but
>>>>> instead of using a kv database, store metadata directly in one disk
>>>>> partition (call it the metadata partition). Inside this metadata
>>>>> partition, we store several data structures:
>>>>> 1) A hash table of PGs: the key is the PG id, and the value is another
>>>>> hash table (whose key is the object index within the PG, and whose
>>>>> value is the object metadata and the object's location in the data
>>>>> partition).
>>>>> 2) A free object location list.
>>>>>
>>>>> And other extra things...
>>>>>
>>>>> The max number of PGs belonging to one OSD can be limited by options,
>>>>> so I believe the metadata partition does not need to be big. We could
>>>>> load all of the metadata into RAM if RAM is big enough, keep part of
>>>>> it under LRU control, or just read, modify, and write it back to disk
>>>>> when needed.
>>>>>
>>>>> Do you think this idea is reasonable? At least, I believe this kind of
>>>>> new storage engine would be much faster.
>>>>>
>>>>> Thanks
>>>>> Pan
>>>>>
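For what it's worth, a rough sketch of the in-memory side of the layout Pan
describes above might look like this (all names, types, and sizes here are
made up for illustration; this is not an actual Ceph design, and it glosses
over persistence and crash consistency entirely):

    #include <cstdint>
    #include <iostream>
    #include <list>
    #include <string>
    #include <unordered_map>

    // Per-object metadata plus the object's location in the data partition.
    struct ObjectMeta {
      uint64_t data_offset;
      uint64_t data_length;
    };

    // Inner hash table: object index within the PG -> metadata/location.
    using PGObjectTable = std::unordered_map<std::string, ObjectMeta>;

    // The metadata partition, pre-sized because both the PG count per OSD
    // (~100-200) and the object count per PG are bounded.
    struct MetadataPartition {
      std::unordered_map<uint64_t, PGObjectTable> pgs;  // 1) hash table of PGs, keyed by PG id
      std::list<uint64_t> free_locations;               // 2) free object location list
    };

    int main() {
      MetadataPartition meta;
      meta.free_locations.push_back(4096);              // one free slot in the data partition

      // Allocate a location for a new object in PG 7 and record its metadata.
      uint64_t loc = meta.free_locations.front();
      meta.free_locations.pop_front();
      meta.pgs[7]["rbd_data.1234.0000"] = ObjectMeta{loc, 4096};

      std::cout << "stored at offset "
                << meta.pgs[7]["rbd_data.1234.0000"].data_offset << "\n";
      return 0;
    }

The appeal is that a metadata lookup becomes a plain hash probe instead of a
trip through a full KV stack; the cost is that the transactional updates
Xiaoxi mentions above (onode, allocator map, WAL) have to be solved by hand.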
>>>>> 2017-07-12 21:55 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
>>>>>>
>>>>>> On Wed, 12 Jul 2017, 攀刘 wrote:
>>>>>>>
>>>>>>> Hi Sage,
>>>>>>>
>>>>>>> Yes, I totally understand that bluestore does much more than a raw
>>>>>>> disk, but the current overhead is a little too big for our usage. I
>>>>>>> will compare bluestore with XFS (which also does metadata tracking,
>>>>>>> allocation, and so on) to see whether XFS has a similar impact.
>>>>>>>
>>>>>>> I would like to provide a flamegraph later, but from the perf
>>>>>>> counters we can see that most of the time was spent in "kv_lat".
>>>>>>
>>>>>> That's rocksdb. And yeah, I think it's pretty clear that either
>>>>>> rocksdb needs some serious work to really keep up with nvme (or
>>>>>> optane) or (more likely) we need an alternate kv backend that is
>>>>>> targeting high-speed flash. I suspect the latter makes the most
>>>>>> sense, and I believe there are various efforts at Intel looking at
>>>>>> alternatives but no winner just yet.
>>>>>>
>>>>>> Looking a bit further out, I think a new kv library that natively
>>>>>> targets persistent memory (e.g., something built on pmem.io) will be
>>>>>> the right solution. Although at that point, it's probably a question
>>>>>> of whether we have pmem for metadata and 3D NAND for data or pure
>>>>>> pmem; in the latter case a complete replacement for bluestore would
>>>>>> make more sense.
>>>>>>
>>>>>>> For FTL, yes, it is a good idea; after we get the flame graph, we
>>>>>>> could discuss which parts could be improved by FTL, firmware, or
>>>>>>> even open channel.
>>>>>>
>>>>>> Yep!
>>>>>> sage
>>>>>>
>>>>>>> 2017-07-12 20:02 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
>>>>>>>>
>>>>>>>> On Wed, 12 Jul 2017, 攀刘 wrote:
>>>>>>>>>
>>>>>>>>> Hi Cephers,
>>>>>>>>>
>>>>>>>>> I did some experiments today to compare the latency between a
>>>>>>>>> P3500 (2T nvme SSD) and bluestore (fio + libfio_objectstore.so):
>>>>>>>>>
>>>>>>>>> For iodepth = 1, the random write latency of bluestore is
>>>>>>>>> 276.91us, compared with 14.71us for the raw SSD, a big overhead.
>>>>>>>>>
>>>>>>>>> I also tested iodepth = 16; still, there is a big overhead
>>>>>>>>> (143us -> 642us).
>>>>>>>>>
>>>>>>>>> What is your opinion?
>>>>>>>>
>>>>>>>> There is a lot of work that bluestore is doing over the raw device
>>>>>>>> as it is implementing all of the metadata tracking, checksumming,
>>>>>>>> allocation, and so on. There's definitely lots of room for
>>>>>>>> improvement, but I'm not sure you can expect to see latencies in
>>>>>>>> the 10s of us. That said, it would be interesting to see an updated
>>>>>>>> flamegraph to see where the time is being spent and where we can
>>>>>>>> slim this down. On a new nvme it's possible we can do away with
>>>>>>>> some of the complexity of, say, the allocator, since the FTL is
>>>>>>>> performing a lot of the same work anyway.
>>>>>>>>
>>>>>>>> sage
>>>>>>>
>
> --
> Best wishes
> Lisa