In the state_kv_commit stage, db->submit_transaction() is called and all of the
RocksDB key-insertion logic runs there; as the gdbprof output shows, that is
where the key comparisons and lookups happen. But because submit_transaction()
is issued with sync=false, the changes only reach the RocksDB WAL buffer in
memory and are not yet persisted to disk.

The *submit* you refer to just submits an empty transaction with sync=true,
which flushes all of the previously buffered WAL to disk.

So kv_commit is clearly CPU intensive, while kv_submit is (sequential) IO
intensive. Depending on the CPU/disk speed ratio, one may see different
profiling results. My earlier test on an HDD showed the opposite: kv_lat was
quite long.
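To make the two stages concrete, here is a minimal standalone sketch of that
pattern against the raw RocksDB C++ API. This is not the actual BlueStore
code; the key names, the 49-transaction batch size, and the final sync marker
are illustrative assumptions:

    #include <rocksdb/db.h>
    #include <rocksdb/write_batch.h>
    #include <cassert>
    #include <string>

    int main() {
      rocksdb::DB* db = nullptr;
      rocksdb::Options opts;
      opts.create_if_missing = true;
      rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/kv_sync_demo", &db);
      assert(s.ok());

      // "kv_commit": submit each transaction with sync=false.  RocksDB does
      // the key comparisons and memtable inserts here and appends to the
      // in-memory WAL buffer, but nothing is guaranteed to be on disk yet.
      rocksdb::WriteOptions async_opts;
      async_opts.sync = false;
      for (int i = 0; i < 49; ++i) {   // ~49 txcs per batch, as in the numbers below
        rocksdb::WriteBatch batch;
        batch.Put("onode-" + std::to_string(i), "metadata...");
        s = db->Write(async_opts, &batch);
        assert(s.ok());
      }

      // "kv submit": one more (nearly empty) write with sync=true, which
      // forces everything buffered in the WAL so far out to disk.
      rocksdb::WriteOptions sync_opts;
      sync_opts.sync = true;
      rocksdb::WriteBatch flush_batch;
      flush_batch.Put("sync-marker", "");  // dummy entry so the sync write is not a no-op
      s = db->Write(sync_opts, &flush_batch);
      assert(s.ok());

      delete db;
      return 0;
    }

The first loop is where gdbprof sees the key comparison and lookup work
(kv_commit, CPU bound); the final sync write is the sequential WAL flush that
kv_lat measures (IO bound).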
2017-07-14 16:29 GMT+08:00 xiaoyan li <wisher2003@xxxxxxxxx>:
> Here is the output of gdbprof: @Mark please have a look.
> I copied _kv_sync_thread and _kv_finalize_thread here.
> http://paste.openstack.org/show/615362/
>
> Lisa
>
> On Fri, Jul 14, 2017 at 9:54 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>> Hi Li,
>>
>> You may want to try my wallclock profiler to see where time is being spent
>> during your test. It is located here:
>>
>> https://github.com/markhpc/gdbprof
>>
>> You can run it like:
>>
>> sudo gdb -ex 'set pagination off' -ex 'attach <pid>' -ex 'source
>> /home/ubuntu/src/markhpc/gdbprof/gdbprof.py' -ex 'profile begin' -ex 'quit'
>>
>> Mark
>>
>> On 07/13/2017 08:47 PM, xiaoyan li wrote:
>>>
>>> Hi,
>>> I am concerned about the rocksdb impact on the whole bluestore IO path.
>>> I did some tests with the bluestore fio plugin.
>>> For example, I got the following data from the log when I ran a bluestore
>>> fio test with numjobs=64 and iodepth=32. It seems that for every txc,
>>> most of the time is spent in the queued and committing states.
>>>
>>> state                     time span (us)
>>> state_prepare_lat         386
>>> state_aio_wait_lat        430
>>> state_io_done_lat         0
>>> state_kv_queued_lat       7926
>>> state_kv_commiting_lat    30653
>>> state_kv_done_lat         4
>>>
>>> "state_kv_queued_lat": {
>>>     "avgcount": 349076566,
>>>     "sum": 1214245.959889817,
>>>     "avgtime": 0.003478451
>>> },
>>> "state_kv_commiting_lat": {
>>>     "avgcount": 174538283,
>>>     "sum": 5612849.022306266,
>>>     "avgtime": 0.032158268
>>> },
>>>
>>> At the same time, each submit covers 174538283/3509556 = 49 txcs and
>>> takes only 1024us, which is much less than the 30653us commiting_lat.
>>>
>>> "kv_lat": {
>>>     "avgcount": 3509556,
>>>     "sum": 3594.365142193,
>>>     "avgtime": 0.001024165
>>> },
>>>
>>> The time between state_kv_queued_lat and state_kv_commiting_lat:
>>>
>>> https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L8349
>>> https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L8366
>>> https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L7741
>>>
>>> I am still investigating why kv_commiting_lat takes so long, but from
>>> the data above I doubt it is a problem in rocksdb.
>>> Please correct me if I misunderstood anything.
>>>
>>> Lisa
>>>
>>> On Thu, Jul 13, 2017 at 12:34 AM, Xiaoxi Chen <superdebuger@xxxxxxxxx> wrote:
>>>>
>>>> FWIW, one thing that a KVDB can provide is transaction support, which is
>>>> important as we need to update several pieces of metadata (onode,
>>>> allocator map, and WAL for small writes) transactionally.
>>>>
>>>> 2017-07-13 0:25 GMT+08:00 攀刘 <liupan1111@xxxxxxxxx>:
>>>>>
>>>>> Hi Sage,
>>>>>
>>>>> Indeed, I have an idea that I have held for a long time.
>>>>>
>>>>> Do we really need a heavy k/v database to store metadata, especially
>>>>> for fast disks? Introducing a third-party database also makes
>>>>> maintenance more difficult (maybe because of my limited database
>>>>> knowledge)...
>>>>>
>>>>> Let's suppose:
>>>>> 1) The max PG number on one OSD is limited (in my experience, 100~200
>>>>> PGs per OSD gives the best performance).
>>>>> 2) The max number of objects in one PG is limited, because of disk
>>>>> space.
>>>>>
>>>>> Then, how about this: pre-allocate metadata locations in a metadata
>>>>> partition.
>>>>>
>>>>> Split an SSD into two or three partitions (same as bluestore), but
>>>>> instead of using a kv database, store metadata directly in one disk
>>>>> partition (call it the metadata partition). Inside this metadata
>>>>> partition, we store several data structures:
>>>>> 1) A hash table of PGs: the key is the PG id, and the value is another
>>>>> hash table (whose key is the object index within the PG, and whose
>>>>> value is the object metadata and the object's location in the data
>>>>> partition).
>>>>> 2) A free object location list.
>>>>>
>>>>> And other extra things...
>>>>>
>>>>> The max number of PGs belonging to one OSD can be limited by options,
>>>>> so I believe the metadata partition does not need to be big. We could
>>>>> load all of the metadata into RAM if RAM is big enough, keep part of
>>>>> it under LRU control, or just read, modify, and write it back to disk
>>>>> when needed.
>>>>>
>>>>> Do you think this idea is reasonable? At least, I believe this kind of
>>>>> new storage engine would be much faster.
>>>>>
>>>>> Thanks
>>>>> Pan
>>>>>
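For what it's worth, a rough sketch of the in-memory side of the layout Pan
describes above might look like this (all names, types, and sizes here are
made up for illustration; this is not an actual Ceph design, and it glosses
over persistence and crash consistency entirely):

    #include <cstdint>
    #include <iostream>
    #include <list>
    #include <string>
    #include <unordered_map>

    // Per-object metadata plus the object's location in the data partition.
    struct ObjectMeta {
      uint64_t data_offset;
      uint64_t data_length;
    };

    // Inner hash table: object index within the PG -> metadata/location.
    using PGObjectTable = std::unordered_map<std::string, ObjectMeta>;

    // The metadata partition, pre-sized because both the PG count per OSD
    // (~100-200) and the object count per PG are bounded.
    struct MetadataPartition {
      std::unordered_map<uint64_t, PGObjectTable> pgs;  // 1) hash table of PGs, keyed by PG id
      std::list<uint64_t> free_locations;               // 2) free object location list
    };

    int main() {
      MetadataPartition meta;
      meta.free_locations.push_back(4096);              // one free slot in the data partition

      // Allocate a location for a new object in PG 7 and record its metadata.
      uint64_t loc = meta.free_locations.front();
      meta.free_locations.pop_front();
      meta.pgs[7]["rbd_data.1234.0000"] = ObjectMeta{loc, 4096};

      std::cout << "stored at offset "
                << meta.pgs[7]["rbd_data.1234.0000"].data_offset << "\n";
      return 0;
    }

The appeal is that a metadata lookup becomes a plain hash probe instead of a
trip through a full KV stack; the cost is that the transactional updates
Xiaoxi mentions above (onode, allocator map, WAL) have to be solved by hand.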
>>>>> 2017-07-12 21:55 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
>>>>>>
>>>>>> On Wed, 12 Jul 2017, 攀刘 wrote:
>>>>>>>
>>>>>>> Hi Sage,
>>>>>>>
>>>>>>> Yes, I totally understand that bluestore does much more than a raw
>>>>>>> disk, but the current overhead is a little too big for our usage. I
>>>>>>> will compare bluestore with XFS (which also does metadata tracking,
>>>>>>> allocation, and so on) to see whether XFS has a similar impact.
>>>>>>>
>>>>>>> I would like to provide a flamegraph later, but from the perf
>>>>>>> counters we can see that most of the time was spent in "kv_lat".
>>>>>>
>>>>>> That's rocksdb. And yeah, I think it's pretty clear that either
>>>>>> rocksdb needs some serious work to really keep up with nvme (or
>>>>>> optane) or (more likely) we need an alternate kv backend that is
>>>>>> targeting high-speed flash. I suspect the latter makes the most
>>>>>> sense, and I believe there are various efforts at Intel looking at
>>>>>> alternatives but no winner just yet.
>>>>>>
>>>>>> Looking a bit further out, I think a new kv library that natively
>>>>>> targets persistent memory (e.g., something built on pmem.io) will be
>>>>>> the right solution. Although at that point, it's probably a question
>>>>>> of whether we have pmem for metadata and 3D NAND for data or pure
>>>>>> pmem; in the latter case a complete replacement for bluestore would
>>>>>> make more sense.
>>>>>>
>>>>>>> For FTL, yes, it is a good idea; after we get the flame graph, we
>>>>>>> could discuss which parts could be improved by FTL, firmware, or
>>>>>>> even open channel.
>>>>>>
>>>>>> Yep!
>>>>>> sage
>>>>>>
>>>>>>> 2017-07-12 20:02 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
>>>>>>>>
>>>>>>>> On Wed, 12 Jul 2017, 攀刘 wrote:
>>>>>>>>>
>>>>>>>>> Hi Cephers,
>>>>>>>>>
>>>>>>>>> I did some experiments today to compare the latency between a
>>>>>>>>> P3500 (2T nvme SSD) and bluestore (fio + libfio_objectstore.so):
>>>>>>>>>
>>>>>>>>> For iodepth = 1, the random write latency of bluestore is
>>>>>>>>> 276.91us, compared with 14.71us for the raw SSD, a big overhead.
>>>>>>>>>
>>>>>>>>> I also tested iodepth = 16; still, there is a big overhead
>>>>>>>>> (143us -> 642us).
>>>>>>>>>
>>>>>>>>> What is your opinion?
>>>>>>>>
>>>>>>>> There is a lot of work that bluestore is doing over the raw device
>>>>>>>> as it is implementing all of the metadata tracking, checksumming,
>>>>>>>> allocation, and so on. There's definitely lots of room for
>>>>>>>> improvement, but I'm not sure you can expect to see latencies in
>>>>>>>> the 10s of us. That said, it would be interesting to see an updated
>>>>>>>> flamegraph to see where the time is being spent and where we can
>>>>>>>> slim this down. On a new nvme it's possible we can do away with
>>>>>>>> some of the complexity of, say, the allocator, since the FTL is
>>>>>>>> performing a lot of the same work anyway.
>>>>>>>>
>>>>>>>> sage
>>>>>>>
>
> --
> Best wishes
> Lisa