"state_kv_commiting_lat - kv_lat" mean the latency for thread " _kv_finalize_thread". If is this correctly? Jianpeng -----Original Message----- From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of xiaoyan li Sent: Friday, July 14, 2017 9:47 AM To: Xiaoxi Chen <superdebuger@xxxxxxxxx> Cc: 攀刘 <liupan1111@xxxxxxxxx>; Sage Weil <sage@xxxxxxxxxxxx>; Ceph Development <ceph-devel@xxxxxxxxxxxxxxx>; p.zhou@xxxxxxxxxxxxxxx; 20702390@xxxxxx Subject: Re: latency compare between 2t NVME SSD P3500 and bluestore Hi, I am concerned about the rocksdb impact on bluestore whole IO path. I did some test with bluestore fio plugin. For example, I got following data from the log when I did bluestore fio test with numjobs=64 and iopath=32. It seems that for every txc, most of the time spends on queued and commiting states. state time span(us) state_prepare_lat 386 state_aio_wait_lat 430 state_io_done_lat 0 state_kv_queued_lat 7926 state_kv_commiting_lat 30653 state_kv_done_lat 4 "state_kv_queued_lat": { "avgcount": 349076566, "sum": 1214245.959889817, "avgtime": 0.003478451 }, "state_kv_commiting_lat": { "avgcount": 174538283, "sum": 5612849.022306266, "avgtime": 0.032158268 }, And same time, to submit (174538283/3509556 = 49) txcs every time only takes 1024us, which is much less than commiting_lat 30653us. "kv_lat": { "avgcount": 3509556, "sum": 3594.365142193, "avgtime": 0.001024165 }, The time between state_kv_queued_lat and state_kv_commiting_lat: https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L8349 https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L8366 https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L7741 I am still investigating why it spends so long time on kv_commiting_lat, but from above data I doubt it is the problem of rocksdb. Please correct me if I misunderstood anything. 
Lisa

On Thu, Jul 13, 2017 at 12:34 AM, Xiaoxi Chen <superdebuger@xxxxxxxxx> wrote:
> FWIW, one thing that a KV DB can provide is transaction support, which is
> important since we need to update several pieces of metadata (onode,
> allocator map, and the WAL for small writes) transactionally.
>
>
> 2017-07-13 0:25 GMT+08:00 攀刘 <liupan1111@xxxxxxxxx>:
>> Hi Sage,
>>
>> Indeed, I have had an idea in mind for a long time.
>>
>> Do we really need a heavyweight k/v database to store metadata, especially
>> for fast disks? Introducing a third-party database also makes
>> maintenance harder (maybe because of my limited database
>> knowledge)...
>>
>> Let's suppose:
>> 1) The max PG number on one OSD is limited (in my experience, 100~200
>> PGs per OSD gives the best performance).
>> 2) The max number of objects in one PG is limited by disk space.
>>
>> Then, how about this: pre-allocate metadata locations in a metadata
>> partition.
>>
>> Partition the SSD into two or three partitions (same as bluestore), but
>> instead of using a kv database, store metadata directly in one disk
>> partition (call it the metadata partition). Inside this metadata
>> partition, we store several data structures:
>> 1) One hash table of PGs: the key is the PG id, and the value is another
>> hash table (whose key is the object index within the PG and whose value is
>> the object metadata and the object location in the data partition).
>> 2) A free object location list.
>>
>> And other extra things...
>>
>> The max number of PGs on one OSD can be limited by options, so I
>> believe the metadata partition should not be big. We could load all
>> metadata into RAM if RAM is really big, or part of it, controlled
>> by an LRU, or just read, modify, and write back to disk when needed.
>>
>> Do you think this idea is reasonable? At least, I believe this kind of
>> new storage engine would be much faster.
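[A minimal sketch of the layout Pan describes, with all names hypothetical: a per-PG hash table mapping object keys to metadata, plus a free list of object slots in the data partition, in place of a kv database.]

```python
# Sketch of the proposed metadata partition (names are illustrative only):
# a hash table of PGs -> per-PG hash table of object metadata, plus a
# free list of pre-allocated object locations in the data partition.

class ObjectMeta:
    def __init__(self, location, size=0):
        self.location = location   # object slot in the data partition
        self.size = size           # plus checksums, omap refs, etc.

class MetadataPartition:
    def __init__(self, num_slots):
        self.pgs = {}                       # pg_id -> {obj_key: ObjectMeta}
        self.free = list(range(num_slots))  # free object locations

    def create(self, pg_id, obj_key, size):
        loc = self.free.pop()               # allocate a data-partition slot
        self.pgs.setdefault(pg_id, {})[obj_key] = ObjectMeta(loc, size)
        return loc

    def remove(self, pg_id, obj_key):
        meta = self.pgs[pg_id].pop(obj_key)
        self.free.append(meta.location)     # return the slot to the free list

m = MetadataPartition(num_slots=1024)
loc = m.create(pg_id=1, obj_key="rbd_data.1", size=4096)
```

[Note the tradeoff Xiaoxi raises above: without a transactional kv store, the onode update, the free-list update, and the small-write WAL would all have to be made crash-consistent by hand.]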
>>
>> Thanks
>> Pan
>>
>> 2017-07-12 21:55 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
>>> On Wed, 12 Jul 2017, 攀刘 wrote:
>>>> Hi Sage,
>>>>
>>>> Yes, I totally understand that bluestore does much more than a raw
>>>> disk, but the current overhead is a little too big for our usage. I
>>>> will compare bluestore with XFS (which also does metadata tracking,
>>>> allocation, and so on) to see whether XFS has a similar impact.
>>>>
>>>> I would like to provide a flamegraph later, but from the perf counters
>>>> we can see that most of the time was spent in "kv_lat".
>>>
>>> That's rocksdb.  And yeah, I think it's pretty clear that either
>>> rocksdb needs some serious work to really keep up with nvme (or
>>> optane) or (more likely) we need an alternate kv backend that targets
>>> high-speed flash.  I suspect the latter makes the most sense, and I
>>> believe there are various efforts at Intel looking at alternatives,
>>> but no winner just yet.
>>>
>>> Looking a bit further out, I think a new kv library that natively
>>> targets persistent memory (e.g., something built on pmem.io) will be
>>> the right solution.  Although at that point, it's probably a
>>> question of whether we have pmem for metadata and 3D NAND for data
>>> or pure pmem; in the latter case a complete replacement for bluestore
>>> would make more sense.
>>>
>>>> For the FTL, yes, it is a good idea; after we get the flame graph we
>>>> could discuss which parts could be improved by the FTL, firmware, or
>>>> even open channel.
>>>
>>> Yep!
>>> sage
>>>
>>>>
>>>> 2017-07-12 20:02 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
>>>> > On Wed, 12 Jul 2017, 攀刘 wrote:
>>>> >> Hi Cephers,
>>>> >>
>>>> >> I did some experiments today to compare the latency between one
>>>> >> P3500 (2T NVMe SSD) and bluestore (fio + libfio_objectstore.so):
>>>> >>
>>>> >> For iodepth = 1, the random write latency of bluestore is
>>>> >> 276.91 us, compared with 14.71 us for the raw SSD: a big overhead.
>>>> >>
>>>> >> I also tested iodepth = 16. Still, there is a big overhead (143 us
>>>> >> -> 642 us).
>>>> >>
>>>> >> What is your opinion?
>>>> >
>>>> > There is a lot of work that bluestore is doing over the raw
>>>> > device, as it is implementing all of the metadata tracking,
>>>> > checksumming, allocation, and so on.  There's definitely lots of
>>>> > room for improvement, but I'm not sure you can expect to see
>>>> > latencies in the 10s of us.  That said, it would be interesting
>>>> > to see an updated flamegraph to see where the time is being spent
>>>> > and where we can slim this down.  On a new nvme it's possible we
>>>> > can do away with some of the complexity of, say, the allocator,
>>>> > since the FTL is performing a lot of the same work anyway.
>>>> >
>>>> > sage

--
Best wishes
Lisa
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel"
in the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html