FWIW, one thing that a KVDB can provide is transaction support, which is
important because we need to update several pieces of metadata (the onode,
the allocator map, and the WAL for small writes) transactionally.
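For illustration, here is a minimal sketch of the kind of atomic multi-key
commit a KV backend such as RocksDB provides; the keys, values, and store
path below are placeholders, not the real BlueStore schema:

#include <cassert>
#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/kv_txn_demo", &db);
  assert(s.ok());

  // Group the related metadata updates into one batch: either all of them
  // become durable or none of them do.
  rocksdb::WriteBatch batch;
  batch.Put("O::object_123::onode", "serialized onode");      // object metadata
  batch.Put("B::alloc_bitmap::0x10000", "allocator state");   // allocation map
  batch.Put("L::wal_seq_42", "deferred small-write payload"); // WAL entry

  rocksdb::WriteOptions wopts;
  wopts.sync = true;              // single commit point for the whole batch
  s = db->Write(wopts, &batch);
  assert(s.ok());

  delete db;
  return 0;
}

Without that batch semantic, a crash between the onode update and the
allocator update could leave the on-disk metadata inconsistent, which is
something a non-KV metadata layout would have to solve some other way.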
2017-07-13 0:25 GMT+08:00 攀刘 <liupan1111@xxxxxxxxx>:
> Hi Sage,
>
> Indeed, I have an idea that I have been holding onto for a long time.
>
> Do we really need a heavy k/v database to store metadata, especially
> for fast disks? Introducing a third-party database also makes
> maintenance harder (maybe because of my limited database knowledge).
>
> Let's suppose:
> 1) The max PG number in one OSD is limited (in my experience, 100~200
> PGs per OSD gives the best performance).
> 2) The max number of objects in one PG is limited, because of disk space.
>
> Then, how about this: pre-allocate metadata locations in a metadata partition.
>
> Partition an SSD into two or three partitions (same as bluestore), but
> instead of using a kv database, store metadata directly in one disk
> partition (call it the metadata partition). Inside this metadata
> partition, we store several data structures:
> 1) One hash table of PGs: the key is the PG id, and the value is another
> hash table (key is the object index within this PG, value is the object
> metadata and the object's location in the data partition).
> 2) A free object location list.
>
> And other extra things...
>
> The max number of PGs belonging to one OSD can be limited by options, so I
> believe the metadata partition should not be big. We could load all metadata
> into RAM if RAM is really big, or part of it controlled by an LRU, or just
> read, modify, and write back to disk when needed.
>
> Do you think this idea is reasonable? At least, I believe this kind of
> new storage engine would be much faster.
>
> Thanks
> Pan
>
> 2017-07-12 21:55 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
>> On Wed, 12 Jul 2017, 攀刘 wrote:
>>> Hi Sage,
>>>
>>> Yes, I totally understand that bluestore does much more than a raw
>>> disk, but the current overhead is a little too big for our usage. I
>>> will compare bluestore with XFS (which also does metadata tracking,
>>> allocation, and so on) to see whether XFS has a similar impact.
>>>
>>> I would like to provide a flamegraph later, but from the perf counters
>>> we can already see that most of the time is spent in "kv_lat".
>>
>> That's rocksdb. And yeah, I think it's pretty clear that either rocksdb
>> needs some serious work to really keep up with nvme (or optane) or (more
>> likely) we need an alternate kv backend that is targeting high speed
>> flash. I suspect the latter makes the most sense, and I believe there are
>> various efforts at Intel looking at alternatives, but no winner just yet.
>>
>> Looking a bit further out, I think a new kv library that natively targets
>> persistent memory (e.g., something built on pmem.io) will be the right
>> solution. Although at that point, it's probably a question of whether we
>> have pmem for metadata and 3D NAND for data, or pure pmem; in the latter
>> case a complete replacement for bluestore would make more sense.
>>
>>> For FTL, yes, it is a good idea. After we get the flame graph, we
>>> could discuss which parts could be improved by the FTL, firmware, or
>>> even open channel.
>>
>> Yep!
>> sage
>>
>>> 2017-07-12 20:02 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
>>> > On Wed, 12 Jul 2017, 攀刘 wrote:
>>> >> Hi Cephers,
>>> >>
>>> >> I did some experiments today to compare the latency of one
>>> >> P3500 (2T NVMe SSD) against bluestore (fio + libfio_objectstore.so):
>>> >>
>>> >> For iodepth = 1, the random write latency of bluestore is 276.91 us,
>>> >> compared with 14.71 us for the raw SSD: a big overhead.
>>> >>
>>> >> I also tested iodepth = 16; still a big overhead (143 us -> 642 us).
>>> >>
>>> >> What is your opinion?
>>> >
>>> > There is a lot of work that bluestore is doing over the raw device as it
>>> > is implementing all of the metadata tracking, checksumming, allocation,
>>> > and so on. There's definitely lots of room for improvement, but I'm
>>> > not sure you can expect to see latencies in the 10s of us. That said, it
>>> > would be interesting to see an updated flamegraph to see where the time is
>>> > being spent and where we can slim this down. On a new nvme it's possible
>>> > we can do away with some of the complexity of, say, the allocator, since
>>> > the FTL is performing a lot of the same work anyway.
>>> >
>>> > sage
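For illustration, here is a rough C++ sketch of the in-memory view of the
pre-allocated metadata partition Pan describes above. Every type, field, and
function name below is hypothetical; none of this is existing Ceph code:

#include <cstdint>
#include <unordered_map>
#include <vector>

// Per-object metadata kept in the metadata partition.
struct ObjectMeta {
  uint64_t data_offset = 0;   // object's location in the data partition
  uint32_t data_length = 0;
  uint64_t version = 0;
};

// One hash table per PG: object index -> object metadata.
struct PGTable {
  std::unordered_map<uint32_t, ObjectMeta> objects;
};

// The whole metadata partition: a hash table of PGs plus a free-slot list.
struct MetadataPartition {
  std::unordered_map<uint64_t, PGTable> pgs;      // key: PG id
  std::vector<uint64_t> free_object_locations;    // pre-allocated free slots

  // Reserve a slot for a new object in the given PG; returns false when the
  // data partition has no free slots left.
  bool create_object(uint64_t pg_id, uint32_t obj_idx, uint32_t length) {
    if (free_object_locations.empty())
      return false;
    ObjectMeta m;
    m.data_offset = free_object_locations.back();
    m.data_length = length;
    free_object_locations.pop_back();
    pgs[pg_id].objects[obj_idx] = m;
    return true;   // the modified tables still have to be written back to SSD
  }
};

Because the number of PGs per OSD and objects per PG are both bounded, the
tables can be sized up front; the open question such a design would still
have to answer is how the write-back of these tables is made crash
consistent, which is the part that a KV database's transactions cover today.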