Re: latency comparison between 2TB NVMe SSD P3500 and bluestore

Hi Sage,

Indeed, I have an idea that I have been holding onto for a long time.

Do we really need a heavyweight k/v database to store metadata,
especially for fast disks? Introducing a third-party database also
makes maintenance harder (though maybe that is just my limited
database knowledge talking)...

Let's suppose:
1) The maximum number of PGs in one OSD is limited (in my experience,
100~200 PGs per OSD gives the best performance).
2) The maximum number of objects in one PG is limited, because disk
space is finite.
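To get a feel for the sizes involved, here is a rough calculation (the
per-PG object cap and the 64-byte entry size are my own illustrative
assumptions, not measurements):

    200 PGs/OSD x 131,072 objects/PG  = ~26 million entries
    ~26 million entries x 64 B each   = ~1.7 GB of metadata

So even a fully pre-allocated metadata table fits comfortably in a
small SSD partition.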

Then, how about this: pre-allocate metadata locations in a metadata partition.

Partition the SSD into two or three partitions (the same as
bluestore), but instead of using a kv database, store the metadata
directly in one disk partition (call it the metadata partition).
Inside this metadata partition, we keep several data structures (a
rough C++ sketch follows below the list):
1) A hash table of PGs: the key is the PG id, and the value is another
hash table (keyed by the object index within the PG, whose value is
the object metadata plus the object's location in the data partition).
2) A free-object-location list.

And a few other auxiliary structures...
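Here is a minimal C++ sketch of that layout; all names, sizes, and the
fixed-slot hashing choice are illustrative assumptions, not a
worked-out format:

    #include <cstdint>

    constexpr uint32_t MAX_PGS         = 200;     // supposition (1): per-OSD PG cap
    constexpr uint32_t MAX_OBJS_PER_PG = 131072;  // supposition (2): per-PG object cap

    struct ObjectMeta {            // one 64-byte entry (illustrative size)
      uint64_t data_offset;        // object location in the data partition
      uint32_t data_len;
      uint32_t flags;
      uint8_t  attrs[48];          // fixed budget for inline attributes
    };

    struct PGTable {               // inner hash table: object index -> metadata
      uint64_t   pg_id;            // key of the outer table
      ObjectMeta objs[MAX_OBJS_PER_PG];
    };

    struct MetadataPartition {     // fixed layout at the head of the partition
      uint64_t magic;
      uint64_t free_head;          // head of the free-object-location list
      uint64_t pg_slot[MAX_PGS];   // outer hash table: PG id -> PGTable offset
      // PGTable slabs and the free list occupy the rest of the partition
    };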

The maximum number of PGs belonging to one OSD can be capped by
configuration options, so I believe the metadata partition would not
need to be big. We could load all of the metadata into RAM if RAM is
large enough, keep only part of it resident under LRU control (see the
sketch below), or simply read, modify, and write back to disk as
needed.
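For the LRU variant, a minimal sketch of a metadata-block cache; the
block size, the key type, and the read_block/write_block helpers are
all hypothetical stand-ins for the real disk path:

    #include <cstdint>
    #include <list>
    #include <unordered_map>
    #include <utility>

    struct MetaBlock { uint8_t bytes[4096]; };  // one chunk of the metadata partition

    class MetaCache {
      using Entry = std::pair<uint64_t, MetaBlock>;
      size_t capacity_;
      std::list<Entry> lru_;                    // front = most recently used
      std::unordered_map<uint64_t, std::list<Entry>::iterator> index_;

      // Hypothetical stubs standing in for the real metadata-partition I/O.
      MetaBlock read_block(uint64_t) { return MetaBlock{}; }
      void write_block(uint64_t, const MetaBlock&) {}

    public:
      explicit MetaCache(size_t cap) : capacity_(cap) {}

      MetaBlock& get(uint64_t block_no) {
        auto it = index_.find(block_no);
        if (it != index_.end()) {
          lru_.splice(lru_.begin(), lru_, it->second);  // touch: move to front
          return lru_.front().second;
        }
        if (lru_.size() >= capacity_) {                 // evict least recently used
          write_block(lru_.back().first, lru_.back().second);
          index_.erase(lru_.back().first);
          lru_.pop_back();
        }
        lru_.emplace_front(block_no, read_block(block_no));
        index_[block_no] = lru_.begin();
        return lru_.front().second;
      }
    };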

Do you think this idea is reasonable? At the very least, I believe
this kind of new storage engine would be much faster.

Thanks
Pan

2017-07-12 21:55 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
> On Wed, 12 Jul 2017, 攀刘 wrote:
>> Hi Sage,
>>
>> Yes, I totally understand that bluestore does much more than a raw
>> disk, but the current overhead is a little too big for our usage. I
>> will compare bluestore with XFS (which also does metadata tracking,
>> allocation, and so on) to see whether XFS shows a similar impact.
>>
>> I will share a flamegraph later, but from the perf counters we can
>> already see that most of the time is spent in "kv_lat".
>
> That's rocksdb.  And yeah, I think it's pretty clear that either rocksdb
> needs some serious work to really keep up with nvme (or optane) or (more
> likely) we need an alternate kv backend that is targeting high-speed
> flash.  I suspect the latter makes the most sense, and I believe there are
> various efforts at Intel looking at alternatives but no winner just yet.
>
> Looking a bit further out, I think a new kv library that natively targets
> persistent memory (e.g., something built on pmem.io) will be the right
> solution.  Although at that point, it's probably a question of whether we
> have pmem for metadata and 3D NAND for data or pure pmem; in the latter
> case a complete replacement for bluestore would make more sense.
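(To illustrate the pmem.io direction: with libpmem, a metadata update
becomes a memcpy plus pmem_persist instead of a write plus fsync. A
toy sketch, where the path and record format are assumptions and no
kv structure is implied:)

    #include <cstring>
    #include <libpmem.h>

    int main() {
      size_t mapped_len;
      int is_pmem;
      // Map (and create, if needed) a 1 MiB file on a pmem-aware filesystem.
      void *base = pmem_map_file("/mnt/pmem/meta", 1 << 20, PMEM_FILE_CREATE,
                                 0666, &mapped_len, &is_pmem);
      if (base == nullptr)
        return 1;

      const char entry[] = "pg1/obj42 -> data offset 0x10000";  // toy record
      std::memcpy(base, entry, sizeof(entry));   // update metadata in place
      if (is_pmem)
        pmem_persist(base, sizeof(entry));       // flush CPU caches to media
      else
        pmem_msync(base, sizeof(entry));         // fall back to msync()

      pmem_unmap(base, mapped_len);
      return 0;
    }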
>
>> As for the FTL: yes, it is a good idea. Once we have the flamegraph,
>> we can discuss which parts could be improved by the FTL, firmware, or
>> even Open-Channel SSDs.
>
> Yep!
> sage
>
>>
>> 2017-07-12 20:02 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:
>> > On Wed, 12 Jul 2017, 攀刘 wrote:
>> >> Hi Cephers,
>> >>
>> >> I ran some experiments today to compare the latency of one
>> >> P3500 (2TB NVMe SSD) against bluestore (fio + libfio_objectstore.so):
>> >>
>> >> At iodepth = 1, the random-write latency of bluestore is 276.91 us,
>> >> compared with 14.71 us for the raw SSD: a big overhead.
>> >>
>> >> I also tested iodepth = 16; there is still a big overhead (143 us -> 642 us).
>> >>
>> >> What is your opinion?
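(For reference, a fio job for a run like this would look roughly as
follows; the conf and directory values are guesses at the local setup,
and the engine-loading line assumes fio's external-engine convention:)

    [global]
    ioengine = external:libfio_objectstore.so  ; the plugin named above
    conf = ceph-bluestore.conf                 ; assumed: a minimal ceph.conf
    directory = /mnt/bluestore-fio             ; assumed: scratch dir for the store
    time_based = 1
    runtime = 60

    [randwrite-qd1]
    rw = randwrite
    bs = 4k
    iodepth = 1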
>> >
>> > There is a lot of work that bluestore is doing over the raw device as it
>> > is implementing all of the metadata tracking, checksumming, allocation,
>> > and so on.  There's definitely lots of room for improvement, but I'm
>> > not sure you can expect to see latencies in the 10s of us.  That said, it
>> > would be interesting to see an updated flamegraph to see where the time is
>> > being spent and where we can slim this down.  On a new nvme it's possible
>> > we can do away with some of the complexity of, say, the allocator, since
>> > the FTL is performing a lot of the same work anyway.
>> >
>> > sage
>>