Re: latency comparison between 2TB NVMe SSD P3500 and bluestore


 



On 07/14/2017 07:54 AM, 攀刘 wrote:
Hi Sage and Mark,

I did experiment locally for fio+libfio_ceph_bluestore.fio,
iodepth = 32, numjobs=64
and got the result below:

1) Without using gdbprof

   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 75526 root      20   0 12.193g 7.056g 271672 R 99.0  5.6   4:20.98
bstore_kv_sync
 75527 root      20   0 12.193g 7.056g 271672 S 61.6  5.6   2:57.03
bstore_kv_final
 75504 root      20   0 12.193g 7.056g 271672 S 35.4  5.6   1:46.34 bstore_aio
 75524 root      20   0 12.193g 7.056g 271672 S 34.1  5.6   1:34.27 finisher
 75567 root      20   0 12.193g 7.056g 271672 S 10.6  5.6   0:26.32 fio

We can see that the bstore_kv_sync thread occupies nearly a full core (99%).

  write: IOPS=80.1k, BW=1251MiB/s (1312MB/s)(319GiB/260926msec)
    clat (usec): min=1792, max=52694, avg=25505.15, stdev=1731.77
     lat (usec): min=1852, max=52780, avg=25568.08, stdev=1732.09

From the fio output, the average latency is nearly 25.5 ms.

From the perfcounter in the link:
http://paste.openstack.org/show/615380/

We can see:
kv_commit_lat: 12.1 ms
kv_lat:               12.1 ms
state_kv_queued_lat :  9.98 X 2 = 19.9 ms
state_kv_commiting_lat: 4.67 ms

So we should try to improve state_kv_queued_lat, right?

2) using gdbprof:
http://paste.openstack.org/show/615386/
I pasted the stacks of the KVFinalizeThread and KVSyncThread threads.

Lots of time is spent bogged down in key comparison operations in rocksdb. I see that too, but usually in very high throughput scenarios. How fast is your CPU?

I'm hoping that the PR from Li will allow us to make the buffers/sst files smaller so we can reduce some of that overhead:

https://github.com/ceph/rocksdb/pull/19
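
For reference, the memtable and SST file sizes Mark mentions are governed by a handful of RocksDB options. The sketch below only illustrates which knobs are involved; the values are arbitrary examples and may not match what the PR above actually changes.

// Illustrative only: the RocksDB options that control memtable/SST sizes.
// Values are examples, not recommendations, and the database path is made up.
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <cassert>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Smaller memtables flush sooner, so each flush/compaction touches less data.
  options.write_buffer_size = 16 * 1048576;         // 16 MB memtable
  options.max_write_buffer_number = 4;
  // Smaller SST files keep individual compactions short.
  options.target_file_size_base = 16 * 1048576;     // 16 MB SST files
  options.max_bytes_for_level_base = 64 * 1048576;  // 64 MB at L1

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_size_demo", &db);
  assert(s.ok());
  delete db;
  return 0;
}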

Mark


I will continue investigating; please let me know if you have any thoughts.

Thanks
Pan

2017-07-14 17:31 GMT+08:00 Xiaoxi Chen <superdebuger@xxxxxxxxx>:
In the state_kv_commit stage, db->submit_transaction is called and all of the rocksdb insert-key logic happens there; as shown in gdbprof, this is where the key comparison and lookup occur. But db->submit_transaction sets sync=false, so the change only reaches the RocksDB WAL in memory and is not yet persisted to disk.

The *submit* you refer to just submits an empty transaction with sync=true to flush all of the previous WAL entries persistently to disk.
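
To make that two-step pattern concrete, here is a minimal standalone RocksDB sketch (not BlueStore's actual code; the key names and database path are made up): batches are written with sync=false, so they reach the WAL without an fsync, and a later near-empty write with sync=true flushes everything already in the WAL to disk.

// Minimal standalone sketch of the async-write-then-sync-flush pattern.
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>
#include <cassert>
#include <string>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  assert(rocksdb::DB::Open(opts, "/tmp/wal_demo", &db).ok());

  // Phase 1: the CPU-heavy part. Keys go into the memtable and are appended
  // to the WAL, but sync=false means no fsync yet.
  rocksdb::WriteOptions async_opts;
  async_opts.sync = false;
  for (int i = 0; i < 1000; ++i) {
    rocksdb::WriteBatch batch;
    batch.Put("key" + std::to_string(i), "value");
    assert(db->Write(async_opts, &batch).ok());
  }

  // Phase 2: the (sequential) IO-heavy part. A near-empty batch written with
  // sync=true fsyncs the WAL, making all of the earlier writes durable.
  rocksdb::WriteOptions sync_opts;
  sync_opts.sync = true;
  rocksdb::WriteBatch flush_marker;
  flush_marker.Put("commit_marker", "");
  assert(db->Write(sync_opts, &flush_marker).ok());

  delete db;
  return 0;
}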

Clearly kv_commit is CPU intensive and kv_submit is (sequential) IO intensive, so depending on the CPU/disk speed ratio one may see different profiling results. My previous test on HDD showed the opposite result: kv_lat was quite long.

2017-07-14 16:29 GMT+08:00 xiaoyan li <wisher2003@xxxxxxxxx>:
Here is the output of gdbprof: @Mark please have a look.
I copied _kv_sync_thread and _kv_finalize_thread here.
http://paste.openstack.org/show/615362/

Lisa

On Fri, Jul 14, 2017 at 9:54 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
Hi Li,

You may want to try my wallclock profiler to see where time is being spent
during your test.  It is located here:

https://github.com/markhpc/gdbprof

You can run it like:

sudo gdb -ex 'set pagination off' -ex 'attach <pid>' \
    -ex 'source /home/ubuntu/src/markhpc/gdbprof/gdbprof.py' \
    -ex 'profile begin' -ex 'quit'

Mark


On 07/13/2017 08:47 PM, xiaoyan li wrote:

Hi,
I am concerned about the impact of rocksdb on the whole bluestore IO path, so I did some tests with the bluestore fio plugin.
For example, I got the following data from the log when I ran the bluestore fio test with numjobs=64 and iodepth=32. It seems that for every txc, most of the time is spent in the queued and committing states.
state                     time span (us)
state_prepare_lat                    386
state_aio_wait_lat                   430
state_io_done_lat                      0
state_kv_queued_lat                 7926
state_kv_commiting_lat             30653
state_kv_done_lat                      4

        "state_kv_queued_lat": {
            "avgcount": 349076566,
            "sum": 1214245.959889817,
            "avgtime": 0.003478451
        },
        "state_kv_commiting_lat": {
            "avgcount": 174538283,
            "sum": 5612849.022306266,
            "avgtime": 0.032158268
        },


At the same time, each sync submit batches about 174538283/3509556 = 49 txcs and takes only 1024 us on average, which is much less than the committing latency of 30653 us.
        "kv_lat": {
            "avgcount": 3509556,
            "sum": 3594.365142193,
            "avgtime": 0.001024165
        },
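
As a quick sanity check on these counters, avgtime is just sum divided by avgcount, and dividing the two avgcounts gives the average number of txcs covered by each sync submit (assuming, as discussed above, that each kv_lat sample corresponds to one batched submit):

\[
\text{avgtime} = \frac{\text{sum}}{\text{avgcount}},\qquad
\frac{174538283}{3509556} \approx 49.7\ \text{txcs per sync submit},\qquad
\frac{3594.37}{3509556} \approx 0.00102\ \text{s} = 1.02\ \text{ms per submit}
\]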

The time between state_kv_queued_lat and state_kv_commiting_lat is measured here:

https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L8349

https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L8366

https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L7741
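
To illustrate where those two latencies come from, here is a self-contained toy model (not BlueStore's code; all names are invented): producers queue txcs, a single sync thread drains the queue in batches and does one simulated synchronous flush per batch, and the time from enqueue to pickup plays the role of state_kv_queued_lat while pickup to completion plays the role of state_kv_commiting_lat.

// Toy model of the queued -> committing pipeline; names are made up.
#include <chrono>
#include <condition_variable>
#include <deque>
#include <iostream>
#include <mutex>
#include <ratio>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Txc {
  Clock::time_point queued_at, picked_up_at, done_at;
};

std::mutex m;
std::condition_variable cv;
std::deque<Txc*> kv_queue;
bool stop = false;

void kv_sync_thread() {
  while (true) {
    std::vector<Txc*> batch;
    {
      std::unique_lock<std::mutex> l(m);
      cv.wait(l, [] { return stop || !kv_queue.empty(); });
      if (stop && kv_queue.empty()) return;
      // Drain everything queued so far: one sync flush covers the whole batch.
      while (!kv_queue.empty()) {
        batch.push_back(kv_queue.front());
        kv_queue.pop_front();
      }
    }
    auto pickup = Clock::now();
    for (auto* t : batch) t->picked_up_at = pickup;
    // Stand-in for per-txc submit plus one synchronous flush.
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    auto done = Clock::now();
    for (auto* t : batch) t->done_at = done;
  }
}

int main() {
  std::thread sync(kv_sync_thread);
  std::vector<Txc> txcs(200);
  for (auto& t : txcs) {
    t.queued_at = Clock::now();
    { std::lock_guard<std::mutex> l(m); kv_queue.push_back(&t); }
    cv.notify_one();
    std::this_thread::sleep_for(std::chrono::microseconds(50));
  }
  { std::lock_guard<std::mutex> l(m); stop = true; }
  cv.notify_one();
  sync.join();

  auto avg_us = [&](auto end_of, auto start_of) {
    double sum = 0;
    for (auto& t : txcs)
      sum += std::chrono::duration<double, std::micro>(end_of(t) - start_of(t)).count();
    return sum / txcs.size();
  };
  std::cout << "avg queued:     "
            << avg_us([](Txc& t) { return t.picked_up_at; },
                      [](Txc& t) { return t.queued_at; }) << " us\n"
            << "avg committing: "
            << avg_us([](Txc& t) { return t.done_at; },
                      [](Txc& t) { return t.picked_up_at; }) << " us\n";
  return 0;
}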

I am still investigating why so much time is spent in kv_commiting_lat, but from the above data I suspect the problem lies in rocksdb.
Please correct me if I have misunderstood anything.

Lisa


On Thu, Jul 13, 2017 at 12:34 AM, Xiaoxi Chen <superdebuger@xxxxxxxxx> wrote:

FWIW, one thing that a KV DB does provide is transaction support, which is important because we need to update several pieces of metadata (onode, allocator map, and the WAL for small writes) transactionally.
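
As a minimal illustration of that point (hypothetical key prefixes and values, not BlueStore's actual schema), a single RocksDB WriteBatch lets the onode update, the allocator update, and a deferred-write record become durable atomically:

// Several pieces of metadata in one batch: either all land or none do.
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>
#include <cassert>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  assert(rocksdb::DB::Open(opts, "/tmp/txn_demo", &db).ok());

  rocksdb::WriteBatch batch;
  batch.Put("O:object123", "serialized onode");       // object metadata
  batch.Put("B:0x1000",    "allocator bitmap delta"); // space allocation
  batch.Put("L:seq42",     "deferred small write");   // WAL-style record

  rocksdb::WriteOptions wo;
  wo.sync = true;
  assert(db->Write(wo, &batch).ok());  // one atomic, durable commit

  delete db;
  return 0;
}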



2017-07-13 0:25 GMT+08:00 攀刘 <liupan1111@xxxxxxxxx>:

Hi Sage,

Indeed, I have an idea that I have been holding onto for a long time.

Do we really need a heavy k/v database to store metadata, especially for fast disks? Introducing a third-party database also makes maintenance harder (though maybe that is just my limited database knowledge)...

Let's suppose:
1) The max number of PGs in one OSD is limited (in my experience, 100~200 PGs per OSD gives the best performance).
2) The max number of objects in one PG is limited, because of disk space.

Then, how about this: pre-allocate metadata locations in a metadata partition.

Partition the SSD into two or three partitions (the same as bluestore), but instead of using a kv database, store metadata directly in one disk partition (call it the metadata partition). Inside this metadata partition we store several data structures (sketched below):
1) A hash table of PGs: the key is the PG id and the value is another hash table (whose key is the object index within the PG and whose value is the object metadata, including the object's location in the data partition).
2) A free object location list.

And other extra things...
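
A rough C++ sketch of the in-memory shape of those structures (all names and fields are invented here, and serialization to the on-disk metadata partition is omitted):

// Per-PG hash table of object metadata plus a free list of pre-allocated
// object slots in the data partition.
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>

struct ObjectMeta {
  uint64_t data_offset = 0;   // object location in the data partition
  uint32_t data_len = 0;
  uint32_t flags = 0;         // checksum type, compression, etc.
};

using ObjectTable = std::unordered_map<std::string, ObjectMeta>;  // key: object index/name

struct MetadataPartition {
  std::unordered_map<uint64_t, ObjectTable> pgs;  // key: PG id
  std::list<uint64_t> free_slots;                 // free object locations

  // Allocate a slot for a new object and record its metadata.
  bool create_object(uint64_t pg_id, const std::string& oid, ObjectMeta& out) {
    if (free_slots.empty()) return false;
    out.data_offset = free_slots.front();
    free_slots.pop_front();
    pgs[pg_id][oid] = out;
    return true;
  }
};

int main() {
  MetadataPartition meta;
  meta.free_slots = {0, 4096, 8192};   // pretend pre-allocated 4K slots
  ObjectMeta m;
  m.data_len = 4096;
  meta.create_object(/*pg_id=*/1, "obj.0001", m);
  return 0;
}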

The max number of PGs belonging to one OSD can be limited by configuration options, so I believe the metadata partition would not need to be big. We could load all metadata into RAM if RAM is large enough, or keep part of it in RAM under LRU control, or just read, modify, and write back to disk when needed.

Do you think this idea is reasonable? At the very least, I believe this kind of new storage engine would be much faster.

Thanks
Pan

2017-07-12 21:55 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:

On Wed, 12 Jul 2017, 攀刘 wrote:

Hi Sage,

Yes, I totally understand that bluestore does much more than a raw disk, but the current overhead is a little too big for our usage. I will compare bluestore with XFS (which also does metadata tracking, allocation, and so on) to see whether XFS has a similar impact.

I will provide a flamegraph later, but from the perf counters we can see that most of the time is spent in "kv_lat".


That's rocksdb.  And yeah, I think it's pretty clear that either rocksdb needs some serious work to really keep up with nvme (or optane) or (more likely) we need an alternate kv backend that is targeting high speed flash.  I suspect the latter makes the most sense, and I believe there are various efforts at Intel looking at alternatives, but no winner just yet.

Looking a bit further out, I think a new kv library that natively targets persistent memory (e.g., something built on pmem.io) will be the right solution.  Although at that point, it's probably a question of whether we have pmem for metadata and 3D NAND for data, or pure pmem; in the latter case a complete replacement for bluestore would make more sense.

For the FTL, yes, it is a good idea; after we get the flame graph, we can discuss which parts could be improved by the FTL, the firmware, or even open-channel SSDs.


Yep!
sage









2017-07-12 20:02 GMT+08:00 Sage Weil <sage@xxxxxxxxxxxx>:

On Wed, 12 Jul 2017, 攀刘 wrote:

Hi Cephers,

I did some experiments today to compare the latency between one P3500 (2TB NVMe SSD) and bluestore (fio + libfio_objectstore.so):

For iodepth = 1, the random write latency of bluestore is 276.91 us, compared with 14.71 us for the raw SSD: a big overhead.

I also tested iodepth = 16; still, there is a big overhead (143 us -> 642 us).

What is your opinion?


There is a lot of work that bluestore is doing over the raw device as it is implementing all of the metadata tracking, checksumming, allocation, and so on.  There's definitely lots of room for improvement, but I'm not sure you can expect to see latencies in the 10s of us.  That said, it would be interesting to see an updated flamegraph to see where the time is being spent and where we can slim this down.  On a new nvme it's possible we can do away with some of the complexity of, say, the allocator, since the FTL is performing a lot of the same work anyway.

sage











--
Best wishes
Lisa


