Re: Ceph Bluestore OSD CPU utilization

On 07/31/2017 01:29 PM, Jianjian Huo wrote:
On Sat, Jul 29, 2017 at 8:34 PM, Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote:


On 07/28/2017 03:57 PM, Jianjian Huo wrote:

Hi Mark,

On Wed, Jul 26, 2017 at 8:55 PM, Mark Nelson <mark.a.nelson@xxxxxxxxx>
wrote:

yeah, metrics and profiling data would be good at this point.  The standard gauntlet of collectl/iostat, gdbprof or poorman's profiling, perf, blktrace, etc.  Don't necessarily need everything, but if anything interesting shows up it would be good to see it.
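
For reference, the sort of thing I usually run -- a rough sketch, where $OSD_PID, the device name, and the durations are assumptions to adjust for your environment:

  # CPU profile of a running OSD
  perf record -g -p $OSD_PID -- sleep 60
  perf report --stdio | head -50

  # poor man's profiler: a few whole-process stack snapshots via gdb
  for i in 1 2 3 4 5; do
    gdb -batch -ex 'thread apply all bt' -p $OSD_PID > osd.stacks.$i
    sleep 10
  done

  # block-level trace of the data device (assuming /dev/sdd)
  blktrace -d /dev/sdd -o osd0 -w 60
  blkparse -i osd0 | less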

Also, turning on rocksdb bloom filters is worth doing if it hasn't been done yet (happening in master soon via https://github.com/ceph/ceph/pull/16450).
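
In the meantime, something roughly like the following in ceph.conf may turn them on.  Treat it as an untested sketch: the exact option string is whatever that PR settles on, and since "bluestore rocksdb options" replaces the default string rather than merging with it, you'd want to carry your existing options along:

  [osd]
  # hypothetical: 10-bits-per-key bloom filter on the block-based table
  bluestore rocksdb options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,block_based_table_factory={filter_policy=bloomfilter:10:false}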

FWIW, I'm tracking down what I think is a sequential write regression vs earlier versions of bluestore, but haven't figured out what's going on yet or even how much of a regression we are facing (these tests are on much bigger volumes than previously tested).

Mark


For bluestore sequential writes, from our testing with master branch
two days ago, ec sequential writes (16K and 128K) were 2~3 times
slower than 3x sequential writes. From your earlier testing, bluestore
ec sequential writes were faster than 3x in all IO size cases. Is this
some sort of regression you are aware of?

Jianjian


I wouldn't necessarily expect small EC sequential writes to do well vs 3x replication.  It might depend on the disk configuration and definitely on client side WB cache (this is tricky because RBD cache has some locking limitations that become apparent at high IOPS rates / volume).  For large writes though I've seen EC faster (somewhere between 2x and 3x replication).  These numbers are almost 5 months old now (and there have been some bluestore performance improvements since then), but here's what I was seeing for RBD EC overwrites last March (scroll to the right for graphs):

https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZbE50QUdtZlBxdFU

Thanks for sharing this data, Mark.
From your data from last March, for RBD EC overwrites on NVMe, EC sequential writes are faster than 3X for all IO sizes, including small 4K/16KB. Is that right? I am not seeing this on my setup (all NVMe drives, 12 of them per node); in my case EC sequential writes are 2~3 times slower than 3X. Maybe I have too many drives per node?

Jianjian

Maybe, or maybe it's a regression!  I'm focused on the bitmap allocator right now, but if I have time I'll try to reproduce those older test results on master.  Maybe if you have time, see if you get the same results if you try bluestore from Jan/Feb?
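
If you go that route, a minimal sketch for grabbing a master build from that window (the cutoff date here is an assumption -- pick whatever commit matches the earlier tests):

  git clone https://github.com/ceph/ceph.git && cd ceph
  git checkout $(git rev-list -1 --before=2017-03-01 origin/master)
  git submodule update --init --recursive
  ./do_cmake.sh && cd build && make -j$(nproc)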

Mark


FWIW, the regression I might be seeing (if it is actually a regression) appears to be limited to RBD block creation rather than writes to existing blocks, i.e. pre-filling volumes is slower than just creating objects of the same size via rados bench.  It's pretty limited in scope.

Mark






On 07/26/2017 09:40 PM, Brad Hubbard wrote:


Bumping this as I was talking to Junqin in IRC today and he reported it is still an issue. I suggested analysis of metrics and profiling data to try to determine the bottleneck for bluestore, and also suggested Junqin open a tracker so we can investigate this thoroughly.

Mark, did you have any additional thoughts on how this might best be attacked?


On Thu, Jul 13, 2017 at 11:37 PM, Junqin JQ7 Zhang <zhangjq7@xxxxxxxxxx>
wrote:


Hi Mark,

Thanks for your reply.

Our SSD model is:
Device Model:     SSDSC2BA800G4N
Intel SSD DC S3710 Series 800GB

And BlueStore OSD configure is as I posted before
[osd.0]
host = ceph-1
osd data = /var/lib/ceph/osd/ceph-0    # a 100M SSD partition
bluestore block db path = /dev/sda5    # a 10G SSD partition
bluestore block wal path = /dev/sda6  # a 10G SSD partition
bluestore block path = /dev/sdd            # a HDD disk

The iostat output is a quick snapshot of the terminal screen during an 8K write. I forget the detailed test configuration; I can only confirm that it was an 8K random write.
We have since re-set up the cluster, so I can't get the data right now, but we will run the test again in the coming days.

Is there any special BlueStore configuration in your lab tests? For example, how are the BlueStore OSDs configured? Could you share your lab's BlueStore configuration, e.g. the ceph.conf file?

Thanks a lot!

B.R.
Junqin Zhang

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx
[mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
Sent: Wednesday, July 12, 2017 11:29 PM
To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
Subject: Re: Ceph Bluestore OSD CPU utilization

Hi Junqin

On 07/12/2017 05:21 AM, Junqin JQ7 Zhang wrote:


Hi Mark,

We also compared iostat of filestore and bluestore.
The disk write rate of bluestore is only around 10% of filestore's in the same test case.

Here is FileStore iostat during write
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           13.06    0.00    9.84   11.52    0.00   65.58

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00 8196.00     0.00 73588.00    17.96     0.52    0.06    0.00    0.06   0.04  31.90
sdb               0.00     0.00    0.00 8298.00     0.00 75572.00    18.21     0.54    0.07    0.00    0.07   0.04  33.00
sdh               0.00  4894.00    0.00  741.00     0.00 30504.00    82.33   207.60  314.51    0.00  314.51   1.35 100.10
sdj               0.00  1282.00    0.00  938.00     0.00 15652.00    33.37    14.40   16.04    0.00   16.04   0.90  84.10
sdk               0.00  5156.00    0.00  847.00     0.00 34560.00    81.61   199.04  283.83    0.00  283.83   1.18 100.10
sdd               0.00  6889.00    0.00  729.00     0.00 38216.00   104.84   138.60  198.14    0.00  198.14   1.37 100.00
sde               0.00  6909.00    0.00  763.00     0.00 38608.00   101.20   139.16  190.55    0.00  190.55   1.31 100.00
sdf               0.00  3237.00    0.00  708.00     0.00 30548.00    86.29   175.15  310.36    0.00  310.36   1.41  99.80
sdg               0.00  4875.00    0.00  745.00     0.00 32312.00    86.74   207.70  291.26    0.00  291.26   1.34 100.00
sdi               0.00  7732.00    0.00  812.00     0.00 42136.00   103.78   140.94  181.96    0.00  181.96   1.23 100.00

Here is BlueStore iostat during write
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            6.50    0.00    3.22    2.36    0.00   87.91

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00 2938.00     0.00 25072.00    17.07     0.14    0.05    0.00    0.05   0.04  12.70
sdb               0.00     0.00    0.00 2821.00     0.00 26112.00    18.51     0.15    0.05    0.00    0.05   0.05  12.90
sdh               0.00     1.00    0.00  510.00     0.00  3600.00    14.12     5.45   10.68    0.00   10.68   0.24  12.00
sdj               0.00     0.00    0.00  424.00     0.00  3072.00    14.49     4.24   10.00    0.00   10.00   0.22   9.30
sdk               0.00     0.00    0.00  496.00     0.00  3584.00    14.45     4.10    8.26    0.00    8.26   0.18   9.10
sdd               0.00     0.00    0.00  419.00     0.00  3080.00    14.70     3.60    8.60    0.00    8.60   0.19   7.80
sde               0.00     0.00    0.00  650.00     0.00  3784.00    11.64    24.39   40.19    0.00   40.19   1.15  74.60
sdf               0.00     0.00    0.00  494.00     0.00  3584.00    14.51     5.92   11.98    0.00   11.98   0.26  12.90
sdg               0.00     0.00    0.00  493.00     0.00  3584.00    14.54     5.11   10.37    0.00   10.37   0.23  11.20
sdi               0.00     0.00    0.00  744.00     0.00  4664.00    12.54   121.41  177.66    0.00  177.66   1.35 100.10

sda and sdb are SSDs; the others are HDDs.



Earlier it looked like you were posting the configuration for an 8k randrw test, but this is a pure write test?  Can you provide the test configuration for these results?  Also, the SSD model would be useful to know.

Having said that, these results look pretty different from what I typically see in the lab.  A big clue is the avgrq-sz: on filestore you are seeing much larger write requests than with bluestore.  That might indicate that metadata writes are going to the HDD.  Is this still with the 10GB DB partition?
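
A quick way to check whether that's happening (counter names from memory, so verify against your build):

  # nonzero slow_used_bytes means rocksdb has spilled past the SSD db
  # partition onto the HDD
  ceph daemon osd.0 perf dump | grep -E 'db_used_bytes|slow_used_bytes'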

Mark


-----Original Message-----
From: Junqin JQ7 Zhang
Sent: Wednesday, July 12, 2017 10:45 AM
To: 'Mark Nelson'; Mark Nelson; Ceph Development
Subject: RE: Ceph Bluestore OSD CPU utilization

Hi Mark,

Actually, we tested filestore on the same Ceph version v12.1.0 and the same cluster.
# ceph -v
ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086)
luminous (dev)

CPU utilization of each OSD on filestore can reach a maximum of around 200%, but CPU utilization of each OSD on bluestore is only around 30%. As a result, BlueStore's performance is only about 20% of filestore's.
We think there must be something wrong with our configuration.

I tried to change the ceph config, e.g.
osd op threads = 8
osd disk threads = 4

but still can't get a good result.

Any idea of this?

BTW, we changed some filestore-related settings during the test:
filestore fd cache size = 2048576000
filestore fd cache shards = 16
filestore async threads = 0
filestore max sync interval = 15
filestore wbthrottle enable = false
filestore commit timeout = 1200
filestore_op_thread_suicide_timeout = 0
filestore queue max ops = 1048576
filestore queue max bytes = 17179869184
max open files = 262144
filestore fadvise = false
filestore ondisk finisher threads = 4
filestore op threads = 8

Thanks a lot!

B.R.
Junqin Zhang
-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx
[mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
Sent: Tuesday, July 11, 2017 11:47 PM
To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
Subject: Re: Ceph Bluestore OSD CPU utilization



On 07/11/2017 10:31 AM, Junqin JQ7 Zhang wrote:


Hi Mark,

Thanks for your reply.

The hardware below is the same for each of the 3 hosts:
2 SATA SSDs and 8 HDDs



The model of SSD could potentially be very important here.  The devices we test in our lab are enterprise grade SSDs with power loss protection.  That means they don't have to flush data on sync requests, so O_DSYNC writes are much faster as a result.  I don't know how bad an impact this has on the rocksdb wal/db, but it definitely hurts with filestore journals.

Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Network: 20000Mb/s

I configured OSD like
[osd.0]
host = ceph-1
osd data = /var/lib/ceph/osd/ceph-0        # a 100M partition of SSD
bluestore block db path = /dev/sda5         # a 10G partition of SSD



Bluestore automatically rolls rocksdb data over to the HDD when the db gets full.  I bet with 10GB you'll see good performance at first, and then you'll start seeing lots of extra reads/writes on the HDD once it fills up with metadata (the more extents that are written out, the more likely you'll hit this boundary).  You'll want to make the db partitions use the majority of the SSD(s).
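
For example, something like this to give a DB partition most of an 800GB SSD -- a hypothetical layout where partition numbers and sizes are assumptions, and note that resizing means redeploying the OSDs on that device:

  # recreate the 10G db partition at ~300GB (hypothetical numbering)
  sgdisk --delete=5 /dev/sda
  sgdisk --new=5:0:+300G --change-name=5:'bluestore db' /dev/sda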

bluestore block wal path = /dev/sda6       # a 10G partition of SSD



The WAL can be smaller.  1-2GB is enough (potentially even less if you adjust the rocksdb buffer settings, but 1-2GB should be small enough to devote most of your SSDs to DB storage).
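
By "rocksdb buffer settings" I mean something along these lines -- illustrative values only, and again this string replaces the defaults rather than merging with them:

  [osd]
  # hypothetical sketch: smaller memtables so a 1-2GB WAL device suffices
  bluestore rocksdb options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=67108864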

bluestore block path = /dev/sdd                # a HDD disk

We use fio to test one or more 100G RBDs. An example of our fio config:
[global]
ioengine=rbd
clientname=admin
pool=rbd
rw=randrw
bs=8k
runtime=120
iodepth=16
numjobs=4



With the rbd engine I try to avoid numjobs as it can give erroneous results in some cases.  It's probably better generally to stick with multiple independent fio processes (though in this case, for a randrw workload, it might not matter).
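
E.g., something along these lines instead of numjobs (the image names here are assumptions, adapt to your naming):

  # four independent fio processes, one per RBD image
  for i in 0 1 2 3; do
    fio --name=rbd$i --ioengine=rbd --clientname=admin --pool=rbd \
        --rbdname=testimage_100GB_$i --rw=randrw --rwmixread=0 \
        --bs=8k --iodepth=16 --runtime=120 --direct=1 &
  done
  wait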

direct=1
rwmixread=0
new_group
group_reporting
[rbd_image0]
rbdname=testimage_100GB_0

Any suggestions?



What kind of performance are you seeing and what do you expect to get?

Mark

Thanks.

B.R.
Junqin zhang

-----Original Message-----
From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
Sent: Tuesday, July 11, 2017 7:32 PM
To: Junqin JQ7 Zhang; Ceph Development
Subject: Re: Ceph Bluestore OSD CPU utilization

Ugh, small sequential *reads* I meant to say.  :)

Mark

On 07/11/2017 06:31 AM, Mark Nelson wrote:


Hi Junqin,

Can you tell us your hardware configuration (models and quantities
of cpus, network cards, disks, ssds, etc) and the command and
options you used to measure performance?

In many cases bluestore is faster than filestore, but there are a
couple of cases where it is notably slower, the big one being when
doing small sequential writes without client-side readahead.
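
If you want to experiment with client-side readahead for that case, the librbd knobs look roughly like this -- illustrative values, not tuned recommendations:

  [client]
  rbd readahead trigger requests = 10        # sequential reads before readahead kicks in
  rbd readahead max bytes = 4194304          # read ahead up to 4MB
  rbd readahead disable after bytes = 0      # keep readahead on indefinitely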

Mark

On 07/11/2017 05:34 AM, Junqin JQ7 Zhang wrote:


Hi,

I installed Ceph luminous v12.1.0 in a 3-node cluster with BlueStore and ran some fio tests.
During the tests, I found that each OSD's CPU utilization was only around 30%, and the performance does not seem good to me.
Is there any configuration that would help increase OSD CPU utilization and improve performance?
Changing kernel.pid_max? Any BlueStore-specific configuration?

Thanks a lot!

B.R.
Junqin Zhang