Re: Significant slowdown of osds since v0.67 Dumpling

Mike Dawson <mike.dawson@xxxxxxxxxxxx> · Thu, 22 Aug 2013 11:18:52 -0400

Jumping in pretty late on this thread, but I can confirm much higher CPU 
load on ceph-osd using 0.67.1 compared to 0.61.7 under a write-heavy RBD 
workload. Under my workload, it seems like it might be 2x-5x higher CPU 
load per process.

Thanks,
Mike Dawson

On 8/22/2013 4:41 AM, Oliver Daudey wrote:
Hey Samuel,

On wo, 2013-08-21 at 20:27 -0700, Samuel Just wrote:
I think the rbd cache one you'd need to run for a few minutes to get
meaningful results.  It should stabilize somewhere around the actual
throughput of your hardware.

Ok, I now also ran this test on Cuttlefish as well as Dumpling.

Cuttlefish:
# rbd bench-write test --rbd-cache
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
   SEC       OPS   OPS/SEC   BYTES/SEC
     1     13265  13252.45  3466029.67
     2     25956  12975.60  3589479.95
     3     38475  12818.61  3598590.70
     4     50184  12545.16  3530516.34
     5     59263  11852.22  3292258.13
<...>
   300   3421530  11405.08  3191555.35
   301   3430755  11397.83  3189251.09
   302   3443345  11401.73  3190694.98
   303   3455230  11403.37  3191478.97
   304   3467014  11404.62  3192136.82
   305   3475355  11394.57  3189525.71
   306   3488067  11398.90  3190553.96
   307   3499789  11399.96  3190770.21
   308   3510566  11397.93  3190289.49
   309   3519829  11390.98  3188620.93
   310   3532539  11395.25  3189544.03

Dumpling:
# rbd bench-write test --rbd-cache
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
   SEC       OPS   OPS/SEC   BYTES/SEC
     1     13201  13194.63  3353004.50
     2     25926  12957.05  3379695.03
     3     36624  12206.06  3182087.11
     4     46547  11635.35  3035794.95
     5     59290  11856.27  3090389.79
<...>
   300   3405215  11350.66  3130092.00
   301   3417789  11354.76  3131106.34
   302   3430067  11357.83  3131933.41
   303   3438792  11349.14  3129734.88
   304   3450237  11349.45  3129689.62
   305   3462840  11353.53  3130406.43
   306   3473151  11350.17  3128942.32
   307   3482327  11343.00  3126771.34
   308   3495020  11347.44  3127502.07
   309   3506894  11349.13  3127781.70
   310   3516532  11343.65  3126714.62

As you can see, the result is virtually identical.  What jumps out
during the cached tests, is that the CPU used by the OSDs is negligible
in both cases, while without caching, the OSDs get loaded quite well.
Perhaps the cache masks the problem we're seeing in Dumpling somehow?
And I'm not changing anything but the OSD-binary during my tests, so
cache-settings used in VMs are identical in both scenarios.

Hmm, 10k ios I guess is only 10 rbd chunks.  What replication level
are you using?  Try setting them to 10000000 (you only need to set the
xfs ones).

For the rand test, try increasing
filestore_wbthrottle_xfs_inodes_hard_limit and
filestore_wbthrottle_xfs_inodes_start_flusher to 10000 as well as
setting the above ios limits.

Ok, my current config:
         filestore wbthrottle xfs ios start flusher = 10000000
         filestore wbthrottle xfs ios hard limit = 10000000
         filestore wbthrottle xfs inodes hard limit = 10000
         filestore wbthrottle xfs inodes start flusher = 10000

Unfortunately, that still makes no difference at all in the original
standard-tests.

Random IO on Dumpling, after 120 secs of runtime:
# rbd bench-write test --io-pattern=rand
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern rand
   SEC       OPS   OPS/SEC   BYTES/SEC
     1       545    534.98  1515804.02
     2      1162    580.80  1662416.60
     3      1731    576.52  1662966.61
     4      2317    579.04  1695129.94
     5      2817    562.56  1672754.87
<...>
   120     43564    362.91  1080512.00
   121     43774    361.76  1077368.28
   122     44419    364.06  1083894.31
   123     45046    366.22  1090518.68
   124     45287    364.01  1084437.37
   125     45334    361.54  1077035.12
   126     45336    359.40  1070678.36
   127     45797    360.60  1073985.78
   128     46388    362.40  1080056.75
   129     46984    364.21  1086068.63
   130     47604    366.11  1092712.51

Random IO on Cuttlefish, after 120 secs of runtime:
rbd bench-write test --io-pattern=rand
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern rand
   SEC       OPS   OPS/SEC   BYTES/SEC
     1      1066   1065.54  3115713.13
     2      2099   1049.31  2936300.53
     3      3218   1072.32  3028707.50
     4      4026   1003.23  2807859.15
     5      4272    793.80  2226962.63
<...>
   120     66935    557.79  1612483.74
   121     68011    562.01  1625419.34
   122     68428    558.59  1615376.62
   123     68579    557.06  1610780.38
   125     68777    549.73  1589816.94
   126     69745    553.52  1601671.46
   127     70855    557.91  1614293.12
   128     71962    562.20  1627070.81
   129     72529    562.22  1627120.59
   130     73146    562.66  1628818.79

Confirming your setting took properly:
# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep
wbthrottle
   "filestore_wbthrottle_btrfs_bytes_start_flusher": "41943040",
   "filestore_wbthrottle_btrfs_bytes_hard_limit": "419430400",
   "filestore_wbthrottle_btrfs_ios_start_flusher": "500",
   "filestore_wbthrottle_btrfs_ios_hard_limit": "5000",
   "filestore_wbthrottle_btrfs_inodes_start_flusher": "500",
   "filestore_wbthrottle_xfs_bytes_start_flusher": "41943040",
   "filestore_wbthrottle_xfs_bytes_hard_limit": "419430400",
   "filestore_wbthrottle_xfs_ios_start_flusher": "10000000",
   "filestore_wbthrottle_xfs_ios_hard_limit": "10000000",
   "filestore_wbthrottle_xfs_inodes_start_flusher": "10000",
   "filestore_wbthrottle_btrfs_inodes_hard_limit": "5000",
   "filestore_wbthrottle_xfs_inodes_hard_limit": "10000",

As you can see, your suggested settings made little difference anywhere
after you let it settle and in the end, there is still a huge difference
in performance between Cuttlefish and Dumpling.  Caching seems to mask
the problem, though, at least with the type of IO generated by these
tests.  It also reduces CPU-usage by the OSDs to almost none.

    Regards,

       Oliver

-Sam

On Wed, Aug 21, 2013 at 8:13 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
Hey Samuel,

On wo, 2013-08-21 at 18:33 -0700, Samuel Just wrote:
I am dumb.  There *has* been a change in the osd which can account for
this: the wbthrottle limits.  We added some logic to force the kernel
to start flushing writes out earlier, normally a good thing.  In this
case, it's probably doing an fsync every 500 writes.

Can you run 3 tests?
1) rerun with --rbd-cache (rbd bench-write test --rbd-cache)

That flies, with very little load on the OSDs.  That load is much higher
without caching.  This test was on Dumpling.  You get the idea.

# rbd bench-write test --rbd-cache
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
   SEC       OPS   OPS/SEC   BYTES/SEC
     1     13832  13817.17  3678357.12
     2     27590  13793.28  3747373.45
     3     41144  13708.02  3781505.90
     4     54510  13622.96  3744543.19
     5     67249  13448.05  3714589.10
     6     78778  13128.73  3632892.25
     7     89507  12783.12  3544970.08
     8    103171  12895.06  3564180.90
     9    116306  12921.45  3568128.57
    10    128946  12894.02  3557625.15

2) rerun without rbd-cache, but add the following configs to the osds:
filestore_wbthrottle_xfs_ios_start_flusher = 10000
filestore_wbthrottle_xfs_ios_hard_limit = 10000
filestore_wbthrottle_btrfs_ios_start_flusher = 10000
filestore_wbthrottle_btrfs_ios_hard_limit = 10000

That didn't seem to change the results in any way on Dumpling.  I'm
running the OSD-storage on XFS, BTW.

3) rerun with the random pattern (rbd bench-write test --io-pattern=rand)

With Dumpling, it starts out high, then drops off:

# rbd bench-write test --io-pattern=rand
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern rand
   SEC       OPS   OPS/SEC   BYTES/SEC
     1       584    569.09  1700360.17
     2      1165    582.37  1687174.07
     3      1568    468.14  1364737.79
     4      1649    411.79  1197760.41
     5      1819    363.30  1061872.78
     6      2369    394.17  1146998.37
     7      2796    395.00  1153270.94
     8      2939    360.59  1050822.48
     9      3221    357.87  1040777.01
    10      3515    339.64  991027.17
    11      3834    347.46  1021930.90
    12      4206    350.18  1024435.83
    13      4424    337.30  981849.71
    14      4493    320.86  936308.80
    15      4734    315.29  921501.51
    16      5285    330.31  969201.95
    17      5436    319.68  936765.93
    18      6011    333.91  980649.27
    19      6376    334.35  983297.54
    20      6398    311.90  917323.37
    21      6602    314.36  926410.86
    22      7167    325.67  962430.15
    24      7543    310.73  917734.40
    25      7704    307.98  909611.56
    26      8279    318.30  938557.27
    27      8826    323.99  953434.67
    28      9296    331.87  977987.40
    29      9476    322.55  951696.77
    30      9628    320.92  946692.41

Same for Cuttlefish, but with significantly more performance over the
entire run:

# rbd bench-write test --io-pattern=rand
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern rand
   SEC       OPS   OPS/SEC   BYTES/SEC
     1       920    919.34  2464017.91
     2      1228    575.73  1597724.10
     3      2016    669.87  1835987.05
     4      2222    528.39  1443494.02
     5      2339    467.32  1268464.39
     6      3341    556.76  1532374.18
     7      3677    500.11  1385500.24
     8      3925    490.60  1361346.40
     9      4633    510.17  1430694.07
    10      5275    506.04  1412213.94
    11      5713    518.16  1449597.21
    12      6279    523.22  1466611.93
    13      7224    529.66  1485373.17
    14      7292    520.08  1461556.30
    15      8007    533.80  1498859.94
    16      8218    510.93  1436516.13
    17      8707    511.54  1439508.19
    18      9699    538.56  1513392.51
    19     10264    539.98  1522426.08
    20     10684    530.23  1496947.12
    21     10999    523.54  1478991.54
    22     11357    512.25  1450630.22
    23     12079    520.24  1475359.56
    24     12868    530.69  1504088.36
    25     13315    529.92  1504462.99
    26     13647    512.39  1451005.94
    27     14002    517.90  1467738.55
    28     14343    504.96  1430645.32
    29     14575    501.53  1419589.36
    30     15408    513.55  1452033.93

Weird that the config-options didn't change anything.  I had hoped to
try them on the production-cluster.  Anything I could try there?  Had to
revert the OSDs there to Cuttlefish again for now, as Dumpling is just
too slow.

    Regards,

       Oliver

I expect 1) to largely eliminate the difference (these are sequential
4k ios and should be coalesced into much bigger writes).  2) should
basically restore the same filestore behavior as in cuttlefish.  3)
should also show little difference since the io will be random.

In normal use, your vms should probably be using rbd caching anyway
(in addition to the write coalescing already occurring in the guest
os), so this may not be what you are seeing on your cluster.
-Sam

On Wed, Aug 21, 2013 at 6:00 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
You'd need to unmount the fs to actually clear the cache.  Did you see
a significant difference in load between the runs?  To confirm, the
rbd client is dumpling the entire time?
-Sam

On Wed, Aug 21, 2013 at 2:28 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
Hey Samuel,

I repeated the same test several times before my post and just now 2
more times.  It holds up and is also repeatable in reverse order, with
the same results.  Remember, I restart all OSDs between tests, so any
caches should get destroyed and besides, I'm writing.  That shouldn't
involve very large-scale caching, right?  I also waited for the cluster
to come to a healthy state after restarting the OSDs, so it's not
related to rebalancing or peering-activity, either.

    Regards,

       Oliver

On wo, 2013-08-21 at 14:07 -0700, Samuel Just wrote:
Try it again in the reverse order, I strongly suspect caching effects.
-Sam

On Wed, Aug 21, 2013 at 1:34 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
Hey Samuel,

Finally got it reproduced on my test-cluster, which was otherwise
unloaded at the time.  First, with Dumpling:

# rbd create --size 102400 test
# ceph-osd --version
ceph version 0.67.1-6-g0c4f2f3
(0c4f2f34b78b634efe7f4d56694e2edeeda5a130)
# rbd bench-write test
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
   SEC       OPS   OPS/SEC   BYTES/SEC
     1       204    200.67  592272.58
     2       401    200.24  607462.05
     3       625    204.62  639655.53
     4       824    205.20  636488.76
     5      1026    204.08  630587.77
     6      1217    202.69  628965.39
     7      1429    203.19  633671.84
     8      1626    202.93  635419.90
     9      1837    203.11  630421.40
    10      2045    204.16  631364.11
    11      2247    204.12  625833.91
    12      2456    204.29  629952.57
    13      2655    204.15  634639.37
    14      2888    206.15  637373.53
    15      3115    207.12  640826.18
    16      3330    207.75  640120.38
    17      3529    207.57  636277.18
    18      3750    208.24  635036.47
    19      3965    208.65  631118.13
    20      4216    210.75  638197.93
    21      4534    215.83  656890.46
    22      4809    218.50  666612.75
    23      5076    220.63  678678.36
    24      5320    221.39  686760.90
    25      5651    226.02  701345.26
    26      5967    229.44  706839.16
    27      6271    232.05  707958.62
    28      6522    232.82  709890.37
    29      6773    233.37  711718.21
    30      7151    238.23  722924.18

Then I replaced the OSD with Cuttlefish on all nodes and restarted it
everywhere.

# ceph-osd --version
ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
# rbd bench-write test
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
   SEC       OPS   OPS/SEC   BYTES/SEC
     1       556    555.63  1915665.58
     2      1095    545.08  1839139.42
     3      1696    564.45  1840316.22
     4      2312    577.25  1832641.60
     5      2870    573.46  1800558.64
     6      3461    576.81  1780993.33
     7      4062    579.26  1821246.36
     8      4678    581.51  1862003.10
     9      5269    585.12  1902219.78
    10      5883    587.11  1914700.43
    11      6471    587.59  1913208.96
    12      7022    585.12  1914375.79
    13      7568    580.15  1899650.34
    14      8122    579.87  1864572.53
    15      8659    576.18  1826927.83
    16      9259    576.81  1796653.40
    17      9826    577.60  1785328.57
    18     10401    576.30  1777479.51
    19     10962    576.94  1777431.02
    20     11399    569.84  1756633.76
    21     11937    567.45  1740147.43
    22     12491    567.47  1726288.46
    23     13109    569.92  1727347.20
    24     13629    567.64  1721996.79
    25     14153    566.11  1720782.06
    26     14669    563.23  1716275.34
    27     15191    562.61  1717677.74
    28     15768    562.49  1713059.35
    29     16281    561.40  1707289.23
    31     16719    536.96  1628452.43
^C

As you can see, the difference is quite dramatic and this confirms it's
probably something RBD-related.  The test-cluster is a standard 3-node
cluster with 3xMON, 3xMDS, 3xOSD, tied together with 2x1Gbit/s ethernet.
I replaced only the OSD-binary for these tests.  The rest of the
Ceph-installation is at Dumpling.

    Regards,

      Oliver

On wo, 2013-08-21 at 12:05 -0700, Samuel Just wrote:
There haven't been any significant osd side changes that I can think
of.  Is cpu usage still high?  If so, can you post the profiler
results again?
-Sam

On Wed, Aug 21, 2013 at 12:02 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
Hey Samuel,

I had a good run on the production-cluster with it and unfortunately, it
still doesn't seem to have solved the problem.  It seemed OK for a while
and individual OSD CPU-usage seemed quite low, but as the cluster's load
increased during the day, things got slower again.  Write-performance
within a VM crawled to 30MB/sec and at some point, I got only 10MB/sec
on reads in that same VM.  I also did RADOS bench-tests with `rados
--pool rbd bench 120 write' and those got several hundreds of MB's/sec
on the same cluster at the same time of day, so maybe the problem is
RBD-related.  Is there any code in the OSD that could influence
RBD-performance alone?  Do you know of any other significant changes to
the OSD between Cuttlefish and Dumpling that could result in this?

PS: I also did the same RADOS bench-tests on my test-cluster, both with
Cuttlefish and Dumpling without your fix and got almost identical
results.  This confirms that the problem might be in RBD, as Mark suggested.

     Regards,

        Oliver

On 20-08-13 19:40, Samuel Just wrote:
Can you try dumpling head without the option?
-Sam

On Tue, Aug 20, 2013 at 1:44 AM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
Hey Mark,

Sorry, but after some more tests I have to report that it only worked
partly.  The load seems lower with "wip-dumpling-pglog-undirty" in
place, but the Cuttlefish-osd still seems significantly faster and even
with "wip-dumpling-pglog-undirty" in place, things slow down way too
much under load.  Unfortunately, only my production-cluster seems busy
enough to actually show the problem clearly by slowing down.  Below is
`perf top'-output, fresh from my production-cluster under it's regular
load:

First, the 0.67.1-6-g0c4f2f3 osd with "osd debug pg log writeout =
false":
  16.53%  [kernel]                     [k]
intel_idle
   6.47%  libleveldb.so.1.9            [.]
0x380a1
   5.76%  [kernel]                     [k]
find_busiest_group
   4.11%  libc-2.11.3.so               [.]
memcmp
   3.95%  kvm                          [.]
0x1f6f31
   2.05%  [kernel]                     [k]
default_send_IPI_mask_sequence_phys
   2.03%  [kernel]                     [k]
_raw_spin_lock
   1.87%  libleveldb.so.1.9            [.]
leveldb::InternalKeyComparator::Compar
   1.57%  libc-2.11.3.so               [.]
memcpy
   1.37%  libleveldb.so.1.9            [.]
leveldb::Block::Iter::Next()
   1.26%  [kernel]                     [k]
hrtimer_interrupt
   1.12%  [kernel]                     [k]
__hrtimer_start_range_ns
   1.09%  [kernel]                     [k]
native_write_cr0
   1.05%  libstdc++.so.6.0.13          [.]
std::string::_M_mutate(unsigned long,
   1.00%  [kernel]                     [k]
native_write_msr_safe
   0.99%  [kernel]                     [k]
apic_timer_interrupt
   0.98%  [kernel]                     [k]
clockevents_program_event
   0.96%  [kernel]                     [k]
_raw_spin_unlock_irqrestore
   0.95%  ceph-osd                     [.]
PGLog::undirty()
   0.92%  [kernel]                     [k]
find_next_bit
   0.91%  libsnappy.so.1.1.2           [.]
snappy::RawUncompress(snappy::Source*,
   0.88%  [kernel]                     [k]
__schedule
   0.87%  [kernel]                     [k]
cpumask_next_and
   0.84%  [kernel]                     [k]
do_select
   0.80%  [kernel]                     [k]
fget_light
   0.77%  [kernel]                     [k]
reschedule_interrupt
   0.75%  [kernel]                     [k]
_raw_spin_lock_irqsave
   0.62%  libstdc++.so.6.0.13          [.] std::string::append(char
const*, unsig
   0.59%  [kvm_intel]                  [k]
vmx_vcpu_run
   0.58%  [kernel]                     [k]
copy_user_generic_string
   0.56%  [kernel]                     [k]
load_balance
   0.54%  [kernel]                     [k]
tg_load_down
   0.53%  libpthread-2.11.3.so         [.]
pthread_mutex_lock
   0.52%  [kernel]                     [k] sync_inodes_sb

Second, the 0.61.8 osd, under identical load:
  21.51%  [kernel]                     [k]
intel_idle
   6.66%  [kernel]                     [k]
find_busiest_group
   6.25%  kvm                          [.]
0x2d214b
   1.97%  [kernel]                     [k]
_raw_spin_lock
   1.47%  [kernel]                     [k]
native_write_msr_safe
   1.44%  [kernel]                     [k]
hrtimer_interrupt
   1.37%  [kernel]                     [k]
__hrtimer_start_range_ns
   1.34%  [kernel]                     [k]
do_select
   1.29%  [kernel]                     [k]
fget_light
   1.24%  [kernel]                     [k]
clockevents_program_event
   1.21%  [kernel]                     [k]
default_send_IPI_mask_sequence_phys
   1.18%  [kernel]                     [k]
cpumask_next_and
   1.18%  [kernel]                     [k]
_raw_spin_unlock_irqrestore
   1.15%  [kernel]                     [k]
find_next_bit
   1.14%  [kernel]                     [k]
__schedule
   1.11%  [kernel]                     [k]
_raw_spin_lock_irqsave
   0.98%  [kernel]                     [k]
apic_timer_interrupt
   0.86%  [kernel]                     [k]
copy_user_generic_string
   0.83%  [kernel]                     [k]
native_write_cr0
   0.76%  [kernel]                     [k]
sync_inodes_sb
   0.71%  [kernel]                     [k]
rcu_needs_cpu
   0.69%  libpthread-2.11.3.so         [.]
pthread_mutex_lock
   0.66%  [kernel]                     [k]
fput
   0.62%  [kernel]                     [k]
load_balance
   0.57%  [vdso]                       [.]
0x7fff3a976700
   0.56%  libc-2.11.3.so               [.]
memcpy
   0.56%  [kernel]                     [k]
reschedule_interrupt
   0.56%  [kernel]                     [k]
tg_load_down
   0.50%  [kernel]                     [k] iput

I see lots of new differences, but again don't know what to make of it
and what might be related or significant.  LevelDB seems to jump out
this time, amongst others.  Let me know if you need more info.

    Regards,

      Oliver

On ma, 2013-08-19 at 15:21 -0500, Mark Nelson wrote:
Hi Oliver,

Glad that helped!  How much more efficient do the cuttlefish OSDs seem
at this point (with wip-dumpling-pglog-undirty)?  On modern Intel
platforms we were actually hoping to see CPU usage go down in many cases
due to the use of hardware CRC32 instructions.

Mark

On 08/19/2013 03:06 PM, Oliver Daudey wrote:
Hey Samuel,

Thanks!  I installed your version, repeated the same tests on my
test-cluster and the extra CPU-loading seems to have disappeared.  Then
I replaced one osd of my production-cluster with your modified version
and it's config-option and it seems to be a lot less CPU-hungry now.
Although the Cuttlefish-osds still seem to be even more CPU-efficient,
your changes have definitely helped a lot.  We seem to be looking in the
right direction, at least for this part of the problem.

BTW, I ran `perf top' on the production-node with your modified osd and
didn't see anything osd-related stand out on top.  "PGLog::undirty()"
was in there, but with much lower usage, right at the bottom of the
green part of the output.

Many thanks for your help so far!

     Regards,

       Oliver

On ma, 2013-08-19 at 00:29 -0700, Samuel Just wrote:
You're right, PGLog::undirty() looks suspicious.  I just pushed a
branch wip-dumpling-pglog-undirty with a new config
(osd_debug_pg_log_writeout) which if set to false will disable some
strictly debugging checks which occur in PGLog::undirty().  We haven't
actually seen these checks causing excessive cpu usage, so this may be
a red herring.
-Sam

On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
Hey Mark,

On za, 2013-08-17 at 08:16 -0500, Mark Nelson wrote:
On 08/17/2013 06:13 AM, Oliver Daudey wrote:
Hey all,

This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040) I
created in the tracker.  Thought I would pass it through the list as
well, to get an idea if anyone else is running into it.  It may only
show under higher loads.  More info about my setup is in the bug-report
above.  Here goes:

I'm running a Ceph-cluster with 3 nodes, each of which runs a mon, osd
and mds. I'm using RBD on this cluster as storage for KVM, CephFS is
unused at this time. While still on v0.61.7 Cuttlefish, I got 70-100
+MB/sec on simple linear writes to a file with `dd' inside a VM on this
cluster under regular load and the osds usually averaged 20-100%
CPU-utilisation in `top'. After the upgrade to Dumpling, CPU-usage for
the osds shot up to 100% to 400% in `top' (multi-core system) and the
speed for my writes with `dd' inside a VM dropped to 20-40MB/sec. Users
complained that disk-access inside the VMs was significantly slower and
the backups of the RBD-store I was running, also got behind quickly.

After downgrading only the osds to v0.61.7 Cuttlefish and leaving the
rest at 0.67 Dumpling, speed and load returned to normal. I have
repeated this performance-hit upon upgrade on a similar test-cluster
under no additional load at all. Although CPU-usage for the osds wasn't
as dramatic during these tests because there was no base-load from other
VMs, I/O-performance dropped significantly after upgrading during these
tests as well, and returned to normal after downgrading the osds.

I'm not sure what to make of it. There are no visible errors in the logs
and everything runs and reports good health, it's just a lot slower,
with a lot more CPU-usage.

Hi Oliver,

If you have access to the perf command on this system, could you try
running:

"sudo perf top"

And if that doesn't give you much,

"sudo perf record -g"

then:

"sudo perf report | less"

during the period of high CPU usage?  This will give you a call graph.
There may be symbols missing, but it might help track down what the OSDs
are doing.

Thanks for your help!  I did a couple of runs on my test-cluster,
loading it with writes from 3 VMs concurrently and measuring the results
at the first node with all 0.67 Dumpling-components and with the osds
replaced by 0.61.7 Cuttlefish.  I let `perf top' run and settle for a
while, then copied anything that showed in red and green into this post.
Here are the results (sorry for the word-wraps):

First, with 0.61.7 osds:

   19.91%  [kernel]                    [k] intel_idle
   10.18%  [kernel]                    [k] _raw_spin_lock_irqsave
    6.79%  ceph-osd                    [.] ceph_crc32c_le
    4.93%  [kernel]                    [k]
default_send_IPI_mask_sequence_phys
    2.71%  [kernel]                    [k] copy_user_generic_string
    1.42%  libc-2.11.3.so              [.] memcpy
    1.23%  [kernel]                    [k] find_busiest_group
    1.13%  librados.so.2.0.0           [.] ceph_crc32c_le_intel
    1.11%  [kernel]                    [k] _raw_spin_lock
    0.99%  kvm                         [.] 0x1931f8
    0.92%  [igb]                       [k] igb_poll
    0.87%  [kernel]                    [k] native_write_cr0
    0.80%  [kernel]                    [k] csum_partial
    0.78%  [kernel]                    [k] __do_softirq
    0.63%  [kernel]                    [k] hpet_legacy_next_event
    0.53%  [ip_tables]                 [k] ipt_do_table
    0.50%  libc-2.11.3.so              [.] 0x74433

Second test, with 0.67 osds:

   18.32%  [kernel]                      [k] intel_idle
    7.58%  [kernel]                      [k] _raw_spin_lock_irqsave
    7.04%  ceph-osd                      [.] PGLog::undirty()
    4.39%  ceph-osd                      [.] ceph_crc32c_le_intel
    3.92%  [kernel]                      [k]
default_send_IPI_mask_sequence_phys
    2.25%  [kernel]                      [k] copy_user_generic_string
    1.76%  libc-2.11.3.so                [.] memcpy
    1.56%  librados.so.2.0.0             [.] ceph_crc32c_le_intel
    1.40%  libc-2.11.3.so                [.] vfprintf
    1.12%  libc-2.11.3.so                [.] 0x7217b
    1.05%  [kernel]                      [k] _raw_spin_lock
    1.01%  [kernel]                      [k] find_busiest_group
    0.83%  kvm                           [.] 0x193ab8
    0.80%  [kernel]                      [k] native_write_cr0
    0.76%  [kernel]                      [k] __do_softirq
    0.73%  libc-2.11.3.so                [.] _IO_default_xsputn
    0.70%  [kernel]                      [k] csum_partial
    0.68%  [igb]                         [k] igb_poll
    0.58%  [kernel]                      [k] hpet_legacy_next_event
    0.54%  [kernel]                      [k] __schedule

What jumps right out, is the "PGLog::undirty()", which doesn't show up
with 0.61.7 at all, but is an extra drag right at top-usage in 0.67.
Note that I didn't manage to fully load the test-cluster CPU-wise,
because of network-constraints and I don't want to take any extra risks
on the production-cluster and test it there, but it seems we found a
possible culprit.

Any ideas?  Thanks again!

     Regards,

        Oliver

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com