Re: Significant slowdown of osds since v0.67 Dumpling

Hey Sage,

I'm all for it and will help testing.


   Regards,

      Oliver

On 22-08-13 17:23, Sage Weil wrote:
> We should perhaps hack the old (cuttlefish and earlier) flushing behavior 
> into the new code so that we can confirm that it is really the writeback 
> that is causing the problem and not something else...
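> 
> In the meantime, a config-only approximation might be to push every
> wbthrottle limit far enough out that the throttle effectively never kicks
> in.  Something like the following in ceph.conf (values arbitrary, and note
> the bytes limits as well as the ios/inodes ones):
> 
>   [osd]
>       filestore wbthrottle xfs ios start flusher = 10000000
>       filestore wbthrottle xfs ios hard limit = 10000000
>       filestore wbthrottle xfs inodes start flusher = 10000000
>       filestore wbthrottle xfs inodes hard limit = 10000000
>       filestore wbthrottle xfs bytes start flusher = 1000000000000
>       filestore wbthrottle xfs bytes hard limit = 1000000000000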
> 
> sage
> 
> On Thu, 22 Aug 2013, Oliver Daudey wrote:
> 
>> Hey Samuel,
>>
>> On wo, 2013-08-21 at 20:27 -0700, Samuel Just wrote:
>>> I think the rbd cache one you'd need to run for a few minutes to get
>>> meaningful results.  It should stabilize somewhere around the actual
>>> throughput of your hardware.
>>
>> Ok, I now also ran this test on Cuttlefish as well as Dumpling.
>>
>> Cuttlefish:
>> # rbd bench-write test --rbd-cache
>> bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
>>   SEC       OPS   OPS/SEC   BYTES/SEC
>>     1     13265  13252.45  3466029.67
>>     2     25956  12975.60  3589479.95
>>     3     38475  12818.61  3598590.70
>>     4     50184  12545.16  3530516.34
>>     5     59263  11852.22  3292258.13
>> <...>
>>   300   3421530  11405.08  3191555.35
>>   301   3430755  11397.83  3189251.09
>>   302   3443345  11401.73  3190694.98
>>   303   3455230  11403.37  3191478.97
>>   304   3467014  11404.62  3192136.82
>>   305   3475355  11394.57  3189525.71
>>   306   3488067  11398.90  3190553.96
>>   307   3499789  11399.96  3190770.21
>>   308   3510566  11397.93  3190289.49
>>   309   3519829  11390.98  3188620.93
>>   310   3532539  11395.25  3189544.03
>>
>> Dumpling:
>> # rbd bench-write test --rbd-cache
>> bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
>>   SEC       OPS   OPS/SEC   BYTES/SEC
>>     1     13201  13194.63  3353004.50
>>     2     25926  12957.05  3379695.03
>>     3     36624  12206.06  3182087.11
>>     4     46547  11635.35  3035794.95
>>     5     59290  11856.27  3090389.79
>> <...>
>>   300   3405215  11350.66  3130092.00
>>   301   3417789  11354.76  3131106.34
>>   302   3430067  11357.83  3131933.41
>>   303   3438792  11349.14  3129734.88
>>   304   3450237  11349.45  3129689.62
>>   305   3462840  11353.53  3130406.43
>>   306   3473151  11350.17  3128942.32
>>   307   3482327  11343.00  3126771.34
>>   308   3495020  11347.44  3127502.07
>>   309   3506894  11349.13  3127781.70
>>   310   3516532  11343.65  3126714.62
>>
>> As you can see, the result is virtually identical.  What jumps out
>> during the cached tests is that the CPU used by the OSDs is negligible
>> in both cases, while without caching, the OSDs come under considerable load.
>> Perhaps the cache masks the problem we're seeing in Dumpling somehow?
>> And I'm not changing anything but the OSD-binary during my tests, so
>> cache-settings used in VMs are identical in both scenarios.
>>
>>>
>>> Hmm, 10k ios I guess is only 10 rbd chunks.  What replication level
>>> are you using?  Try setting them to 10000000 (you only need to set the
>>> xfs ones).
>>>
>>> For the rand test, try increasing
>>> filestore_wbthrottle_xfs_inodes_hard_limit and
>>> filestore_wbthrottle_xfs_inodes_start_flusher to 10000 as well as
>>> setting the above ios limits.
>>
>> Ok, my current config:
>>         filestore wbthrottle xfs ios start flusher = 10000000
>>         filestore wbthrottle xfs ios hard limit = 10000000
>>         filestore wbthrottle xfs inodes hard limit = 10000
>>         filestore wbthrottle xfs inodes start flusher = 10000
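>>
>> (For completeness: the same overrides can presumably also be injected
>> into a running OSD instead of editing ceph.conf and restarting, though
>> I haven't verified that the throttle picks them up live:)
>>
>> # ceph tell osd.0 injectargs '--filestore_wbthrottle_xfs_ios_hard_limit 10000000 --filestore_wbthrottle_xfs_ios_start_flusher 10000000'
>> # ceph tell osd.0 injectargs '--filestore_wbthrottle_xfs_inodes_hard_limit 10000 --filestore_wbthrottle_xfs_inodes_start_flusher 10000'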
>>
>> Unfortunately, that still makes no difference at all in the original
>> standard-tests.
>>
>> Random IO on Dumpling, after 120 secs of runtime:
>> # rbd bench-write test --io-pattern=rand
>> bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern rand
>>   SEC       OPS   OPS/SEC   BYTES/SEC
>>     1       545    534.98  1515804.02
>>     2      1162    580.80  1662416.60
>>     3      1731    576.52  1662966.61
>>     4      2317    579.04  1695129.94
>>     5      2817    562.56  1672754.87
>> <...>
>>   120     43564    362.91  1080512.00
>>   121     43774    361.76  1077368.28
>>   122     44419    364.06  1083894.31
>>   123     45046    366.22  1090518.68
>>   124     45287    364.01  1084437.37
>>   125     45334    361.54  1077035.12
>>   126     45336    359.40  1070678.36
>>   127     45797    360.60  1073985.78
>>   128     46388    362.40  1080056.75
>>   129     46984    364.21  1086068.63
>>   130     47604    366.11  1092712.51
>>
>> Random IO on Cuttlefish, after 120 secs of runtime:
>> rbd bench-write test --io-pattern=rand
>> bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern rand
>>   SEC       OPS   OPS/SEC   BYTES/SEC
>>     1      1066   1065.54  3115713.13
>>     2      2099   1049.31  2936300.53
>>     3      3218   1072.32  3028707.50
>>     4      4026   1003.23  2807859.15
>>     5      4272    793.80  2226962.63
>> <...>
>>   120     66935    557.79  1612483.74
>>   121     68011    562.01  1625419.34
>>   122     68428    558.59  1615376.62
>>   123     68579    557.06  1610780.38
>>   125     68777    549.73  1589816.94
>>   126     69745    553.52  1601671.46
>>   127     70855    557.91  1614293.12
>>   128     71962    562.20  1627070.81
>>   129     72529    562.22  1627120.59
>>   130     73146    562.66  1628818.79
>>
>> Confirming your settings took effect:
>> # ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep
>> wbthrottle
>>   "filestore_wbthrottle_btrfs_bytes_start_flusher": "41943040",
>>   "filestore_wbthrottle_btrfs_bytes_hard_limit": "419430400",
>>   "filestore_wbthrottle_btrfs_ios_start_flusher": "500",
>>   "filestore_wbthrottle_btrfs_ios_hard_limit": "5000",
>>   "filestore_wbthrottle_btrfs_inodes_start_flusher": "500",
>>   "filestore_wbthrottle_xfs_bytes_start_flusher": "41943040",
>>   "filestore_wbthrottle_xfs_bytes_hard_limit": "419430400",
>>   "filestore_wbthrottle_xfs_ios_start_flusher": "10000000",
>>   "filestore_wbthrottle_xfs_ios_hard_limit": "10000000",
>>   "filestore_wbthrottle_xfs_inodes_start_flusher": "10000",
>>   "filestore_wbthrottle_btrfs_inodes_hard_limit": "5000",
>>   "filestore_wbthrottle_xfs_inodes_hard_limit": "10000",
>>
>> As you can see, your suggested settings made little difference once
>> things settled, and in the end there is still a huge difference
>> in performance between Cuttlefish and Dumpling.  Caching seems to mask
>> the problem, though, at least with the type of IO generated by these
>> tests.  It also reduces CPU-usage by the OSDs to almost none.
>>
>>
>>    Regards,
>>
>>       Oliver
>>
>>> -Sam
>>>
>>> On Wed, Aug 21, 2013 at 8:13 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
>>>> Hey Samuel,
>>>>
>>>> On wo, 2013-08-21 at 18:33 -0700, Samuel Just wrote:
>>>>> I am dumb.  There *has* been a change in the osd which can account for
>>>>> this: the wbthrottle limits.  We added some logic to force the kernel
>>>>> to start flushing writes out earlier, normally a good thing.  In this
>>>>> case, it's probably doing an fsync every 500 writes.
>>>>>
>>>>> Can you run 3 tests?
>>>>> 1) rerun with --rbd-cache (rbd bench-write test --rbd-cache)
>>>>
>>>> That flies, with very little load on the OSDs.  That load is much higher
>>>> without caching.  This test was on Dumpling.  You get the idea.
>>>>
>>>> # rbd bench-write test --rbd-cache
>>>> bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
>>>>   SEC       OPS   OPS/SEC   BYTES/SEC
>>>>     1     13832  13817.17  3678357.12
>>>>     2     27590  13793.28  3747373.45
>>>>     3     41144  13708.02  3781505.90
>>>>     4     54510  13622.96  3744543.19
>>>>     5     67249  13448.05  3714589.10
>>>>     6     78778  13128.73  3632892.25
>>>>     7     89507  12783.12  3544970.08
>>>>     8    103171  12895.06  3564180.90
>>>>     9    116306  12921.45  3568128.57
>>>>    10    128946  12894.02  3557625.15
>>>>
>>>>> 2) rerun without rbd-cache, but add the following configs to the osds:
>>>>> filestore_wbthrottle_xfs_ios_start_flusher = 10000
>>>>> filestore_wbthrottle_xfs_ios_hard_limit = 10000
>>>>> filestore_wbthrottle_btrfs_ios_start_flusher = 10000
>>>>> filestore_wbthrottle_btrfs_ios_hard_limit = 10000
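>>>>>
>>>>> (In ceph.conf that would presumably look like the following, under the
>>>>> osd section, with the osds restarted afterwards:)
>>>>>
>>>>> [osd]
>>>>>     filestore wbthrottle xfs ios start flusher = 10000
>>>>>     filestore wbthrottle xfs ios hard limit = 10000
>>>>>     filestore wbthrottle btrfs ios start flusher = 10000
>>>>>     filestore wbthrottle btrfs ios hard limit = 10000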
>>>>
>>>> That didn't seem to change the results in any way on Dumpling.  I'm
>>>> running the OSD-storage on XFS, BTW.
>>>>
>>>>> 3) rerun with the random pattern (rbd bench-write test --io-pattern=rand)
>>>>
>>>> With Dumpling, it starts out high, then drops off:
>>>>
>>>> # rbd bench-write test --io-pattern=rand
>>>> bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern rand
>>>>   SEC       OPS   OPS/SEC   BYTES/SEC
>>>>     1       584    569.09  1700360.17
>>>>     2      1165    582.37  1687174.07
>>>>     3      1568    468.14  1364737.79
>>>>     4      1649    411.79  1197760.41
>>>>     5      1819    363.30  1061872.78
>>>>     6      2369    394.17  1146998.37
>>>>     7      2796    395.00  1153270.94
>>>>     8      2939    360.59  1050822.48
>>>>     9      3221    357.87  1040777.01
>>>>    10      3515    339.64  991027.17
>>>>    11      3834    347.46  1021930.90
>>>>    12      4206    350.18  1024435.83
>>>>    13      4424    337.30  981849.71
>>>>    14      4493    320.86  936308.80
>>>>    15      4734    315.29  921501.51
>>>>    16      5285    330.31  969201.95
>>>>    17      5436    319.68  936765.93
>>>>    18      6011    333.91  980649.27
>>>>    19      6376    334.35  983297.54
>>>>    20      6398    311.90  917323.37
>>>>    21      6602    314.36  926410.86
>>>>    22      7167    325.67  962430.15
>>>>    24      7543    310.73  917734.40
>>>>    25      7704    307.98  909611.56
>>>>    26      8279    318.30  938557.27
>>>>    27      8826    323.99  953434.67
>>>>    28      9296    331.87  977987.40
>>>>    29      9476    322.55  951696.77
>>>>    30      9628    320.92  946692.41
>>>>
>>>> Same for Cuttlefish, but with significantly higher performance over the
>>>> entire run:
>>>>
>>>> # rbd bench-write test --io-pattern=rand
>>>> bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern rand
>>>>   SEC       OPS   OPS/SEC   BYTES/SEC
>>>>     1       920    919.34  2464017.91
>>>>     2      1228    575.73  1597724.10
>>>>     3      2016    669.87  1835987.05
>>>>     4      2222    528.39  1443494.02
>>>>     5      2339    467.32  1268464.39
>>>>     6      3341    556.76  1532374.18
>>>>     7      3677    500.11  1385500.24
>>>>     8      3925    490.60  1361346.40
>>>>     9      4633    510.17  1430694.07
>>>>    10      5275    506.04  1412213.94
>>>>    11      5713    518.16  1449597.21
>>>>    12      6279    523.22  1466611.93
>>>>    13      7224    529.66  1485373.17
>>>>    14      7292    520.08  1461556.30
>>>>    15      8007    533.80  1498859.94
>>>>    16      8218    510.93  1436516.13
>>>>    17      8707    511.54  1439508.19
>>>>    18      9699    538.56  1513392.51
>>>>    19     10264    539.98  1522426.08
>>>>    20     10684    530.23  1496947.12
>>>>    21     10999    523.54  1478991.54
>>>>    22     11357    512.25  1450630.22
>>>>    23     12079    520.24  1475359.56
>>>>    24     12868    530.69  1504088.36
>>>>    25     13315    529.92  1504462.99
>>>>    26     13647    512.39  1451005.94
>>>>    27     14002    517.90  1467738.55
>>>>    28     14343    504.96  1430645.32
>>>>    29     14575    501.53  1419589.36
>>>>    30     15408    513.55  1452033.93
>>>>
>>>> Weird that the config-options didn't change anything.  I had hoped to
>>>> try them on the production-cluster.  Anything I could try there?  Had to
>>>> revert the OSDs there to Cuttlefish again for now, as Dumpling is just
>>>> too slow.
>>>>
>>>>
>>>>    Regards,
>>>>
>>>>       Oliver
>>>>>
>>>>> I expect 1) to largely eliminate the difference (these are sequential
>>>>> 4k ios and should be coalesced into much bigger writes).  2) should
>>>>> basically restore the same filestore behavior as in cuttlefish.  3)
>>>>> should also show little difference since the io will be random.
>>>>>
>>>>> In normal use, your vms should probably be using rbd caching anyway
>>>>> (in addition to the write coalescing already occurring in the guest
>>>>> os), so this may not be what you are seeing on your cluster.
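>>>>>
>>>>> (For qemu/librbd guests that would typically mean cache=writeback on
>>>>> the qemu drive, or something like the following in the client-side
>>>>> ceph.conf; the size/max-dirty values here are just the defaults
>>>>> written out explicitly:)
>>>>>
>>>>> [client]
>>>>>     rbd cache = true
>>>>>     rbd cache size = 33554432
>>>>>     rbd cache max dirty = 25165824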
>>>>> -Sam
>>>>>
>>>>> On Wed, Aug 21, 2013 at 6:00 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>>> You'd need to unmount the fs to actually clear the cache.  Did you see
>>>>>> a significant difference in load between the runs?  To confirm, the
>>>>>> rbd client is running Dumpling the entire time?
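>>>>>>
>>>>>> (Per osd, that would be something along these lines between runs;
>>>>>> device and mountpoint assumed here:)
>>>>>>
>>>>>> # service ceph stop osd.0
>>>>>> # umount /var/lib/ceph/osd/ceph-0
>>>>>> # mount /dev/sdb1 /var/lib/ceph/osd/ceph-0
>>>>>> # service ceph start osd.0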
>>>>>> -Sam
>>>>>>
>>>>>> On Wed, Aug 21, 2013 at 2:28 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
>>>>>>> Hey Samuel,
>>>>>>>
>>>>>>> I repeated the same test several times before my post and just now 2
>>>>>>> more times.  It holds up and is also repeatable in reverse order, with
>>>>>>> the same results.  Remember, I restart all OSDs between tests, so any
>>>>>>> caches should get destroyed and besides, I'm writing.  That shouldn't
>>>>>>> involve very large-scale caching, right?  I also waited for the cluster
>>>>>>> to come to a healthy state after restarting the OSDs, so it's not
>>>>>>> related to rebalancing or peering-activity, either.
>>>>>>>
>>>>>>>
>>>>>>>    Regards,
>>>>>>>
>>>>>>>       Oliver
>>>>>>>
>>>>>>> On wo, 2013-08-21 at 14:07 -0700, Samuel Just wrote:
>>>>>>>> Try it again in the reverse order, I strongly suspect caching effects.
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Wed, Aug 21, 2013 at 1:34 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
>>>>>>>>> Hey Samuel,
>>>>>>>>>
>>>>>>>>> Finally got it reproduced on my test-cluster, which was otherwise
>>>>>>>>> unloaded at the time.  First, with Dumpling:
>>>>>>>>>
>>>>>>>>> # rbd create --size 102400 test
>>>>>>>>> # ceph-osd --version
>>>>>>>>> ceph version 0.67.1-6-g0c4f2f3
>>>>>>>>> (0c4f2f34b78b634efe7f4d56694e2edeeda5a130)
>>>>>>>>> # rbd bench-write test
>>>>>>>>> bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
>>>>>>>>>   SEC       OPS   OPS/SEC   BYTES/SEC
>>>>>>>>>     1       204    200.67  592272.58
>>>>>>>>>     2       401    200.24  607462.05
>>>>>>>>>     3       625    204.62  639655.53
>>>>>>>>>     4       824    205.20  636488.76
>>>>>>>>>     5      1026    204.08  630587.77
>>>>>>>>>     6      1217    202.69  628965.39
>>>>>>>>>     7      1429    203.19  633671.84
>>>>>>>>>     8      1626    202.93  635419.90
>>>>>>>>>     9      1837    203.11  630421.40
>>>>>>>>>    10      2045    204.16  631364.11
>>>>>>>>>    11      2247    204.12  625833.91
>>>>>>>>>    12      2456    204.29  629952.57
>>>>>>>>>    13      2655    204.15  634639.37
>>>>>>>>>    14      2888    206.15  637373.53
>>>>>>>>>    15      3115    207.12  640826.18
>>>>>>>>>    16      3330    207.75  640120.38
>>>>>>>>>    17      3529    207.57  636277.18
>>>>>>>>>    18      3750    208.24  635036.47
>>>>>>>>>    19      3965    208.65  631118.13
>>>>>>>>>    20      4216    210.75  638197.93
>>>>>>>>>    21      4534    215.83  656890.46
>>>>>>>>>    22      4809    218.50  666612.75
>>>>>>>>>    23      5076    220.63  678678.36
>>>>>>>>>    24      5320    221.39  686760.90
>>>>>>>>>    25      5651    226.02  701345.26
>>>>>>>>>    26      5967    229.44  706839.16
>>>>>>>>>    27      6271    232.05  707958.62
>>>>>>>>>    28      6522    232.82  709890.37
>>>>>>>>>    29      6773    233.37  711718.21
>>>>>>>>>    30      7151    238.23  722924.18
>>>>>>>>>
>>>>>>>>> Then I replaced the OSD-binary with the Cuttlefish version on all nodes
>>>>>>>>> and restarted the OSDs everywhere.
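>>>>>>>>>
>>>>>>>>> (Roughly, on each node, with the Cuttlefish ceph-osd binary saved
>>>>>>>>> aside beforehand; paths and init-script usage assumed:)
>>>>>>>>>
>>>>>>>>> # service ceph stop osd.0
>>>>>>>>> # cp /root/ceph-osd-0.61.7 /usr/bin/ceph-osd
>>>>>>>>> # service ceph start osd.0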
>>>>>>>>>
>>>>>>>>> # ceph-osd --version
>>>>>>>>> ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
>>>>>>>>> # rbd bench-write test
>>>>>>>>> bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
>>>>>>>>>   SEC       OPS   OPS/SEC   BYTES/SEC
>>>>>>>>>     1       556    555.63  1915665.58
>>>>>>>>>     2      1095    545.08  1839139.42
>>>>>>>>>     3      1696    564.45  1840316.22
>>>>>>>>>     4      2312    577.25  1832641.60
>>>>>>>>>     5      2870    573.46  1800558.64
>>>>>>>>>     6      3461    576.81  1780993.33
>>>>>>>>>     7      4062    579.26  1821246.36
>>>>>>>>>     8      4678    581.51  1862003.10
>>>>>>>>>     9      5269    585.12  1902219.78
>>>>>>>>>    10      5883    587.11  1914700.43
>>>>>>>>>    11      6471    587.59  1913208.96
>>>>>>>>>    12      7022    585.12  1914375.79
>>>>>>>>>    13      7568    580.15  1899650.34
>>>>>>>>>    14      8122    579.87  1864572.53
>>>>>>>>>    15      8659    576.18  1826927.83
>>>>>>>>>    16      9259    576.81  1796653.40
>>>>>>>>>    17      9826    577.60  1785328.57
>>>>>>>>>    18     10401    576.30  1777479.51
>>>>>>>>>    19     10962    576.94  1777431.02
>>>>>>>>>    20     11399    569.84  1756633.76
>>>>>>>>>    21     11937    567.45  1740147.43
>>>>>>>>>    22     12491    567.47  1726288.46
>>>>>>>>>    23     13109    569.92  1727347.20
>>>>>>>>>    24     13629    567.64  1721996.79
>>>>>>>>>    25     14153    566.11  1720782.06
>>>>>>>>>    26     14669    563.23  1716275.34
>>>>>>>>>    27     15191    562.61  1717677.74
>>>>>>>>>    28     15768    562.49  1713059.35
>>>>>>>>>    29     16281    561.40  1707289.23
>>>>>>>>>    31     16719    536.96  1628452.43
>>>>>>>>> ^C
>>>>>>>>>
>>>>>>>>> As you can see, the difference is quite dramatic and this confirms it's
>>>>>>>>> probably something RBD-related.  The test-cluster is a standard 3-node
>>>>>>>>> cluster with 3xMON, 3xMDS, 3xOSD, tied together with 2x1Gbit/s ethernet.
>>>>>>>>> I replaced only the OSD-binary for these tests.  The rest of the
>>>>>>>>> Ceph-installation is at Dumpling.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    Regards,
>>>>>>>>>
>>>>>>>>>      Oliver
>>>>>>>>>
>>>>>>>>> On wo, 2013-08-21 at 12:05 -0700, Samuel Just wrote:
>>>>>>>>>> There haven't been any significant osd side changes that I can think
>>>>>>>>>> of.  Is cpu usage still high?  If so, can you post the profiler
>>>>>>>>>> results again?
>>>>>>>>>> -Sam
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 21, 2013 at 12:02 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
>>>>>>>>>>> Hey Samuel,
>>>>>>>>>>>
>>>>>>>>>>> I had a good run on the production-cluster with it and unfortunately, it
>>>>>>>>>>> still doesn't seem to have solved the problem.  It seemed OK for a while
>>>>>>>>>>> and individual OSD CPU-usage seemed quite low, but as the cluster's load
>>>>>>>>>>> increased during the day, things got slower again.  Write-performance
>>>>>>>>>>> within a VM crawled to 30MB/sec and at some point, I got only 10MB/sec
>>>>>>>>>>> on reads in that same VM.  I also did RADOS bench-tests with `rados
>>>>>>>>>>> --pool rbd bench 120 write' and those got several hundred MB/sec
>>>>>>>>>>> on the same cluster at the same time of day, so maybe the problem is
>>>>>>>>>>> RBD-related.  Is there any code in the OSD that could influence
>>>>>>>>>>> RBD-performance alone?  Do you know of any other significant changes to
>>>>>>>>>>> the OSD between Cuttlefish and Dumpling that could result in this?
>>>>>>>>>>>
>>>>>>>>>>> PS: I also did the same RADOS bench-tests on my test-cluster, both with
>>>>>>>>>>> Cuttlefish and Dumpling without your fix and got almost identical
>>>>>>>>>>> results.  This supports Mark's suggestion that the problem might be in RBD.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>     Regards,
>>>>>>>>>>>
>>>>>>>>>>>        Oliver
>>>>>>>>>>>
>>>>>>>>>>> On 20-08-13 19:40, Samuel Just wrote:
>>>>>>>>>>>> Can you try dumpling head without the option?
>>>>>>>>>>>> -Sam
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Aug 20, 2013 at 1:44 AM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
>>>>>>>>>>>>> Hey Mark,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry, but after some more tests I have to report that it only worked
>>>>>>>>>>>>> partly.  The load seems lower with "wip-dumpling-pglog-undirty" in
>>>>>>>>>>>>> place, but the Cuttlefish-osd still seems significantly faster, and
>>>>>>>>>>>>> even with the branch in place, things still slow down far too much
>>>>>>>>>>>>> under load.  Unfortunately, only my production-cluster seems busy
>>>>>>>>>>>>> enough to show the problem clearly by slowing down.  Below is
>>>>>>>>>>>>> `perf top'-output, fresh from my production-cluster under its regular
>>>>>>>>>>>>> load:
>>>>>>>>>>>>>
>>>>>>>>>>>>> First, the 0.67.1-6-g0c4f2f3 osd with "osd debug pg log writeout = false":
>>>>>>>>>>>>>  16.53%  [kernel]               [k] intel_idle
>>>>>>>>>>>>>   6.47%  libleveldb.so.1.9      [.] 0x380a1
>>>>>>>>>>>>>   5.76%  [kernel]               [k] find_busiest_group
>>>>>>>>>>>>>   4.11%  libc-2.11.3.so         [.] memcmp
>>>>>>>>>>>>>   3.95%  kvm                    [.] 0x1f6f31
>>>>>>>>>>>>>   2.05%  [kernel]               [k] default_send_IPI_mask_sequence_phys
>>>>>>>>>>>>>   2.03%  [kernel]               [k] _raw_spin_lock
>>>>>>>>>>>>>   1.87%  libleveldb.so.1.9      [.] leveldb::InternalKeyComparator::Compar
>>>>>>>>>>>>>   1.57%  libc-2.11.3.so         [.] memcpy
>>>>>>>>>>>>>   1.37%  libleveldb.so.1.9      [.] leveldb::Block::Iter::Next()
>>>>>>>>>>>>>   1.26%  [kernel]               [k] hrtimer_interrupt
>>>>>>>>>>>>>   1.12%  [kernel]               [k] __hrtimer_start_range_ns
>>>>>>>>>>>>>   1.09%  [kernel]               [k] native_write_cr0
>>>>>>>>>>>>>   1.05%  libstdc++.so.6.0.13    [.] std::string::_M_mutate(unsigned long,
>>>>>>>>>>>>>   1.00%  [kernel]               [k] native_write_msr_safe
>>>>>>>>>>>>>   0.99%  [kernel]               [k] apic_timer_interrupt
>>>>>>>>>>>>>   0.98%  [kernel]               [k] clockevents_program_event
>>>>>>>>>>>>>   0.96%  [kernel]               [k] _raw_spin_unlock_irqrestore
>>>>>>>>>>>>>   0.95%  ceph-osd               [.] PGLog::undirty()
>>>>>>>>>>>>>   0.92%  [kernel]               [k] find_next_bit
>>>>>>>>>>>>>   0.91%  libsnappy.so.1.1.2     [.] snappy::RawUncompress(snappy::Source*,
>>>>>>>>>>>>>   0.88%  [kernel]               [k] __schedule
>>>>>>>>>>>>>   0.87%  [kernel]               [k] cpumask_next_and
>>>>>>>>>>>>>   0.84%  [kernel]               [k] do_select
>>>>>>>>>>>>>   0.80%  [kernel]               [k] fget_light
>>>>>>>>>>>>>   0.77%  [kernel]               [k] reschedule_interrupt
>>>>>>>>>>>>>   0.75%  [kernel]               [k] _raw_spin_lock_irqsave
>>>>>>>>>>>>>   0.62%  libstdc++.so.6.0.13    [.] std::string::append(char const*, unsig
>>>>>>>>>>>>>   0.59%  [kvm_intel]            [k] vmx_vcpu_run
>>>>>>>>>>>>>   0.58%  [kernel]               [k] copy_user_generic_string
>>>>>>>>>>>>>   0.56%  [kernel]               [k] load_balance
>>>>>>>>>>>>>   0.54%  [kernel]               [k] tg_load_down
>>>>>>>>>>>>>   0.53%  libpthread-2.11.3.so   [.] pthread_mutex_lock
>>>>>>>>>>>>>   0.52%  [kernel]               [k] sync_inodes_sb
>>>>>>>>>>>>>
>>>>>>>>>>>>> Second, the 0.61.8 osd, under identical load:
>>>>>>>>>>>>>  21.51%  [kernel]               [k] intel_idle
>>>>>>>>>>>>>   6.66%  [kernel]               [k] find_busiest_group
>>>>>>>>>>>>>   6.25%  kvm                    [.] 0x2d214b
>>>>>>>>>>>>>   1.97%  [kernel]               [k] _raw_spin_lock
>>>>>>>>>>>>>   1.47%  [kernel]               [k] native_write_msr_safe
>>>>>>>>>>>>>   1.44%  [kernel]               [k] hrtimer_interrupt
>>>>>>>>>>>>>   1.37%  [kernel]               [k] __hrtimer_start_range_ns
>>>>>>>>>>>>>   1.34%  [kernel]               [k] do_select
>>>>>>>>>>>>>   1.29%  [kernel]               [k] fget_light
>>>>>>>>>>>>>   1.24%  [kernel]               [k] clockevents_program_event
>>>>>>>>>>>>>   1.21%  [kernel]               [k] default_send_IPI_mask_sequence_phys
>>>>>>>>>>>>>   1.18%  [kernel]               [k] cpumask_next_and
>>>>>>>>>>>>>   1.18%  [kernel]               [k] _raw_spin_unlock_irqrestore
>>>>>>>>>>>>>   1.15%  [kernel]               [k] find_next_bit
>>>>>>>>>>>>>   1.14%  [kernel]               [k] __schedule
>>>>>>>>>>>>>   1.11%  [kernel]               [k] _raw_spin_lock_irqsave
>>>>>>>>>>>>>   0.98%  [kernel]               [k] apic_timer_interrupt
>>>>>>>>>>>>>   0.86%  [kernel]               [k] copy_user_generic_string
>>>>>>>>>>>>>   0.83%  [kernel]               [k] native_write_cr0
>>>>>>>>>>>>>   0.76%  [kernel]               [k] sync_inodes_sb
>>>>>>>>>>>>>   0.71%  [kernel]               [k] rcu_needs_cpu
>>>>>>>>>>>>>   0.69%  libpthread-2.11.3.so   [.] pthread_mutex_lock
>>>>>>>>>>>>>   0.66%  [kernel]               [k] fput
>>>>>>>>>>>>>   0.62%  [kernel]               [k] load_balance
>>>>>>>>>>>>>   0.57%  [vdso]                 [.] 0x7fff3a976700
>>>>>>>>>>>>>   0.56%  libc-2.11.3.so         [.] memcpy
>>>>>>>>>>>>>   0.56%  [kernel]               [k] reschedule_interrupt
>>>>>>>>>>>>>   0.56%  [kernel]               [k] tg_load_down
>>>>>>>>>>>>>   0.50%  [kernel]               [k] iput
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see lots of new differences, but again don't know what to make of them
>>>>>>>>>>>>> or what might be related or significant.  LevelDB seems to jump out
>>>>>>>>>>>>> this time, amongst others.  Let me know if you need more info.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>    Regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>>      Oliver
>>>>>>>>>>>>>
>>>>>>>>>>>>> On ma, 2013-08-19 at 15:21 -0500, Mark Nelson wrote:
>>>>>>>>>>>>>> Hi Oliver,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Glad that helped!  How much more efficient do the cuttlefish OSDs seem
>>>>>>>>>>>>>> at this point (with wip-dumpling-pglog-undirty)?  On modern Intel
>>>>>>>>>>>>>> platforms we were actually hoping to see CPU usage go down in many cases
>>>>>>>>>>>>>> due to the use of hardware CRC32 instructions.
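>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (A quick way to check whether a node even has the SSE4.2 CRC32
>>>>>>>>>>>>>> instructions available:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> $ grep -m1 -o sse4_2 /proc/cpuinfo
>>>>>>>>>>>>>> sse4_2
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If that prints nothing, the crc32c work falls back to the software
>>>>>>>>>>>>>> path.)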
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Mark
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 08/19/2013 03:06 PM, Oliver Daudey wrote:
>>>>>>>>>>>>>>> Hey Samuel,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!  I installed your version, repeated the same tests on my
>>>>>>>>>>>>>>> test-cluster and the extra CPU-loading seems to have disappeared.  Then
>>>>>>>>>>>>>>> I replaced one osd of my production-cluster with your modified version
>>>>>>>>>>>>>>> and its config-option, and it seems to be a lot less CPU-hungry now.
>>>>>>>>>>>>>>> Although the Cuttlefish-osds still seem to be even more CPU-efficient,
>>>>>>>>>>>>>>> your changes have definitely helped a lot.  We seem to be looking in the
>>>>>>>>>>>>>>> right direction, at least for this part of the problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> BTW, I ran `perf top' on the production-node with your modified osd and
>>>>>>>>>>>>>>> didn't see anything osd-related stand out on top.  "PGLog::undirty()"
>>>>>>>>>>>>>>> was in there, but with much lower usage, right at the bottom of the
>>>>>>>>>>>>>>> green part of the output.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Many thanks for your help so far!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     Regards,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>       Oliver
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On ma, 2013-08-19 at 00:29 -0700, Samuel Just wrote:
>>>>>>>>>>>>>>>> You're right, PGLog::undirty() looks suspicious.  I just pushed a
>>>>>>>>>>>>>>>> branch wip-dumpling-pglog-undirty with a new config
>>>>>>>>>>>>>>>> (osd_debug_pg_log_writeout) which if set to false will disable some
>>>>>>>>>>>>>>>> strictly debugging checks which occur in PGLog::undirty().  We haven't
>>>>>>>>>>>>>>>> actually seen these checks causing excessive cpu usage, so this may be
>>>>>>>>>>>>>>>> a red herring.
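>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (To try it, presumably something like this in ceph.conf on the osd
>>>>>>>>>>>>>>>> nodes, followed by an osd restart:)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [osd]
>>>>>>>>>>>>>>>>     osd debug pg log writeout = false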
>>>>>>>>>>>>>>>> -Sam
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
>>>>>>>>>>>>>>>>> Hey Mark,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On za, 2013-08-17 at 08:16 -0500, Mark Nelson wrote:
>>>>>>>>>>>>>>>>>> On 08/17/2013 06:13 AM, Oliver Daudey wrote:
>>>>>>>>>>>>>>>>>>> Hey all,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040) I
>>>>>>>>>>>>>>>>>>> created in the tracker.  Thought I would pass it through the list as
>>>>>>>>>>>>>>>>>>> well, to get an idea if anyone else is running into it.  It may only
>>>>>>>>>>>>>>>>>>> show under higher loads.  More info about my setup is in the bug-report
>>>>>>>>>>>>>>>>>>> above.  Here goes:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm running a Ceph-cluster with 3 nodes, each of which runs a mon, osd
>>>>>>>>>>>>>>>>>>> and mds. I'm using RBD on this cluster as storage for KVM, CephFS is
>>>>>>>>>>>>>>>>>>> unused at this time. While still on v0.61.7 Cuttlefish, I got 70-100
>>>>>>>>>>>>>>>>>>> +MB/sec on simple linear writes to a file with `dd' inside a VM on this
>>>>>>>>>>>>>>>>>>> cluster under regular load and the osds usually averaged 20-100%
>>>>>>>>>>>>>>>>>>> CPU-utilisation in `top'. After the upgrade to Dumpling, CPU-usage for
>>>>>>>>>>>>>>>>>>> the osds shot up to 100% to 400% in `top' (multi-core system) and the
>>>>>>>>>>>>>>>>>>> speed for my writes with `dd' inside a VM dropped to 20-40MB/sec. Users
>>>>>>>>>>>>>>>>>>> complained that disk-access inside the VMs was significantly slower and
>>>>>>>>>>>>>>>>>>> the backups of the RBD-store I was running also fell behind quickly.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> After downgrading only the osds to v0.61.7 Cuttlefish and leaving the
>>>>>>>>>>>>>>>>>>> rest at 0.67 Dumpling, speed and load returned to normal. I have
>>>>>>>>>>>>>>>>>>> repeated this performance-hit upon upgrade on a similar test-cluster
>>>>>>>>>>>>>>>>>>> under no additional load at all. Although CPU-usage for the osds wasn't
>>>>>>>>>>>>>>>>>>> as dramatic during these tests because there was no base-load from other
>>>>>>>>>>>>>>>>>>> VMs, I/O-performance dropped significantly after upgrading during these
>>>>>>>>>>>>>>>>>>> tests as well, and returned to normal after downgrading the osds.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm not sure what to make of it. There are no visible errors in the logs
>>>>>>>>>>>>>>>>>>> and everything runs and reports good health, it's just a lot slower,
>>>>>>>>>>>>>>>>>>> with a lot more CPU-usage.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Oliver,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If you have access to the perf command on this system, could you try
>>>>>>>>>>>>>>>>>> running:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> "sudo perf top"
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> And if that doesn't give you much,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> "sudo perf record -g"
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> then:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> "sudo perf report | less"
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> during the period of high CPU usage?  This will give you a call graph.
>>>>>>>>>>>>>>>>>> There may be symbols missing, but it might help track down what the OSDs
>>>>>>>>>>>>>>>>>> are doing.
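>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> (If it helps to narrow things down to the osd process itself,
>>>>>>>>>>>>>>>>>> something like this should also work, assuming a single ceph-osd
>>>>>>>>>>>>>>>>>> per host; otherwise substitute the pid directly:)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> sudo perf top -p $(pidof ceph-osd)
>>>>>>>>>>>>>>>>>> sudo perf record -g -p $(pidof ceph-osd) -- sleep 60
>>>>>>>>>>>>>>>>>> sudo perf report | less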
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for your help!  I did a couple of runs on my test-cluster,
>>>>>>>>>>>>>>>>> loading it with writes from 3 VMs concurrently and measuring the results
>>>>>>>>>>>>>>>>> at the first node with all 0.67 Dumpling-components and with the osds
>>>>>>>>>>>>>>>>> replaced by 0.61.7 Cuttlefish.  I let `perf top' run and settle for a
>>>>>>>>>>>>>>>>> while, then copied anything that showed in red and green into this post.
>>>>>>>>>>>>>>>>> Here are the results (sorry for the word-wraps):
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> First, with 0.61.7 osds:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>   19.91%  [kernel]                    [k] intel_idle
>>>>>>>>>>>>>>>>>   10.18%  [kernel]                    [k] _raw_spin_lock_irqsave
>>>>>>>>>>>>>>>>>    6.79%  ceph-osd                    [.] ceph_crc32c_le
>>>>>>>>>>>>>>>>>    4.93%  [kernel]                    [k] default_send_IPI_mask_sequence_phys
>>>>>>>>>>>>>>>>>    2.71%  [kernel]                    [k] copy_user_generic_string
>>>>>>>>>>>>>>>>>    1.42%  libc-2.11.3.so              [.] memcpy
>>>>>>>>>>>>>>>>>    1.23%  [kernel]                    [k] find_busiest_group
>>>>>>>>>>>>>>>>>    1.13%  librados.so.2.0.0           [.] ceph_crc32c_le_intel
>>>>>>>>>>>>>>>>>    1.11%  [kernel]                    [k] _raw_spin_lock
>>>>>>>>>>>>>>>>>    0.99%  kvm                         [.] 0x1931f8
>>>>>>>>>>>>>>>>>    0.92%  [igb]                       [k] igb_poll
>>>>>>>>>>>>>>>>>    0.87%  [kernel]                    [k] native_write_cr0
>>>>>>>>>>>>>>>>>    0.80%  [kernel]                    [k] csum_partial
>>>>>>>>>>>>>>>>>    0.78%  [kernel]                    [k] __do_softirq
>>>>>>>>>>>>>>>>>    0.63%  [kernel]                    [k] hpet_legacy_next_event
>>>>>>>>>>>>>>>>>    0.53%  [ip_tables]                 [k] ipt_do_table
>>>>>>>>>>>>>>>>>    0.50%  libc-2.11.3.so              [.] 0x74433
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Second test, with 0.67 osds:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>   18.32%  [kernel]                      [k] intel_idle
>>>>>>>>>>>>>>>>>    7.58%  [kernel]                      [k] _raw_spin_lock_irqsave
>>>>>>>>>>>>>>>>>    7.04%  ceph-osd                      [.] PGLog::undirty()
>>>>>>>>>>>>>>>>>    4.39%  ceph-osd                      [.] ceph_crc32c_le_intel
>>>>>>>>>>>>>>>>>    3.92%  [kernel]                      [k] default_send_IPI_mask_sequence_phys
>>>>>>>>>>>>>>>>>    2.25%  [kernel]                      [k] copy_user_generic_string
>>>>>>>>>>>>>>>>>    1.76%  libc-2.11.3.so                [.] memcpy
>>>>>>>>>>>>>>>>>    1.56%  librados.so.2.0.0             [.] ceph_crc32c_le_intel
>>>>>>>>>>>>>>>>>    1.40%  libc-2.11.3.so                [.] vfprintf
>>>>>>>>>>>>>>>>>    1.12%  libc-2.11.3.so                [.] 0x7217b
>>>>>>>>>>>>>>>>>    1.05%  [kernel]                      [k] _raw_spin_lock
>>>>>>>>>>>>>>>>>    1.01%  [kernel]                      [k] find_busiest_group
>>>>>>>>>>>>>>>>>    0.83%  kvm                           [.] 0x193ab8
>>>>>>>>>>>>>>>>>    0.80%  [kernel]                      [k] native_write_cr0
>>>>>>>>>>>>>>>>>    0.76%  [kernel]                      [k] __do_softirq
>>>>>>>>>>>>>>>>>    0.73%  libc-2.11.3.so                [.] _IO_default_xsputn
>>>>>>>>>>>>>>>>>    0.70%  [kernel]                      [k] csum_partial
>>>>>>>>>>>>>>>>>    0.68%  [igb]                         [k] igb_poll
>>>>>>>>>>>>>>>>>    0.58%  [kernel]                      [k] hpet_legacy_next_event
>>>>>>>>>>>>>>>>>    0.54%  [kernel]                      [k] __schedule
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> What jumps right out is "PGLog::undirty()", which doesn't show up
>>>>>>>>>>>>>>>>> with 0.61.7 at all, but sits near the top of the profile in 0.67.
>>>>>>>>>>>>>>>>> Note that I didn't manage to fully load the test-cluster CPU-wise,
>>>>>>>>>>>>>>>>> because of network-constraints and I don't want to take any extra risks
>>>>>>>>>>>>>>>>> on the production-cluster and test it there, but it seems we found a
>>>>>>>>>>>>>>>>> possible culprit.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Any ideas?  Thanks again!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>     Regards,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>        Oliver
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>>
>>
> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



