There haven't been any significant osd-side changes that I can think of.  Is
CPU usage still high?  If so, can you post the profiler results again?
-Sam

On Wed, Aug 21, 2013 at 12:02 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
> Hey Samuel,
>
> I had a good run on the production-cluster with it and unfortunately, it
> still doesn't seem to have solved the problem.  It seemed OK for a while
> and individual OSD CPU-usage seemed quite low, but as the cluster's load
> increased during the day, things got slower again.  Write-performance
> within a VM crawled to 30MB/sec and at some point, I got only 10MB/sec
> on reads in that same VM.  I also did RADOS bench-tests with `rados
> --pool rbd bench 120 write' and those got several hundred MB/sec on the
> same cluster at the same time of day, so maybe the problem is
> RBD-related.  Is there any code in the OSD that could influence
> RBD-performance alone?  Do you know of any other significant changes to
> the OSD between Cuttlefish and Dumpling that could result in this?
>
> PS: I also did the same RADOS bench-tests on my test-cluster, both with
> Cuttlefish and Dumpling without your fix, and got almost identical
> results.  This confirms that the problem might be in RBD, as Mark
> suggested.
>
>
> Regards,
>
>   Oliver
>
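As a rough way to separate the two paths Oliver describes, the librados
benchmark and an in-VM write test can be run back to back.  The sketch below
is only illustrative: the pool name and runtime are taken from his mail, the
dd file name and sizes are arbitrary, and it assumes the VM's disk is an RBD
image on the same cluster.

  # Baseline through librados, bypassing RBD and the VM stack entirely
  # (same invocation as in the mail above):
  rados --pool rbd bench 120 write

  # Roughly comparable write load through the RBD path, run *inside* the VM;
  # file name and sizes are placeholders:
  dd if=/dev/zero of=ddtest.bin bs=4M count=256 conv=fdatasync
  rm -f ddtest.bin

If the rados numbers stay high while the in-VM dd stays slow, that points at
the RBD/client side rather than raw OSD throughput.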
> On 20-08-13 19:40, Samuel Just wrote:
>> Can you try dumpling head without the option?
>> -Sam
>>
>> On Tue, Aug 20, 2013 at 1:44 AM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
>>> Hey Mark,
>>>
>>> Sorry, but after some more tests I have to report that it only worked
>>> partly.  The load seems lower with "wip-dumpling-pglog-undirty" in
>>> place, but the Cuttlefish-osd still seems significantly faster and,
>>> even with "wip-dumpling-pglog-undirty" in place, things slow down way
>>> too much under load.  Unfortunately, only my production-cluster seems
>>> busy enough to actually show the problem clearly by slowing down.
>>> Below is `perf top'-output, fresh from my production-cluster under its
>>> regular load:
>>>
>>> First, the 0.67.1-6-g0c4f2f3 osd with "osd debug pg log writeout = false":
>>>
>>>  16.53%  [kernel]              [k] intel_idle
>>>   6.47%  libleveldb.so.1.9     [.] 0x380a1
>>>   5.76%  [kernel]              [k] find_busiest_group
>>>   4.11%  libc-2.11.3.so        [.] memcmp
>>>   3.95%  kvm                   [.] 0x1f6f31
>>>   2.05%  [kernel]              [k] default_send_IPI_mask_sequence_phys
>>>   2.03%  [kernel]              [k] _raw_spin_lock
>>>   1.87%  libleveldb.so.1.9     [.] leveldb::InternalKeyComparator::Compar
>>>   1.57%  libc-2.11.3.so        [.] memcpy
>>>   1.37%  libleveldb.so.1.9     [.] leveldb::Block::Iter::Next()
>>>   1.26%  [kernel]              [k] hrtimer_interrupt
>>>   1.12%  [kernel]              [k] __hrtimer_start_range_ns
>>>   1.09%  [kernel]              [k] native_write_cr0
>>>   1.05%  libstdc++.so.6.0.13   [.] std::string::_M_mutate(unsigned long,
>>>   1.00%  [kernel]              [k] native_write_msr_safe
>>>   0.99%  [kernel]              [k] apic_timer_interrupt
>>>   0.98%  [kernel]              [k] clockevents_program_event
>>>   0.96%  [kernel]              [k] _raw_spin_unlock_irqrestore
>>>   0.95%  ceph-osd              [.] PGLog::undirty()
>>>   0.92%  [kernel]              [k] find_next_bit
>>>   0.91%  libsnappy.so.1.1.2    [.] snappy::RawUncompress(snappy::Source*,
>>>   0.88%  [kernel]              [k] __schedule
>>>   0.87%  [kernel]              [k] cpumask_next_and
>>>   0.84%  [kernel]              [k] do_select
>>>   0.80%  [kernel]              [k] fget_light
>>>   0.77%  [kernel]              [k] reschedule_interrupt
>>>   0.75%  [kernel]              [k] _raw_spin_lock_irqsave
>>>   0.62%  libstdc++.so.6.0.13   [.] std::string::append(char const*, unsig
>>>   0.59%  [kvm_intel]           [k] vmx_vcpu_run
>>>   0.58%  [kernel]              [k] copy_user_generic_string
>>>   0.56%  [kernel]              [k] load_balance
>>>   0.54%  [kernel]              [k] tg_load_down
>>>   0.53%  libpthread-2.11.3.so  [.] pthread_mutex_lock
>>>   0.52%  [kernel]              [k] sync_inodes_sb
>>>
>>> Second, the 0.61.8 osd, under identical load:
>>>
>>>  21.51%  [kernel]              [k] intel_idle
>>>   6.66%  [kernel]              [k] find_busiest_group
>>>   6.25%  kvm                   [.] 0x2d214b
>>>   1.97%  [kernel]              [k] _raw_spin_lock
>>>   1.47%  [kernel]              [k] native_write_msr_safe
>>>   1.44%  [kernel]              [k] hrtimer_interrupt
>>>   1.37%  [kernel]              [k] __hrtimer_start_range_ns
>>>   1.34%  [kernel]              [k] do_select
>>>   1.29%  [kernel]              [k] fget_light
>>>   1.24%  [kernel]              [k] clockevents_program_event
>>>   1.21%  [kernel]              [k] default_send_IPI_mask_sequence_phys
>>>   1.18%  [kernel]              [k] cpumask_next_and
>>>   1.18%  [kernel]              [k] _raw_spin_unlock_irqrestore
>>>   1.15%  [kernel]              [k] find_next_bit
>>>   1.14%  [kernel]              [k] __schedule
>>>   1.11%  [kernel]              [k] _raw_spin_lock_irqsave
>>>   0.98%  [kernel]              [k] apic_timer_interrupt
>>>   0.86%  [kernel]              [k] copy_user_generic_string
>>>   0.83%  [kernel]              [k] native_write_cr0
>>>   0.76%  [kernel]              [k] sync_inodes_sb
>>>   0.71%  [kernel]              [k] rcu_needs_cpu
>>>   0.69%  libpthread-2.11.3.so  [.] pthread_mutex_lock
>>>   0.66%  [kernel]              [k] fput
>>>   0.62%  [kernel]              [k] load_balance
>>>   0.57%  [vdso]                [.] 0x7fff3a976700
>>>   0.56%  libc-2.11.3.so        [.] memcpy
>>>   0.56%  [kernel]              [k] reschedule_interrupt
>>>   0.56%  [kernel]              [k] tg_load_down
>>>   0.50%  [kernel]              [k] iput
>>>
>>> I see lots of new differences, but again don't know what to make of them
>>> or what might be related or significant.  LevelDB seems to jump out this
>>> time, amongst others.  Let me know if you need more info.
>>>
>>>
>>> Regards,
>>>
>>>   Oliver
>>>
>>> On Mon, 2013-08-19 at 15:21 -0500, Mark Nelson wrote:
>>>> Hi Oliver,
>>>>
>>>> Glad that helped!  How much more efficient do the cuttlefish OSDs seem
>>>> at this point (with wip-dumpling-pglog-undirty)?  On modern Intel
>>>> platforms we were actually hoping to see CPU usage go down in many
>>>> cases due to the use of hardware CRC32 instructions.
>>>>
>>>> Mark
>>>>
>>>> On 08/19/2013 03:06 PM, Oliver Daudey wrote:
>>>>> Hey Samuel,
>>>>>
>>>>> Thanks!  I installed your version, repeated the same tests on my
>>>>> test-cluster and the extra CPU-loading seems to have disappeared.
>>>>> Then I replaced one osd of my production-cluster with your modified
>>>>> version and its config-option and it seems to be a lot less
>>>>> CPU-hungry now.  Although the Cuttlefish-osds still seem to be even
>>>>> more CPU-efficient, your changes have definitely helped a lot.  We
>>>>> seem to be looking in the right direction, at least for this part of
>>>>> the problem.
>>>>>
>>>>> BTW, I ran `perf top' on the production-node with your modified osd
>>>>> and didn't see anything osd-related stand out on top.
>>>>> "PGLog::undirty()" was in there, but with much lower usage, right at
>>>>> the bottom of the green part of the output.
>>>>>
>>>>> Many thanks for your help so far!
>>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>>   Oliver
>>>>>
>>>>> On Mon, 2013-08-19 at 00:29 -0700, Samuel Just wrote:
>>>>>> You're right, PGLog::undirty() looks suspicious.  I just pushed a
>>>>>> branch wip-dumpling-pglog-undirty with a new config
>>>>>> (osd_debug_pg_log_writeout) which, if set to false, will disable
>>>>>> some strictly-debugging checks which occur in PGLog::undirty().  We
>>>>>> haven't actually seen these checks causing excessive cpu usage, so
>>>>>> this may be a red herring.
>>>>>> -Sam
>>>>>>
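For reference, disabling the debugging checks Sam describes is a plain
config change.  A minimal sketch of the ceph.conf stanza, assuming the
option name given above and that the affected OSDs are restarted so the new
value takes effect:

  [osd]
      # from the wip-dumpling-pglog-undirty branch: skip the strictly
      # debugging checks in PGLog::undirty() (defaults to true)
      osd debug pg log writeout = false
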
>>>>>> On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
>>>>>>> Hey Mark,
>>>>>>>
>>>>>>> On Sat, 2013-08-17 at 08:16 -0500, Mark Nelson wrote:
>>>>>>>> On 08/17/2013 06:13 AM, Oliver Daudey wrote:
>>>>>>>>> Hey all,
>>>>>>>>>
>>>>>>>>> This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040)
>>>>>>>>> I created in the tracker.  Thought I would pass it through the
>>>>>>>>> list as well, to get an idea if anyone else is running into it.
>>>>>>>>> It may only show under higher loads.  More info about my setup is
>>>>>>>>> in the bug-report above.  Here goes:
>>>>>>>>>
>>>>>>>>> I'm running a Ceph-cluster with 3 nodes, each of which runs a mon,
>>>>>>>>> osd and mds.  I'm using RBD on this cluster as storage for KVM;
>>>>>>>>> CephFS is unused at this time.  While still on v0.61.7 Cuttlefish,
>>>>>>>>> I got 70-100+MB/sec on simple linear writes to a file with `dd'
>>>>>>>>> inside a VM on this cluster under regular load, and the osds
>>>>>>>>> usually averaged 20-100% CPU-utilisation in `top'.  After the
>>>>>>>>> upgrade to Dumpling, CPU-usage for the osds shot up to 100% to
>>>>>>>>> 400% in `top' (multi-core system) and the speed for my writes with
>>>>>>>>> `dd' inside a VM dropped to 20-40MB/sec.  Users complained that
>>>>>>>>> disk-access inside the VMs was significantly slower and the
>>>>>>>>> backups of the RBD-store I was running also fell behind quickly.
>>>>>>>>>
>>>>>>>>> After downgrading only the osds to v0.61.7 Cuttlefish and leaving
>>>>>>>>> the rest at 0.67 Dumpling, speed and load returned to normal.  I
>>>>>>>>> have repeated this performance-hit upon upgrade on a similar
>>>>>>>>> test-cluster under no additional load at all.  Although CPU-usage
>>>>>>>>> for the osds wasn't as dramatic during these tests because there
>>>>>>>>> was no base-load from other VMs, I/O-performance dropped
>>>>>>>>> significantly after upgrading during these tests as well, and
>>>>>>>>> returned to normal after downgrading the osds.
>>>>>>>>>
>>>>>>>>> I'm not sure what to make of it.  There are no visible errors in
>>>>>>>>> the logs and everything runs and reports good health; it's just a
>>>>>>>>> lot slower, with a lot more CPU-usage.
>>>>>>>>
>>>>>>>> Hi Oliver,
>>>>>>>>
>>>>>>>> If you have access to the perf command on this system, could you
>>>>>>>> try running:
>>>>>>>>
>>>>>>>> "sudo perf top"
>>>>>>>>
>>>>>>>> And if that doesn't give you much,
>>>>>>>>
>>>>>>>> "sudo perf record -g"
>>>>>>>>
>>>>>>>> then:
>>>>>>>>
>>>>>>>> "sudo perf report | less"
>>>>>>>>
>>>>>>>> during the period of high CPU usage?  This will give you a call
>>>>>>>> graph.  There may be symbols missing, but it might help track down
>>>>>>>> what the OSDs are doing.
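A concrete way to apply Mark's suggestion to a single OSD daemon, rather
than the whole node, might look like the sketch below.  The osd id in the
pgrep pattern, the output file name and the 60-second window are
placeholders, and it assumes perf plus the ceph debug symbols are installed:

  # Profile one ceph-osd process for 60 seconds with call graphs,
  # then browse the report; "osd 0" is only an example.
  OSD_PID=$(pgrep -f 'ceph-osd -i 0')
  sudo perf record -g -p "$OSD_PID" -o perf-osd0.data -- sleep 60
  sudo perf report -i perf-osd0.data | less

Restricting the recording to one ceph-osd pid keeps kvm and kernel noise out
of the profile, which makes the osd-internal symbols easier to compare
between versions.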
>>>>>>>
>>>>>>> Thanks for your help!  I did a couple of runs on my test-cluster,
>>>>>>> loading it with writes from 3 VMs concurrently and measuring the
>>>>>>> results at the first node, with all 0.67 Dumpling-components and
>>>>>>> with the osds replaced by 0.61.7 Cuttlefish.  I let `perf top' run
>>>>>>> and settle for a while, then copied anything that showed in red and
>>>>>>> green into this post.  Here are the results:
>>>>>>>
>>>>>>> First, with 0.61.7 osds:
>>>>>>>
>>>>>>>  19.91%  [kernel]           [k] intel_idle
>>>>>>>  10.18%  [kernel]           [k] _raw_spin_lock_irqsave
>>>>>>>   6.79%  ceph-osd           [.] ceph_crc32c_le
>>>>>>>   4.93%  [kernel]           [k] default_send_IPI_mask_sequence_phys
>>>>>>>   2.71%  [kernel]           [k] copy_user_generic_string
>>>>>>>   1.42%  libc-2.11.3.so     [.] memcpy
>>>>>>>   1.23%  [kernel]           [k] find_busiest_group
>>>>>>>   1.13%  librados.so.2.0.0  [.] ceph_crc32c_le_intel
>>>>>>>   1.11%  [kernel]           [k] _raw_spin_lock
>>>>>>>   0.99%  kvm                [.] 0x1931f8
>>>>>>>   0.92%  [igb]              [k] igb_poll
>>>>>>>   0.87%  [kernel]           [k] native_write_cr0
>>>>>>>   0.80%  [kernel]           [k] csum_partial
>>>>>>>   0.78%  [kernel]           [k] __do_softirq
>>>>>>>   0.63%  [kernel]           [k] hpet_legacy_next_event
>>>>>>>   0.53%  [ip_tables]        [k] ipt_do_table
>>>>>>>   0.50%  libc-2.11.3.so     [.] 0x74433
>>>>>>>
>>>>>>> Second test, with 0.67 osds:
>>>>>>>
>>>>>>>  18.32%  [kernel]           [k] intel_idle
>>>>>>>   7.58%  [kernel]           [k] _raw_spin_lock_irqsave
>>>>>>>   7.04%  ceph-osd           [.] PGLog::undirty()
>>>>>>>   4.39%  ceph-osd           [.] ceph_crc32c_le_intel
>>>>>>>   3.92%  [kernel]           [k] default_send_IPI_mask_sequence_phys
>>>>>>>   2.25%  [kernel]           [k] copy_user_generic_string
>>>>>>>   1.76%  libc-2.11.3.so     [.] memcpy
>>>>>>>   1.56%  librados.so.2.0.0  [.] ceph_crc32c_le_intel
>>>>>>>   1.40%  libc-2.11.3.so     [.] vfprintf
>>>>>>>   1.12%  libc-2.11.3.so     [.] 0x7217b
>>>>>>>   1.05%  [kernel]           [k] _raw_spin_lock
>>>>>>>   1.01%  [kernel]           [k] find_busiest_group
>>>>>>>   0.83%  kvm                [.] 0x193ab8
>>>>>>>   0.80%  [kernel]           [k] native_write_cr0
>>>>>>>   0.76%  [kernel]           [k] __do_softirq
>>>>>>>   0.73%  libc-2.11.3.so     [.] _IO_default_xsputn
>>>>>>>   0.70%  [kernel]           [k] csum_partial
>>>>>>>   0.68%  [igb]              [k] igb_poll
>>>>>>>   0.58%  [kernel]           [k] hpet_legacy_next_event
>>>>>>>   0.54%  [kernel]           [k] __schedule
>>>>>>>
>>>>>>> What jumps right out is "PGLog::undirty()", which doesn't show up
>>>>>>> with 0.61.7 at all, but is an extra drag right at top-usage in 0.67.
>>>>>>> Note that I didn't manage to fully load the test-cluster CPU-wise,
>>>>>>> because of network-constraints, and I don't want to take any extra
>>>>>>> risks on the production-cluster and test it there, but it seems we
>>>>>>> found a possible culprit.
>>>>>>>
>>>>>>> Any ideas?  Thanks again!
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>>   Oliver
>>>>>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com