Re: Significant slowdown of osds since v0.67 Dumpling

Hey Mark,

I'm working on a simple way to reproduce this in case we fail to
identify the cause, and `rados bench' indeed looks promising: it can
also simulate many concurrent ops, which is probably what sets the 3-VM
load on my test-cluster apart from the ~80-VM load on my
production-cluster.  I'll look into it and try to reproduce the problem
on as simple a setup as possible, thanks!
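For reference, Mark's suggested invocation, filled in with concrete
values, would be something like the sketch below.  The pool name, object
size, and op count are illustrative assumptions on my part, not values
from the thread; the script only runs the benchmark if a cluster is
actually reachable, and otherwise just prints the command:

```shell
# Assumed example values -- substitute your own.
POOL=rbd                    # pool to bench against (assumption)
SIZE=$((4 * 1024 * 1024))   # 4 MiB objects
SECS=120                    # run length in seconds
OPS=16                      # concurrent ops in flight

CMD="rados -p $POOL -b $SIZE bench $SECS write -t $OPS"

# Only run against a live, reachable cluster; otherwise print the command.
if command -v rados >/dev/null && timeout 5 ceph -s >/dev/null 2>&1; then
    $CMD
else
    echo "$CMD"
fi
```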

The command I used in VMs is a simple `dd if=/dev/zero of=<somefile>
bs=1024M count=2'.
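That dd test can be made repeatable with a small script.  A sketch (the
size is reduced so it runs anywhere; conv=fdatasync is my addition, not
part of the original command -- it forces a flush to disk so the page
cache doesn't inflate the measured throughput, which plain if=/dev/zero
writes otherwise tend to do):

```shell
# Repeatable variant of the dd test above (smaller size for illustration).
# conv=fdatasync makes dd sync before exiting, so the elapsed time
# reflects actual writeback rather than the page cache.
TARGET=$(mktemp)
dd if=/dev/zero of="$TARGET" bs=1M count=64 conv=fdatasync 2>/dev/null
BYTES=$(stat -c %s "$TARGET")   # bytes actually written
rm -f "$TARGET"
echo "$BYTES"
```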

Samuel asked me to try the dumpling head-branch, so I'll do that first.
Thanks!


   Regards,

      Oliver

On 20-08-13 19:53, Mark Nelson wrote:
> Hi Oliver,
> 
> Have you ever tried using RADOS bench?
> 
> On one of the client nodes, you can do:
> 
> rados -p <pool> -b <size in bytes> bench 120 write -t <concurrent ops>
> 
> It would be useful to know if that is also slower without any of the RBD
> code involved.  Btw, what was the dd command line you were using for
> testing?
> 
> Thanks,
> Mark
> 
> 
> On 08/20/2013 03:44 AM, Oliver Daudey wrote:
>> Hey Mark,
>>
>> Sorry, but after some more tests I have to report that it only worked
>> partly.  The load seems lower with "wip-dumpling-pglog-undirty" in
>> place, but the Cuttlefish-osd still seems significantly faster and even
>> with "wip-dumpling-pglog-undirty" in place, things slow down way too
>> much under load.  Unfortunately, only my production-cluster seems busy
>> enough to actually show the problem clearly by slowing down.  Below is
>> `perf top'-output, fresh from my production-cluster under its regular
>> load:
>>
>> First, the 0.67.1-6-g0c4f2f3 osd with "osd debug pg log writeout =
>> false":
>>   16.53%  [kernel]               [k] intel_idle
>>    6.47%  libleveldb.so.1.9      [.] 0x380a1
>>    5.76%  [kernel]               [k] find_busiest_group
>>    4.11%  libc-2.11.3.so         [.] memcmp
>>    3.95%  kvm                    [.] 0x1f6f31
>>    2.05%  [kernel]               [k] default_send_IPI_mask_sequence_phys
>>    2.03%  [kernel]               [k] _raw_spin_lock
>>    1.87%  libleveldb.so.1.9      [.] leveldb::InternalKeyComparator::Compar
>>    1.57%  libc-2.11.3.so         [.] memcpy
>>    1.37%  libleveldb.so.1.9      [.] leveldb::Block::Iter::Next()
>>    1.26%  [kernel]               [k] hrtimer_interrupt
>>    1.12%  [kernel]               [k] __hrtimer_start_range_ns
>>    1.09%  [kernel]               [k] native_write_cr0
>>    1.05%  libstdc++.so.6.0.13    [.] std::string::_M_mutate(unsigned long,
>>    1.00%  [kernel]               [k] native_write_msr_safe
>>    0.99%  [kernel]               [k] apic_timer_interrupt
>>    0.98%  [kernel]               [k] clockevents_program_event
>>    0.96%  [kernel]               [k] _raw_spin_unlock_irqrestore
>>    0.95%  ceph-osd               [.] PGLog::undirty()
>>    0.92%  [kernel]               [k] find_next_bit
>>    0.91%  libsnappy.so.1.1.2     [.] snappy::RawUncompress(snappy::Source*,
>>    0.88%  [kernel]               [k] __schedule
>>    0.87%  [kernel]               [k] cpumask_next_and
>>    0.84%  [kernel]               [k] do_select
>>    0.80%  [kernel]               [k] fget_light
>>    0.77%  [kernel]               [k] reschedule_interrupt
>>    0.75%  [kernel]               [k] _raw_spin_lock_irqsave
>>    0.62%  libstdc++.so.6.0.13    [.] std::string::append(char const*, unsig
>>    0.59%  [kvm_intel]            [k] vmx_vcpu_run
>>    0.58%  [kernel]               [k] copy_user_generic_string
>>    0.56%  [kernel]               [k] load_balance
>>    0.54%  [kernel]               [k] tg_load_down
>>    0.53%  libpthread-2.11.3.so   [.] pthread_mutex_lock
>>    0.52%  [kernel]               [k] sync_inodes_sb
>>
>> Second, the 0.61.8 osd, under identical load:
>>   21.51%  [kernel]               [k] intel_idle
>>    6.66%  [kernel]               [k] find_busiest_group
>>    6.25%  kvm                    [.] 0x2d214b
>>    1.97%  [kernel]               [k] _raw_spin_lock
>>    1.47%  [kernel]               [k] native_write_msr_safe
>>    1.44%  [kernel]               [k] hrtimer_interrupt
>>    1.37%  [kernel]               [k] __hrtimer_start_range_ns
>>    1.34%  [kernel]               [k] do_select
>>    1.29%  [kernel]               [k] fget_light
>>    1.24%  [kernel]               [k] clockevents_program_event
>>    1.21%  [kernel]               [k] default_send_IPI_mask_sequence_phys
>>    1.18%  [kernel]               [k] cpumask_next_and
>>    1.18%  [kernel]               [k] _raw_spin_unlock_irqrestore
>>    1.15%  [kernel]               [k] find_next_bit
>>    1.14%  [kernel]               [k] __schedule
>>    1.11%  [kernel]               [k] _raw_spin_lock_irqsave
>>    0.98%  [kernel]               [k] apic_timer_interrupt
>>    0.86%  [kernel]               [k] copy_user_generic_string
>>    0.83%  [kernel]               [k] native_write_cr0
>>    0.76%  [kernel]               [k] sync_inodes_sb
>>    0.71%  [kernel]               [k] rcu_needs_cpu
>>    0.69%  libpthread-2.11.3.so   [.] pthread_mutex_lock
>>    0.66%  [kernel]               [k] fput
>>    0.62%  [kernel]               [k] load_balance
>>    0.57%  [vdso]                 [.] 0x7fff3a976700
>>    0.56%  libc-2.11.3.so         [.] memcpy
>>    0.56%  [kernel]               [k] reschedule_interrupt
>>    0.56%  [kernel]               [k] tg_load_down
>>    0.50%  [kernel]               [k] iput
>>
>> I see lots of new differences, but again don't know what to make of it
>> and what might be related or significant.  LevelDB seems to jump out
>> this time, amongst others.  Let me know if you need more info.
>>
>>
>>     Regards,
>>
>>       Oliver
>>
>> On ma, 2013-08-19 at 15:21 -0500, Mark Nelson wrote:
>>> Hi Oliver,
>>>
>>> Glad that helped!  How much more efficient do the cuttlefish OSDs seem
>>> at this point (with wip-dumpling-pglog-undirty)?  On modern Intel
>>> platforms we were actually hoping to see CPU usage go down in many cases
>>> due to the use of hardware CRC32 instructions.
>>>
>>> Mark
>>>
>>> On 08/19/2013 03:06 PM, Oliver Daudey wrote:
>>>> Hey Samuel,
>>>>
>>>> Thanks!  I installed your version, repeated the same tests on my
>>>> test-cluster and the extra CPU-loading seems to have disappeared.  Then
>>>> I replaced one osd of my production-cluster with your modified version
>>>> and it's config-option and it seems to be a lot less CPU-hungry now.
>>>> Although the Cuttlefish-osds still seem to be even more CPU-efficient,
>>>> your changes have definitely helped a lot.  We seem to be looking in
>>>> the
>>>> right direction, at least for this part of the problem.
>>>>
>>>> BTW, I ran `perf top' on the production-node with your modified osd and
>>>> didn't see anything osd-related stand out on top.  "PGLog::undirty()"
>>>> was in there, but with much lower usage, right at the bottom of the
>>>> green part of the output.
>>>>
>>>> Many thanks for your help so far!
>>>>
>>>>
>>>>      Regards,
>>>>
>>>>        Oliver
>>>>
>>>> On ma, 2013-08-19 at 00:29 -0700, Samuel Just wrote:
>>>>> You're right, PGLog::undirty() looks suspicious.  I just pushed a
>>>>> branch wip-dumpling-pglog-undirty with a new config
>>>>> (osd_debug_pg_log_writeout) which if set to false will disable some
>>>>> strictly debugging checks which occur in PGLog::undirty().  We haven't
>>>>> actually seen these checks causing excessive cpu usage, so this may be
>>>>> a red herring.
>>>>> -Sam
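For reference, disabling those debug checks amounts to a one-line entry
in the [osd] section of ceph.conf on the OSD nodes.  This is a sketch:
the option only exists in the wip-dumpling-pglog-undirty branch Samuel
mentions, not in stock Dumpling.

```ini
; Only meaningful with the wip-dumpling-pglog-undirty branch,
; where osd_debug_pg_log_writeout was introduced.
[osd]
    osd debug pg log writeout = false
```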
>>>>>
>>>>> On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey <oliver@xxxxxxxxx>
>>>>> wrote:
>>>>>> Hey Mark,
>>>>>>
>>>>>> On za, 2013-08-17 at 08:16 -0500, Mark Nelson wrote:
>>>>>>> On 08/17/2013 06:13 AM, Oliver Daudey wrote:
>>>>>>>> Hey all,
>>>>>>>>
>>>>>>>> This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040) I
>>>>>>>> created in the tracker.  Thought I would pass it through the
>>>>>>>> list as
>>>>>>>> well, to get an idea if anyone else is running into it.  It may
>>>>>>>> only
>>>>>>>> show under higher loads.  More info about my setup is in the
>>>>>>>> bug-report
>>>>>>>> above.  Here goes:
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm running a Ceph-cluster with 3 nodes, each of which runs a
>>>>>>>> mon, osd
>>>>>>>> and mds. I'm using RBD on this cluster as storage for KVM,
>>>>>>>> CephFS is
>>>>>>>> unused at this time. While still on v0.61.7 Cuttlefish, I got
>>>>>>>> 70-100
>>>>>>>> +MB/sec on simple linear writes to a file with `dd' inside a VM
>>>>>>>> on this
>>>>>>>> cluster under regular load and the osds usually averaged 20-100%
>>>>>>>> CPU-utilisation in `top'. After the upgrade to Dumpling,
>>>>>>>> CPU-usage for
>>>>>>>> the osds shot up to 100% to 400% in `top' (multi-core system)
>>>>>>>> and the
>>>>>>>> speed for my writes with `dd' inside a VM dropped to
>>>>>>>> 20-40MB/sec. Users
>>>>>>>> complained that disk-access inside the VMs was significantly
>>>>>>>> slower and
>>>>>>>> the backups of the RBD-store I was running, also got behind
>>>>>>>> quickly.
>>>>>>>>
>>>>>>>> After downgrading only the osds to v0.61.7 Cuttlefish and
>>>>>>>> leaving the
>>>>>>>> rest at 0.67 Dumpling, speed and load returned to normal. I have
>>>>>>>> repeated this performance-hit upon upgrade on a similar
>>>>>>>> test-cluster
>>>>>>>> under no additional load at all. Although CPU-usage for the osds
>>>>>>>> wasn't
>>>>>>>> as dramatic during these tests because there was no base-load
>>>>>>>> from other
>>>>>>>> VMs, I/O-performance dropped significantly after upgrading
>>>>>>>> during these
>>>>>>>> tests as well, and returned to normal after downgrading the osds.
>>>>>>>>
>>>>>>>> I'm not sure what to make of it. There are no visible errors in
>>>>>>>> the logs
>>>>>>>> and everything runs and reports good health, it's just a lot
>>>>>>>> slower,
>>>>>>>> with a lot more CPU-usage.
>>>>>>>
>>>>>>> Hi Oliver,
>>>>>>>
>>>>>>> If you have access to the perf command on this system, could you try
>>>>>>> running:
>>>>>>>
>>>>>>> "sudo perf top"
>>>>>>>
>>>>>>> And if that doesn't give you much,
>>>>>>>
>>>>>>> "sudo perf record -g"
>>>>>>>
>>>>>>> then:
>>>>>>>
>>>>>>> "sudo perf report | less"
>>>>>>>
>>>>>>> during the period of high CPU usage?  This will give you a call
>>>>>>> graph.
>>>>>>> There may be symbols missing, but it might help track down what
>>>>>>> the OSDs
>>>>>>> are doing.
>>>>>>
>>>>>> Thanks for your help!  I did a couple of runs on my test-cluster,
>>>>>> loading it with writes from 3 VMs concurrently and measuring the
>>>>>> results
>>>>>> at the first node with all 0.67 Dumpling-components and with the osds
>>>>>> replaced by 0.61.7 Cuttlefish.  I let `perf top' run and settle for a
>>>>>> while, then copied anything that showed in red and green into this
>>>>>> post.
>>>>>> Here are the results (sorry for the word-wraps):
>>>>>>
>>>>>> First, with 0.61.7 osds:
>>>>>>
>>>>>>    19.91%  [kernel]                    [k] intel_idle
>>>>>>    10.18%  [kernel]                    [k] _raw_spin_lock_irqsave
>>>>>>     6.79%  ceph-osd                    [.] ceph_crc32c_le
>>>>>>     4.93%  [kernel]                    [k]
>>>>>> default_send_IPI_mask_sequence_phys
>>>>>>     2.71%  [kernel]                    [k] copy_user_generic_string
>>>>>>     1.42%  libc-2.11.3.so              [.] memcpy
>>>>>>     1.23%  [kernel]                    [k] find_busiest_group
>>>>>>     1.13%  librados.so.2.0.0           [.] ceph_crc32c_le_intel
>>>>>>     1.11%  [kernel]                    [k] _raw_spin_lock
>>>>>>     0.99%  kvm                         [.] 0x1931f8
>>>>>>     0.92%  [igb]                       [k] igb_poll
>>>>>>     0.87%  [kernel]                    [k] native_write_cr0
>>>>>>     0.80%  [kernel]                    [k] csum_partial
>>>>>>     0.78%  [kernel]                    [k] __do_softirq
>>>>>>     0.63%  [kernel]                    [k] hpet_legacy_next_event
>>>>>>     0.53%  [ip_tables]                 [k] ipt_do_table
>>>>>>     0.50%  libc-2.11.3.so              [.] 0x74433
>>>>>>
>>>>>> Second test, with 0.67 osds:
>>>>>>
>>>>>>    18.32%  [kernel]                      [k] intel_idle
>>>>>>     7.58%  [kernel]                      [k] _raw_spin_lock_irqsave
>>>>>>     7.04%  ceph-osd                      [.] PGLog::undirty()
>>>>>>     4.39%  ceph-osd                      [.] ceph_crc32c_le_intel
>>>>>>     3.92%  [kernel]                      [k]
>>>>>> default_send_IPI_mask_sequence_phys
>>>>>>     2.25%  [kernel]                      [k] copy_user_generic_string
>>>>>>     1.76%  libc-2.11.3.so                [.] memcpy
>>>>>>     1.56%  librados.so.2.0.0             [.] ceph_crc32c_le_intel
>>>>>>     1.40%  libc-2.11.3.so                [.] vfprintf
>>>>>>     1.12%  libc-2.11.3.so                [.] 0x7217b
>>>>>>     1.05%  [kernel]                      [k] _raw_spin_lock
>>>>>>     1.01%  [kernel]                      [k] find_busiest_group
>>>>>>     0.83%  kvm                           [.] 0x193ab8
>>>>>>     0.80%  [kernel]                      [k] native_write_cr0
>>>>>>     0.76%  [kernel]                      [k] __do_softirq
>>>>>>     0.73%  libc-2.11.3.so                [.] _IO_default_xsputn
>>>>>>     0.70%  [kernel]                      [k] csum_partial
>>>>>>     0.68%  [igb]                         [k] igb_poll
>>>>>>     0.58%  [kernel]                      [k] hpet_legacy_next_event
>>>>>>     0.54%  [kernel]                      [k] __schedule
>>>>>>
>>>>>>
>>>>>> What jumps right out, is the "PGLog::undirty()", which doesn't
>>>>>> show up
>>>>>> with 0.61.7 at all, but is an extra drag right at top-usage in 0.67.
>>>>>> Note that I didn't manage to fully load the test-cluster CPU-wise,
>>>>>> because of network-constraints and I don't want to take any extra
>>>>>> risks
>>>>>> on the production-cluster and test it there, but it seems we found a
>>>>>> possible culprit.
>>>>>>
>>>>>> Any ideas?  Thanks again!
>>>>>>
>>>>>>
>>>>>>
>>>>>>      Regards,
>>>>>>
>>>>>>         Oliver
>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
> 
> 




