Re: Significant slowdown of osds since v0.67 Dumpling

Samuel Just <sam.just@xxxxxxxxxxx> · Tue, 20 Aug 2013 15:38:32 -0700



Good to hear.
-Sam

On Tue, Aug 20, 2013 at 2:55 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
> Hey Samuel,
>
> I picked up 0.67.1-10-g47c8949 from the GIT-builder and the osd from
> that seems to work great so far.  I'll have to let it run for a while
> longer to really be sure it fixed the problem, but it looks promising,
> not taking any more CPU than the Cuttlefish-osds.  Thanks!  I'll get
> back to you.
>
>
>    Regards,
>
>       Oliver
>
> On di, 2013-08-20 at 10:40 -0700, Samuel Just wrote:
>> Can you try dumpling head without the option?
>> -Sam
>>
>> On Tue, Aug 20, 2013 at 1:44 AM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
>> > Hey Mark,
>> >
>> > Sorry, but after some more tests I have to report that it only worked
>> > partly.  The load seems lower with "wip-dumpling-pglog-undirty" in
>> > place, but the Cuttlefish-osd still seems significantly faster and even
>> > with "wip-dumpling-pglog-undirty" in place, things slow down way too
>> > much under load.  Unfortunately, only my production-cluster seems busy
>> > enough to actually show the problem clearly by slowing down.  Below is
>> > `perf top'-output, fresh from my production-cluster under it's regular
>> > load:
>> >
>> > First, the 0.67.1-6-g0c4f2f3 osd with "osd debug pg log writeout =
>> > false":
>> >  16.53%  [kernel]                     [k]
>> > intel_idle
>> >   6.47%  libleveldb.so.1.9            [.]
>> > 0x380a1
>> >   5.76%  [kernel]                     [k]
>> > find_busiest_group
>> >   4.11%  libc-2.11.3.so               [.]
>> > memcmp
>> >   3.95%  kvm                          [.]
>> > 0x1f6f31
>> >   2.05%  [kernel]                     [k]
>> > default_send_IPI_mask_sequence_phys
>> >   2.03%  [kernel]                     [k]
>> > _raw_spin_lock
>> >   1.87%  libleveldb.so.1.9            [.]
>> > leveldb::InternalKeyComparator::Compar
>> >   1.57%  libc-2.11.3.so               [.]
>> > memcpy
>> >   1.37%  libleveldb.so.1.9            [.]
>> > leveldb::Block::Iter::Next()
>> >   1.26%  [kernel]                     [k]
>> > hrtimer_interrupt
>> >   1.12%  [kernel]                     [k]
>> > __hrtimer_start_range_ns
>> >   1.09%  [kernel]                     [k]
>> > native_write_cr0
>> >   1.05%  libstdc++.so.6.0.13          [.]
>> > std::string::_M_mutate(unsigned long,
>> >   1.00%  [kernel]                     [k]
>> > native_write_msr_safe
>> >   0.99%  [kernel]                     [k]
>> > apic_timer_interrupt
>> >   0.98%  [kernel]                     [k]
>> > clockevents_program_event
>> >   0.96%  [kernel]                     [k]
>> > _raw_spin_unlock_irqrestore
>> >   0.95%  ceph-osd                     [.]
>> > PGLog::undirty()
>> >   0.92%  [kernel]                     [k]
>> > find_next_bit
>> >   0.91%  libsnappy.so.1.1.2           [.]
>> > snappy::RawUncompress(snappy::Source*,
>> >   0.88%  [kernel]                     [k]
>> > __schedule
>> >   0.87%  [kernel]                     [k]
>> > cpumask_next_and
>> >   0.84%  [kernel]                     [k]
>> > do_select
>> >   0.80%  [kernel]                     [k]
>> > fget_light
>> >   0.77%  [kernel]                     [k]
>> > reschedule_interrupt
>> >   0.75%  [kernel]                     [k]
>> > _raw_spin_lock_irqsave
>> >   0.62%  libstdc++.so.6.0.13          [.] std::string::append(char
>> > const*, unsig
>> >   0.59%  [kvm_intel]                  [k]
>> > vmx_vcpu_run
>> >   0.58%  [kernel]                     [k]
>> > copy_user_generic_string
>> >   0.56%  [kernel]                     [k]
>> > load_balance
>> >   0.54%  [kernel]                     [k]
>> > tg_load_down
>> >   0.53%  libpthread-2.11.3.so         [.]
>> > pthread_mutex_lock
>> >   0.52%  [kernel]                     [k] sync_inodes_sb
>> >
>> > Second, the 0.61.8 osd, under identical load:
>> >  21.51%  [kernel]                     [k]
>> > intel_idle
>> >   6.66%  [kernel]                     [k]
>> > find_busiest_group
>> >   6.25%  kvm                          [.]
>> > 0x2d214b
>> >   1.97%  [kernel]                     [k]
>> > _raw_spin_lock
>> >   1.47%  [kernel]                     [k]
>> > native_write_msr_safe
>> >   1.44%  [kernel]                     [k]
>> > hrtimer_interrupt
>> >   1.37%  [kernel]                     [k]
>> > __hrtimer_start_range_ns
>> >   1.34%  [kernel]                     [k]
>> > do_select
>> >   1.29%  [kernel]                     [k]
>> > fget_light
>> >   1.24%  [kernel]                     [k]
>> > clockevents_program_event
>> >   1.21%  [kernel]                     [k]
>> > default_send_IPI_mask_sequence_phys
>> >   1.18%  [kernel]                     [k]
>> > cpumask_next_and
>> >   1.18%  [kernel]                     [k]
>> > _raw_spin_unlock_irqrestore
>> >   1.15%  [kernel]                     [k]
>> > find_next_bit
>> >   1.14%  [kernel]                     [k]
>> > __schedule
>> >   1.11%  [kernel]                     [k]
>> > _raw_spin_lock_irqsave
>> >   0.98%  [kernel]                     [k]
>> > apic_timer_interrupt
>> >   0.86%  [kernel]                     [k]
>> > copy_user_generic_string
>> >   0.83%  [kernel]                     [k]
>> > native_write_cr0
>> >   0.76%  [kernel]                     [k]
>> > sync_inodes_sb
>> >   0.71%  [kernel]                     [k]
>> > rcu_needs_cpu
>> >   0.69%  libpthread-2.11.3.so         [.]
>> > pthread_mutex_lock
>> >   0.66%  [kernel]                     [k]
>> > fput
>> >   0.62%  [kernel]                     [k]
>> > load_balance
>> >   0.57%  [vdso]                       [.]
>> > 0x7fff3a976700
>> >   0.56%  libc-2.11.3.so               [.]
>> > memcpy
>> >   0.56%  [kernel]                     [k]
>> > reschedule_interrupt
>> >   0.56%  [kernel]                     [k]
>> > tg_load_down
>> >   0.50%  [kernel]                     [k] iput
>> >
>> > I see lots of new differences, but again don't know what to make of it
>> > and what might be related or significant.  LevelDB seems to jump out
>> > this time, amongst others.  Let me know if you need more info.
>> >
>> >
>> >    Regards,
>> >
>> >      Oliver
>> >
>> > On ma, 2013-08-19 at 15:21 -0500, Mark Nelson wrote:
>> >> Hi Oliver,
>> >>
>> >> Glad that helped!  How much more efficient do the cuttlefish OSDs seem
>> >> at this point (with wip-dumpling-pglog-undirty)?  On modern Intel
>> >> platforms we were actually hoping to see CPU usage go down in many cases
>> >> due to the use of hardware CRC32 instructions.
>> >>
>> >> Mark
>> >>
>> >> On 08/19/2013 03:06 PM, Oliver Daudey wrote:
>> >> > Hey Samuel,
>> >> >
>> >> > Thanks!  I installed your version, repeated the same tests on my
>> >> > test-cluster and the extra CPU-loading seems to have disappeared.  Then
>> >> > I replaced one osd of my production-cluster with your modified version
>> >> > and it's config-option and it seems to be a lot less CPU-hungry now.
>> >> > Although the Cuttlefish-osds still seem to be even more CPU-efficient,
>> >> > your changes have definitely helped a lot.  We seem to be looking in the
>> >> > right direction, at least for this part of the problem.
>> >> >
>> >> > BTW, I ran `perf top' on the production-node with your modified osd and
>> >> > didn't see anything osd-related stand out on top.  "PGLog::undirty()"
>> >> > was in there, but with much lower usage, right at the bottom of the
>> >> > green part of the output.
>> >> >
>> >> > Many thanks for your help so far!
>> >> >
>> >> >
>> >> >     Regards,
>> >> >
>> >> >       Oliver
>> >> >
>> >> > On ma, 2013-08-19 at 00:29 -0700, Samuel Just wrote:
>> >> >> You're right, PGLog::undirty() looks suspicious.  I just pushed a
>> >> >> branch wip-dumpling-pglog-undirty with a new config
>> >> >> (osd_debug_pg_log_writeout) which if set to false will disable some
>> >> >> strictly debugging checks which occur in PGLog::undirty().  We haven't
>> >> >> actually seen these checks causing excessive cpu usage, so this may be
>> >> >> a red herring.
>> >> >> -Sam
>> >> >>
>> >> >> On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
>> >> >>> Hey Mark,
>> >> >>>
>> >> >>> On za, 2013-08-17 at 08:16 -0500, Mark Nelson wrote:
>> >> >>>> On 08/17/2013 06:13 AM, Oliver Daudey wrote:
>> >> >>>>> Hey all,
>> >> >>>>>
>> >> >>>>> This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040) I
>> >> >>>>> created in the tracker.  Thought I would pass it through the list as
>> >> >>>>> well, to get an idea if anyone else is running into it.  It may only
>> >> >>>>> show under higher loads.  More info about my setup is in the bug-report
>> >> >>>>> above.  Here goes:
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> I'm running a Ceph-cluster with 3 nodes, each of which runs a mon, osd
>> >> >>>>> and mds. I'm using RBD on this cluster as storage for KVM, CephFS is
>> >> >>>>> unused at this time. While still on v0.61.7 Cuttlefish, I got 70-100
>> >> >>>>> +MB/sec on simple linear writes to a file with `dd' inside a VM on this
>> >> >>>>> cluster under regular load and the osds usually averaged 20-100%
>> >> >>>>> CPU-utilisation in `top'. After the upgrade to Dumpling, CPU-usage for
>> >> >>>>> the osds shot up to 100% to 400% in `top' (multi-core system) and the
>> >> >>>>> speed for my writes with `dd' inside a VM dropped to 20-40MB/sec. Users
>> >> >>>>> complained that disk-access inside the VMs was significantly slower and
>> >> >>>>> the backups of the RBD-store I was running, also got behind quickly.
>> >> >>>>>
>> >> >>>>> After downgrading only the osds to v0.61.7 Cuttlefish and leaving the
>> >> >>>>> rest at 0.67 Dumpling, speed and load returned to normal. I have
>> >> >>>>> repeated this performance-hit upon upgrade on a similar test-cluster
>> >> >>>>> under no additional load at all. Although CPU-usage for the osds wasn't
>> >> >>>>> as dramatic during these tests because there was no base-load from other
>> >> >>>>> VMs, I/O-performance dropped significantly after upgrading during these
>> >> >>>>> tests as well, and returned to normal after downgrading the osds.
>> >> >>>>>
>> >> >>>>> I'm not sure what to make of it. There are no visible errors in the logs
>> >> >>>>> and everything runs and reports good health, it's just a lot slower,
>> >> >>>>> with a lot more CPU-usage.
>> >> >>>>
>> >> >>>> Hi Oliver,
>> >> >>>>
>> >> >>>> If you have access to the perf command on this system, could you try
>> >> >>>> running:
>> >> >>>>
>> >> >>>> "sudo perf top"
>> >> >>>>
>> >> >>>> And if that doesn't give you much,
>> >> >>>>
>> >> >>>> "sudo perf record -g"
>> >> >>>>
>> >> >>>> then:
>> >> >>>>
>> >> >>>> "sudo perf report | less"
>> >> >>>>
>> >> >>>> during the period of high CPU usage?  This will give you a call graph.
>> >> >>>> There may be symbols missing, but it might help track down what the OSDs
>> >> >>>> are doing.
>> >> >>>
>> >> >>> Thanks for your help!  I did a couple of runs on my test-cluster,
>> >> >>> loading it with writes from 3 VMs concurrently and measuring the results
>> >> >>> at the first node with all 0.67 Dumpling-components and with the osds
>> >> >>> replaced by 0.61.7 Cuttlefish.  I let `perf top' run and settle for a
>> >> >>> while, then copied anything that showed in red and green into this post.
>> >> >>> Here are the results (sorry for the word-wraps):
>> >> >>>
>> >> >>> First, with 0.61.7 osds:
>> >> >>>
>> >> >>>   19.91%  [kernel]                    [k] intel_idle
>> >> >>>   10.18%  [kernel]                    [k] _raw_spin_lock_irqsave
>> >> >>>    6.79%  ceph-osd                    [.] ceph_crc32c_le
>> >> >>>    4.93%  [kernel]                    [k]
>> >> >>> default_send_IPI_mask_sequence_phys
>> >> >>>    2.71%  [kernel]                    [k] copy_user_generic_string
>> >> >>>    1.42%  libc-2.11.3.so              [.] memcpy
>> >> >>>    1.23%  [kernel]                    [k] find_busiest_group
>> >> >>>    1.13%  librados.so.2.0.0           [.] ceph_crc32c_le_intel
>> >> >>>    1.11%  [kernel]                    [k] _raw_spin_lock
>> >> >>>    0.99%  kvm                         [.] 0x1931f8
>> >> >>>    0.92%  [igb]                       [k] igb_poll
>> >> >>>    0.87%  [kernel]                    [k] native_write_cr0
>> >> >>>    0.80%  [kernel]                    [k] csum_partial
>> >> >>>    0.78%  [kernel]                    [k] __do_softirq
>> >> >>>    0.63%  [kernel]                    [k] hpet_legacy_next_event
>> >> >>>    0.53%  [ip_tables]                 [k] ipt_do_table
>> >> >>>    0.50%  libc-2.11.3.so              [.] 0x74433
>> >> >>>
>> >> >>> Second test, with 0.67 osds:
>> >> >>>
>> >> >>>   18.32%  [kernel]                      [k] intel_idle
>> >> >>>    7.58%  [kernel]                      [k] _raw_spin_lock_irqsave
>> >> >>>    7.04%  ceph-osd                      [.] PGLog::undirty()
>> >> >>>    4.39%  ceph-osd                      [.] ceph_crc32c_le_intel
>> >> >>>    3.92%  [kernel]                      [k]
>> >> >>> default_send_IPI_mask_sequence_phys
>> >> >>>    2.25%  [kernel]                      [k] copy_user_generic_string
>> >> >>>    1.76%  libc-2.11.3.so                [.] memcpy
>> >> >>>    1.56%  librados.so.2.0.0             [.] ceph_crc32c_le_intel
>> >> >>>    1.40%  libc-2.11.3.so                [.] vfprintf
>> >> >>>    1.12%  libc-2.11.3.so                [.] 0x7217b
>> >> >>>    1.05%  [kernel]                      [k] _raw_spin_lock
>> >> >>>    1.01%  [kernel]                      [k] find_busiest_group
>> >> >>>    0.83%  kvm                           [.] 0x193ab8
>> >> >>>    0.80%  [kernel]                      [k] native_write_cr0
>> >> >>>    0.76%  [kernel]                      [k] __do_softirq
>> >> >>>    0.73%  libc-2.11.3.so                [.] _IO_default_xsputn
>> >> >>>    0.70%  [kernel]                      [k] csum_partial
>> >> >>>    0.68%  [igb]                         [k] igb_poll
>> >> >>>    0.58%  [kernel]                      [k] hpet_legacy_next_event
>> >> >>>    0.54%  [kernel]                      [k] __schedule
>> >> >>>
>> >> >>>
>> >> >>> What jumps right out, is the "PGLog::undirty()", which doesn't show up
>> >> >>> with 0.61.7 at all, but is an extra drag right at top-usage in 0.67.
>> >> >>> Note that I didn't manage to fully load the test-cluster CPU-wise,
>> >> >>> because of network-constraints and I don't want to take any extra risks
>> >> >>> on the production-cluster and test it there, but it seems we found a
>> >> >>> possible culprit.
>> >> >>>
>> >> >>> Any ideas?  Thanks again!
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>     Regards,
>> >> >>>
>> >> >>>        Oliver
>> >> >>>
>> >> >>> _______________________________________________
>> >> >>> ceph-users mailing list
>> >> >>> ceph-users@xxxxxxxxxxxxxx
>> >> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >>
>> >> >
>> >> >
>> >> > _______________________________________________
>> >> > ceph-users mailing list
>> >> > ceph-users@xxxxxxxxxxxxxx
>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >
>> >>
>> >> _______________________________________________
>> >> ceph-users mailing list
>> >> ceph-users@xxxxxxxxxxxxxx
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> >
>> >
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@xxxxxxxxxxxxxx
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com