Hey Samuel,

CPU-usage still seems a bit higher, but not equally so on every OSD. I profiled the node whose OSD uses the most CPU. Note the libleveldb-related entries right at the top; the Cuttlefish-OSD doesn't show those at all. Could they be related to the problem?

OSD version 0.67.1-10-g47c8949 (47c89497b7f69cbf1557cd05b89837c388e2ba2f) on my production-cluster:

  15.11%  [kernel]              [k] intel_idle
   8.61%  libleveldb.so.1.9     [.] 0x3b1b5
   7.47%  libc-2.11.3.so        [.] memcmp
   5.47%  [kernel]              [k] find_busiest_group
   3.97%  kvm                   [.] 0x1fb2d0
   2.90%  libleveldb.so.1.9     [.] leveldb::InternalKeyComparator::Compar
   1.72%  libc-2.11.3.so        [.] memcpy
   1.56%  [kernel]              [k] _raw_spin_lock
   1.49%  libleveldb.so.1.9     [.] leveldb::Block::Iter::Next()
   1.48%  libsnappy.so.1.1.2    [.] snappy::RawUncompress(snappy::Source*,
   1.15%  libstdc++.so.6.0.13   [.] std::string::_M_mutate(unsigned long,
   1.00%  [kernel]              [k] default_send_IPI_mask_sequence_phys
   1.00%  [kernel]              [k] hrtimer_interrupt
   0.99%  [kernel]              [k] __hrtimer_start_range_ns
   0.92%  [kernel]              [k] cpumask_next_and
   0.91%  [kernel]              [k] __schedule
   0.90%  [kernel]              [k] find_next_bit
   0.88%  [kernel]              [k] do_select
   0.86%  [kernel]              [k] _raw_spin_unlock_irqrestore
   0.84%  [kernel]              [k] clockevents_program_event
   0.83%  [kernel]              [k] fget_light
   0.76%  libstdc++.so.6.0.13   [.] std::string::append(char const*, unsig
   0.76%  [kernel]              [k] _raw_spin_lock_irqsave
   0.71%  [kernel]              [k] apic_timer_interrupt
   0.69%  [kernel]              [k] rcu_needs_cpu
   0.62%  [kernel]              [k] sync_inodes_sb
   0.60%  ceph-osd              [.] PGLog::undirty()
   0.58%  [kvm_intel]           [k] vmx_vcpu_run
   0.56%  [kernel]              [k] native_write_cr0
   0.54%  libpthread-2.11.3.so  [.] pthread_mutex_lock
   0.54%  [kernel]              [k] native_write_msr_safe
   0.51%  [kernel]              [k] load_balance


Regards,

  Oliver

On Wed, 2013-08-21 at 12:05 -0700, Samuel Just wrote:
> There haven't been any significant osd side changes that I can think of. Is cpu usage still high? If so, can you post the profiler results again?
> -Sam
>
> On Wed, Aug 21, 2013 at 12:02 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
> > Hey Samuel,
> >
> > I had a good run on the production-cluster with it and unfortunately, it still doesn't seem to have solved the problem. It seemed OK for a while and individual OSD CPU-usage seemed quite low, but as the cluster's load increased during the day, things got slower again. Write-performance within a VM crawled to 30MB/sec and at some point, I got only 10MB/sec on reads in that same VM. I also did RADOS bench-tests with `rados --pool rbd bench 120 write' and those got several hundred MB/sec on the same cluster at the same time of day, so maybe the problem is RBD-related. Is there any code in the OSD that could influence RBD-performance alone? Do you know of any other significant changes to the OSD between Cuttlefish and Dumpling that could result in this?
> >
> > PS: I also did the same RADOS bench-tests on my test-cluster, both with Cuttlefish and Dumpling without your fix, and got almost identical results. This supports Mark's suggestion that the problem might be in RBD.
> >
> >
> > Regards,
> >
> >   Oliver
> >
> > On 20-08-13 19:40, Samuel Just wrote:
> >> Can you try dumpling head without the option?
> >> -Sam
> >>
> >> On Tue, Aug 20, 2013 at 1:44 AM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
> >>> Hey Mark,
> >>>
> >>> Sorry, but after some more tests I have to report that it only worked partly.
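[A minimal sketch of the comparison Oliver describes above, for reference: the pool name "rbd" and the 120-second duration come from his message; the in-VM target file, block size and count are placeholders.]

  # RADOS-level write benchmark, run directly against the cluster:
  rados --pool rbd bench 120 write

  # Rough RBD-level comparison: a simple linear write from inside a guest VM:
  dd if=/dev/zero of=/root/ddtest.bin bs=1M count=4096 conv=fsync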
> >>> The load seems lower with "wip-dumpling-pglog-undirty" in place, but the Cuttlefish-osd still seems significantly faster, and even with "wip-dumpling-pglog-undirty" in place, things slow down way too much under load. Unfortunately, only my production-cluster seems busy enough to actually show the problem clearly by slowing down. Below is `perf top'-output, fresh from my production-cluster under its regular load:
> >>>
> >>> First, the 0.67.1-6-g0c4f2f3 osd with "osd debug pg log writeout = false":
> >>>
> >>>   16.53%  [kernel]              [k] intel_idle
> >>>    6.47%  libleveldb.so.1.9     [.] 0x380a1
> >>>    5.76%  [kernel]              [k] find_busiest_group
> >>>    4.11%  libc-2.11.3.so        [.] memcmp
> >>>    3.95%  kvm                   [.] 0x1f6f31
> >>>    2.05%  [kernel]              [k] default_send_IPI_mask_sequence_phys
> >>>    2.03%  [kernel]              [k] _raw_spin_lock
> >>>    1.87%  libleveldb.so.1.9     [.] leveldb::InternalKeyComparator::Compar
> >>>    1.57%  libc-2.11.3.so        [.] memcpy
> >>>    1.37%  libleveldb.so.1.9     [.] leveldb::Block::Iter::Next()
> >>>    1.26%  [kernel]              [k] hrtimer_interrupt
> >>>    1.12%  [kernel]              [k] __hrtimer_start_range_ns
> >>>    1.09%  [kernel]              [k] native_write_cr0
> >>>    1.05%  libstdc++.so.6.0.13   [.] std::string::_M_mutate(unsigned long,
> >>>    1.00%  [kernel]              [k] native_write_msr_safe
> >>>    0.99%  [kernel]              [k] apic_timer_interrupt
> >>>    0.98%  [kernel]              [k] clockevents_program_event
> >>>    0.96%  [kernel]              [k] _raw_spin_unlock_irqrestore
> >>>    0.95%  ceph-osd              [.] PGLog::undirty()
> >>>    0.92%  [kernel]              [k] find_next_bit
> >>>    0.91%  libsnappy.so.1.1.2    [.] snappy::RawUncompress(snappy::Source*,
> >>>    0.88%  [kernel]              [k] __schedule
> >>>    0.87%  [kernel]              [k] cpumask_next_and
> >>>    0.84%  [kernel]              [k] do_select
> >>>    0.80%  [kernel]              [k] fget_light
> >>>    0.77%  [kernel]              [k] reschedule_interrupt
> >>>    0.75%  [kernel]              [k] _raw_spin_lock_irqsave
> >>>    0.62%  libstdc++.so.6.0.13   [.] std::string::append(char const*, unsig
> >>>    0.59%  [kvm_intel]           [k] vmx_vcpu_run
> >>>    0.58%  [kernel]              [k] copy_user_generic_string
> >>>    0.56%  [kernel]              [k] load_balance
> >>>    0.54%  [kernel]              [k] tg_load_down
> >>>    0.53%  libpthread-2.11.3.so  [.] pthread_mutex_lock
> >>>    0.52%  [kernel]              [k] sync_inodes_sb
> >>>
> >>> Second, the 0.61.8 osd, under identical load:
> >>>
> >>>   21.51%  [kernel]              [k] intel_idle
> >>>    6.66%  [kernel]              [k] find_busiest_group
> >>>    6.25%  kvm                   [.] 0x2d214b
> >>>    1.97%  [kernel]              [k] _raw_spin_lock
> >>>    1.47%  [kernel]              [k] native_write_msr_safe
> >>>    1.44%  [kernel]              [k] hrtimer_interrupt
> >>>    1.37%  [kernel]              [k] __hrtimer_start_range_ns
> >>>    1.34%  [kernel]              [k] do_select
> >>>    1.29%  [kernel]              [k] fget_light
> >>>    1.24%  [kernel]              [k] clockevents_program_event
> >>>    1.21%  [kernel]              [k] default_send_IPI_mask_sequence_phys
> >>>    1.18%  [kernel]              [k] cpumask_next_and
> >>>    1.18%  [kernel]              [k] _raw_spin_unlock_irqrestore
> >>>    1.15%  [kernel]              [k] find_next_bit
> >>>    1.14%  [kernel]              [k] __schedule
> >>>    1.11%  [kernel]              [k] _raw_spin_lock_irqsave
> >>>    0.98%  [kernel]              [k] apic_timer_interrupt
> >>>    0.86%  [kernel]              [k] copy_user_generic_string
> >>>    0.83%  [kernel]              [k] native_write_cr0
> >>>    0.76%  [kernel]              [k] sync_inodes_sb
> >>>    0.71%  [kernel]              [k] rcu_needs_cpu
> >>>    0.69%  libpthread-2.11.3.so  [.] pthread_mutex_lock
> >>>    0.66%  [kernel]              [k] fput
> >>>    0.62%  [kernel]              [k] load_balance
> >>>    0.57%  [vdso]                [.] 0x7fff3a976700
> >>>    0.56%  libc-2.11.3.so        [.] memcpy
> >>>    0.56%  [kernel]              [k] reschedule_interrupt
> >>>    0.56%  [kernel]              [k] tg_load_down
> >>>    0.50%  [kernel]              [k] iput
> >>>
> >>> I see lots of new differences, but again don't know what to make of them or what might be related or significant. LevelDB seems to jump out this time, amongst others. Let me know if you need more info.
> >>>
> >>>
> >>> Regards,
> >>>
> >>>   Oliver
> >>>
> >>> On Mon, 2013-08-19 at 15:21 -0500, Mark Nelson wrote:
> >>>> Hi Oliver,
> >>>>
> >>>> Glad that helped! How much more efficient do the cuttlefish OSDs seem at this point (with wip-dumpling-pglog-undirty)? On modern Intel platforms we were actually hoping to see CPU usage go down in many cases due to the use of hardware CRC32 instructions.
> >>>>
> >>>> Mark
> >>>>
> >>>> On 08/19/2013 03:06 PM, Oliver Daudey wrote:
> >>>>> Hey Samuel,
> >>>>>
> >>>>> Thanks! I installed your version, repeated the same tests on my test-cluster and the extra CPU-loading seems to have disappeared. Then I replaced one osd of my production-cluster with your modified version and its config-option, and it seems to be a lot less CPU-hungry now. Although the Cuttlefish-osds still seem to be even more CPU-efficient, your changes have definitely helped a lot. We seem to be looking in the right direction, at least for this part of the problem.
> >>>>>
> >>>>> BTW, I ran `perf top' on the production-node with your modified osd and didn't see anything osd-related stand out on top. "PGLog::undirty()" was in there, but with much lower usage, right at the bottom of the green part of the output.
> >>>>>
> >>>>> Many thanks for your help so far!
> >>>>>
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>>   Oliver
> >>>>>
> >>>>> On Mon, 2013-08-19 at 00:29 -0700, Samuel Just wrote:
> >>>>>> You're right, PGLog::undirty() looks suspicious. I just pushed a branch wip-dumpling-pglog-undirty with a new config (osd_debug_pg_log_writeout) which, if set to false, will disable some strictly-debugging checks which occur in PGLog::undirty(). We haven't actually seen these checks causing excessive cpu usage, so this may be a red herring.
> >>>>>> -Sam
> >>>>>>
> >>>>>> On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
> >>>>>>> Hey Mark,
> >>>>>>>
> >>>>>>> On Sat, 2013-08-17 at 08:16 -0500, Mark Nelson wrote:
> >>>>>>>> On 08/17/2013 06:13 AM, Oliver Daudey wrote:
> >>>>>>>>> Hey all,
> >>>>>>>>>
> >>>>>>>>> This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040) I created in the tracker. Thought I would pass it through the list as well, to get an idea if anyone else is running into it. It may only show under higher loads. More info about my setup is in the bug-report above. Here goes:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I'm running a Ceph-cluster with 3 nodes, each of which runs a mon, osd and mds. I'm using RBD on this cluster as storage for KVM; CephFS is unused at this time. While still on v0.61.7 Cuttlefish, I got 70-100+MB/sec on simple linear writes to a file with `dd' inside a VM on this cluster under regular load, and the osds usually averaged 20-100% CPU-utilisation in `top'.
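[A sketch of how the osd_debug_pg_log_writeout option Sam describes above could be set on an OSD host; it only exists on builds that carry the wip-dumpling-pglog-undirty change, the [osd]-section placement is an assumption, and the OSD needs a restart to pick it up.]

  # /etc/ceph/ceph.conf on the OSD host
  [osd]
      osd debug pg log writeout = false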
> >>>>>>>>> After the upgrade to Dumpling, CPU-usage for the osds shot up to 100% to 400% in `top' (multi-core system) and the speed of my writes with `dd' inside a VM dropped to 20-40MB/sec. Users complained that disk-access inside the VMs was significantly slower, and the backups of the RBD-store I was running also got behind quickly.
> >>>>>>>>>
> >>>>>>>>> After downgrading only the osds to v0.61.7 Cuttlefish and leaving the rest at 0.67 Dumpling, speed and load returned to normal. I have repeated this performance-hit upon upgrade on a similar test-cluster under no additional load at all. Although CPU-usage for the osds wasn't as dramatic during these tests because there was no base-load from other VMs, I/O-performance dropped significantly after upgrading during these tests as well, and returned to normal after downgrading the osds.
> >>>>>>>>>
> >>>>>>>>> I'm not sure what to make of it. There are no visible errors in the logs and everything runs and reports good health; it's just a lot slower, with a lot more CPU-usage.
> >>>>>>>>
> >>>>>>>> Hi Oliver,
> >>>>>>>>
> >>>>>>>> If you have access to the perf command on this system, could you try running:
> >>>>>>>>
> >>>>>>>> "sudo perf top"
> >>>>>>>>
> >>>>>>>> And if that doesn't give you much,
> >>>>>>>>
> >>>>>>>> "sudo perf record -g"
> >>>>>>>>
> >>>>>>>> then:
> >>>>>>>>
> >>>>>>>> "sudo perf report | less"
> >>>>>>>>
> >>>>>>>> during the period of high CPU usage? This will give you a call graph. There may be symbols missing, but it might help track down what the OSDs are doing.
> >>>>>>>
> >>>>>>> Thanks for your help! I did a couple of runs on my test-cluster, loading it with writes from 3 VMs concurrently and measuring the results at the first node, with all 0.67 Dumpling-components and with the osds replaced by 0.61.7 Cuttlefish. I let `perf top' run and settle for a while, then copied anything that showed in red and green into this post. Here are the results (sorry for the word-wraps):
> >>>>>>>
> >>>>>>> First, with 0.61.7 osds:
> >>>>>>>
> >>>>>>>   19.91%  [kernel]             [k] intel_idle
> >>>>>>>   10.18%  [kernel]             [k] _raw_spin_lock_irqsave
> >>>>>>>    6.79%  ceph-osd             [.] ceph_crc32c_le
> >>>>>>>    4.93%  [kernel]             [k] default_send_IPI_mask_sequence_phys
> >>>>>>>    2.71%  [kernel]             [k] copy_user_generic_string
> >>>>>>>    1.42%  libc-2.11.3.so       [.] memcpy
> >>>>>>>    1.23%  [kernel]             [k] find_busiest_group
> >>>>>>>    1.13%  librados.so.2.0.0    [.] ceph_crc32c_le_intel
> >>>>>>>    1.11%  [kernel]             [k] _raw_spin_lock
> >>>>>>>    0.99%  kvm                  [.] 0x1931f8
> >>>>>>>    0.92%  [igb]                [k] igb_poll
> >>>>>>>    0.87%  [kernel]             [k] native_write_cr0
> >>>>>>>    0.80%  [kernel]             [k] csum_partial
> >>>>>>>    0.78%  [kernel]             [k] __do_softirq
> >>>>>>>    0.63%  [kernel]             [k] hpet_legacy_next_event
> >>>>>>>    0.53%  [ip_tables]          [k] ipt_do_table
> >>>>>>>    0.50%  libc-2.11.3.so       [.] 0x74433
> >>>>>>>
> >>>>>>> Second test, with 0.67 osds:
> >>>>>>>
> >>>>>>>   18.32%  [kernel]             [k] intel_idle
> >>>>>>>    7.58%  [kernel]             [k] _raw_spin_lock_irqsave
> >>>>>>>    7.04%  ceph-osd             [.] PGLog::undirty()
> >>>>>>>    4.39%  ceph-osd             [.] ceph_crc32c_le_intel
> >>>>>>>    3.92%  [kernel]             [k] default_send_IPI_mask_sequence_phys
> >>>>>>>    2.25%  [kernel]             [k] copy_user_generic_string
> >>>>>>>    1.76%  libc-2.11.3.so       [.] memcpy
> >>>>>>>    1.56%  librados.so.2.0.0    [.] ceph_crc32c_le_intel
> >>>>>>>    1.40%  libc-2.11.3.so       [.] vfprintf
> >>>>>>>    1.12%  libc-2.11.3.so       [.] 0x7217b
> >>>>>>>    1.05%  [kernel]             [k] _raw_spin_lock
> >>>>>>>    1.01%  [kernel]             [k] find_busiest_group
> >>>>>>>    0.83%  kvm                  [.] 0x193ab8
> >>>>>>>    0.80%  [kernel]             [k] native_write_cr0
> >>>>>>>    0.76%  [kernel]             [k] __do_softirq
> >>>>>>>    0.73%  libc-2.11.3.so       [.] _IO_default_xsputn
> >>>>>>>    0.70%  [kernel]             [k] csum_partial
> >>>>>>>    0.68%  [igb]                [k] igb_poll
> >>>>>>>    0.58%  [kernel]             [k] hpet_legacy_next_event
> >>>>>>>    0.54%  [kernel]             [k] __schedule
> >>>>>>>
> >>>>>>> What jumps right out is "PGLog::undirty()", which doesn't show up with 0.61.7 at all, but is an extra drag right at top-usage in 0.67. Note that I didn't manage to fully load the test-cluster CPU-wise because of network-constraints, and I don't want to take any extra risks by testing on the production-cluster, but it seems we found a possible culprit.
> >>>>>>>
> >>>>>>> Any ideas? Thanks again!
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>>
> >>>>>>>   Oliver
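[Mark's profiling suggestion from earlier in the thread, gathered into one runnable sketch; limiting the capture to the ceph-osd processes with -p/pgrep and the 60-second window are assumptions on top of his system-wide commands.]

  # Live, system-wide view while CPU usage is high:
  sudo perf top

  # Or record call graphs for the local ceph-osd processes and inspect later:
  sudo perf record -g -p "$(pgrep -d, ceph-osd)" -- sleep 60
  sudo perf report | less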