Re: Significant slowdown of osds since v0.67 Dumpling

Hi Oliver,

Have you ever tried using RADOS bench?

On one of the client nodes, you can do:

rados -p <pool> -b <size in bytes> bench 120 write -t <concurrent ops>

It would be useful to know if that is also slower without any of the RBD code involved. Btw, what was the dd command line you were using for testing?
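
For example, a filled-in invocation might look like this (the "rbd" pool,
the 4MB object size and the 16 concurrent ops are only illustrative
values, not ones Mark specified):

rados -p rbd -b 4194304 bench 120 write -t 16

Running the same invocation against both the Cuttlefish and the Dumpling
osds would show whether the slowdown reproduces below the RBD layer.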

Thanks,
Mark


On 08/20/2013 03:44 AM, Oliver Daudey wrote:
Hey Mark,

Sorry, but after some more tests I have to report that it only worked
partly.  The load seems lower with "wip-dumpling-pglog-undirty" in
place, but the Cuttlefish-osd still seems significantly faster, and even
with that branch running, things slow down way too much under load.
Unfortunately, only my production-cluster seems busy enough to actually
show the problem clearly by slowing down.  Below is `perf top'-output,
fresh from my production-cluster under its regular load:

First, the 0.67.1-6-g0c4f2f3 osd with "osd debug pg log writeout =
false":
  16.53%  [kernel]              [k] intel_idle
   6.47%  libleveldb.so.1.9     [.] 0x380a1
   5.76%  [kernel]              [k] find_busiest_group
   4.11%  libc-2.11.3.so        [.] memcmp
   3.95%  kvm                   [.] 0x1f6f31
   2.05%  [kernel]              [k] default_send_IPI_mask_sequence_phys
   2.03%  [kernel]              [k] _raw_spin_lock
   1.87%  libleveldb.so.1.9     [.] leveldb::InternalKeyComparator::Compar
   1.57%  libc-2.11.3.so        [.] memcpy
   1.37%  libleveldb.so.1.9     [.] leveldb::Block::Iter::Next()
   1.26%  [kernel]              [k] hrtimer_interrupt
   1.12%  [kernel]              [k] __hrtimer_start_range_ns
   1.09%  [kernel]              [k] native_write_cr0
   1.05%  libstdc++.so.6.0.13   [.] std::string::_M_mutate(unsigned long,
   1.00%  [kernel]              [k] native_write_msr_safe
   0.99%  [kernel]              [k] apic_timer_interrupt
   0.98%  [kernel]              [k] clockevents_program_event
   0.96%  [kernel]              [k] _raw_spin_unlock_irqrestore
   0.95%  ceph-osd              [.] PGLog::undirty()
   0.92%  [kernel]              [k] find_next_bit
   0.91%  libsnappy.so.1.1.2    [.] snappy::RawUncompress(snappy::Source*,
   0.88%  [kernel]              [k] __schedule
   0.87%  [kernel]              [k] cpumask_next_and
   0.84%  [kernel]              [k] do_select
   0.80%  [kernel]              [k] fget_light
   0.77%  [kernel]              [k] reschedule_interrupt
   0.75%  [kernel]              [k] _raw_spin_lock_irqsave
   0.62%  libstdc++.so.6.0.13   [.] std::string::append(char const*, unsig
   0.59%  [kvm_intel]           [k] vmx_vcpu_run
   0.58%  [kernel]              [k] copy_user_generic_string
   0.56%  [kernel]              [k] load_balance
   0.54%  [kernel]              [k] tg_load_down
   0.53%  libpthread-2.11.3.so  [.] pthread_mutex_lock
   0.52%  [kernel]              [k] sync_inodes_sb

Second, the 0.61.8 osd, under identical load:
  21.51%  [kernel]              [k] intel_idle
   6.66%  [kernel]              [k] find_busiest_group
   6.25%  kvm                   [.] 0x2d214b
   1.97%  [kernel]              [k] _raw_spin_lock
   1.47%  [kernel]              [k] native_write_msr_safe
   1.44%  [kernel]              [k] hrtimer_interrupt
   1.37%  [kernel]              [k] __hrtimer_start_range_ns
   1.34%  [kernel]              [k] do_select
   1.29%  [kernel]              [k] fget_light
   1.24%  [kernel]              [k] clockevents_program_event
   1.21%  [kernel]              [k] default_send_IPI_mask_sequence_phys
   1.18%  [kernel]              [k] cpumask_next_and
   1.18%  [kernel]              [k] _raw_spin_unlock_irqrestore
   1.15%  [kernel]              [k] find_next_bit
   1.14%  [kernel]              [k] __schedule
   1.11%  [kernel]              [k] _raw_spin_lock_irqsave
   0.98%  [kernel]              [k] apic_timer_interrupt
   0.86%  [kernel]              [k] copy_user_generic_string
   0.83%  [kernel]              [k] native_write_cr0
   0.76%  [kernel]              [k] sync_inodes_sb
   0.71%  [kernel]              [k] rcu_needs_cpu
   0.69%  libpthread-2.11.3.so  [.] pthread_mutex_lock
   0.66%  [kernel]              [k] fput
   0.62%  [kernel]              [k] load_balance
   0.57%  [vdso]                [.] 0x7fff3a976700
   0.56%  libc-2.11.3.so        [.] memcpy
   0.56%  [kernel]              [k] reschedule_interrupt
   0.56%  [kernel]              [k] tg_load_down
   0.50%  [kernel]              [k] iput

I see lots of new differences, but again I don't know what to make of
them or which might be related or significant.  LevelDB seems to jump
out this time, among others.  Let me know if you need more info.


    Regards,

      Oliver

On Mon, 2013-08-19 at 15:21 -0500, Mark Nelson wrote:
Hi Oliver,

Glad that helped!  How much more efficient do the cuttlefish OSDs seem
at this point (with wip-dumpling-pglog-undirty)?  On modern Intel
platforms we were actually hoping to see CPU usage go down in many cases
due to the use of hardware CRC32 instructions.
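
(The hardware CRC32c instructions are part of SSE4.2 on Intel CPUs, so a
quick, purely illustrative check on an OSD node would be:

grep -m1 -o sse4_2 /proc/cpuinfo

If that prints nothing, the CPU lacks SSE4.2 and the CRC32 work falls
back to a software path.)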

Mark

On 08/19/2013 03:06 PM, Oliver Daudey wrote:
Hey Samuel,

Thanks!  I installed your version, repeated the same tests on my
test-cluster, and the extra CPU-loading seems to have disappeared.  Then
I replaced one osd of my production-cluster with your modified version
and its config option, and it seems to be a lot less CPU-hungry now.
Although the Cuttlefish-osds still seem to be even more CPU-efficient,
your changes have definitely helped a lot.  We seem to be looking in the
right direction, at least for this part of the problem.

BTW, I ran `perf top' on the production-node with your modified osd and
didn't see anything osd-related stand out on top.  "PGLog::undirty()"
was in there, but with much lower usage, right at the bottom of the
green part of the output.

Many thanks for your help so far!


     Regards,

       Oliver

On Mon, 2013-08-19 at 00:29 -0700, Samuel Just wrote:
You're right, PGLog::undirty() looks suspicious.  I just pushed a
branch wip-dumpling-pglog-undirty with a new config
(osd_debug_pg_log_writeout) which, if set to false, will disable some
debugging-only checks that occur in PGLog::undirty().  We haven't
actually seen these checks causing excessive CPU usage, so this may be
a red herring.
-Sam
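
(For anyone wanting to try the branch: the option goes into ceph.conf,
after which the osds are restarted.  A minimal sketch, assuming the
usual [osd] section layout:

[osd]
    osd debug pg log writeout = false

The branch is a test build rather than a release, so this is only for
narrowing down the regression.)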

On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
Hey Mark,

On Sat, 2013-08-17 at 08:16 -0500, Mark Nelson wrote:
On 08/17/2013 06:13 AM, Oliver Daudey wrote:
Hey all,

This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040) I
created in the tracker.  Thought I would pass it through the list as
well, to get an idea if anyone else is running into it.  It may only
show under higher loads.  More info about my setup is in the bug-report
above.  Here goes:


I'm running a Ceph-cluster with 3 nodes, each of which runs a mon, osd
and mds.  I'm using RBD on this cluster as storage for KVM; CephFS is
unused at this time.  While still on v0.61.7 Cuttlefish, I got 70-100+
MB/sec on simple linear writes to a file with `dd' inside a VM on this
cluster under regular load, and the osds usually averaged 20-100%
CPU-utilisation in `top'.  After the upgrade to Dumpling, CPU-usage for
the osds shot up to 100-400% in `top' (multi-core system) and the speed
of my writes with `dd' inside a VM dropped to 20-40 MB/sec.  Users
complained that disk-access inside the VMs was significantly slower and
the backups of the RBD-store I was running also got behind quickly.
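
(The exact dd command isn't given here; a hypothetical linear-write test
of the kind described might be something like

dd if=/dev/zero of=/mnt/testfile bs=1M count=2048 oflag=direct

with the path, block size, count and direct-I/O flag all being
illustrative choices, not Oliver's actual parameters.)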

After downgrading only the osds to v0.61.7 Cuttlefish and leaving the
rest at 0.67 Dumpling, speed and load returned to normal.  I have
reproduced this performance-hit after upgrading on a similar
test-cluster under no additional load at all.  Although CPU-usage for
the osds wasn't as dramatic there, because there was no base-load from
other VMs, I/O-performance still dropped significantly after upgrading
and returned to normal after downgrading the osds.

I'm not sure what to make of it.  There are no visible errors in the
logs and everything runs and reports good health; it's just a lot
slower, with a lot more CPU-usage.
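
(The health checks referred to would typically be along the lines of

ceph -s
ceph health detail

though the exact commands used aren't stated.)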

Hi Oliver,

If you have access to the perf command on this system, could you try
running:

"sudo perf top"

And if that doesn't give you much,

"sudo perf record -g"

then:

"sudo perf report | less"

during the period of high CPU usage?  This will give you a call graph.
There may be symbols missing, but it might help track down what the OSDs
are doing.
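
A bounded capture along those lines might look like this (the 30-second
window and the system-wide -a flag are illustrative additions, not
something Mark specified):

sudo perf record -g -a -- sleep 30
sudo perf report --stdio | less

If the OSD symbols come up missing, installing the distribution's Ceph
debug-symbol package (e.g. ceph-dbg on Debian-based systems) should fill
most of them in.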

Thanks for your help!  I did a couple of runs on my test-cluster,
loading it with writes from 3 VMs concurrently and measuring the results
at the first node, first with all 0.67 Dumpling-components and then with
the osds replaced by 0.61.7 Cuttlefish.  I let `perf top' run and settle
for a while, then copied anything that showed in red and green into this
post.  Here are the results:

First, with 0.61.7 osds:

   19.91%  [kernel]                    [k] intel_idle
   10.18%  [kernel]                    [k] _raw_spin_lock_irqsave
    6.79%  ceph-osd                    [.] ceph_crc32c_le
    4.93%  [kernel]                    [k] default_send_IPI_mask_sequence_phys
    2.71%  [kernel]                    [k] copy_user_generic_string
    1.42%  libc-2.11.3.so              [.] memcpy
    1.23%  [kernel]                    [k] find_busiest_group
    1.13%  librados.so.2.0.0           [.] ceph_crc32c_le_intel
    1.11%  [kernel]                    [k] _raw_spin_lock
    0.99%  kvm                         [.] 0x1931f8
    0.92%  [igb]                       [k] igb_poll
    0.87%  [kernel]                    [k] native_write_cr0
    0.80%  [kernel]                    [k] csum_partial
    0.78%  [kernel]                    [k] __do_softirq
    0.63%  [kernel]                    [k] hpet_legacy_next_event
    0.53%  [ip_tables]                 [k] ipt_do_table
    0.50%  libc-2.11.3.so              [.] 0x74433

Second test, with 0.67 osds:

   18.32%  [kernel]                      [k] intel_idle
    7.58%  [kernel]                      [k] _raw_spin_lock_irqsave
    7.04%  ceph-osd                      [.] PGLog::undirty()
    4.39%  ceph-osd                      [.] ceph_crc32c_le_intel
    3.92%  [kernel]                      [k] default_send_IPI_mask_sequence_phys
    2.25%  [kernel]                      [k] copy_user_generic_string
    1.76%  libc-2.11.3.so                [.] memcpy
    1.56%  librados.so.2.0.0             [.] ceph_crc32c_le_intel
    1.40%  libc-2.11.3.so                [.] vfprintf
    1.12%  libc-2.11.3.so                [.] 0x7217b
    1.05%  [kernel]                      [k] _raw_spin_lock
    1.01%  [kernel]                      [k] find_busiest_group
    0.83%  kvm                           [.] 0x193ab8
    0.80%  [kernel]                      [k] native_write_cr0
    0.76%  [kernel]                      [k] __do_softirq
    0.73%  libc-2.11.3.so                [.] _IO_default_xsputn
    0.70%  [kernel]                      [k] csum_partial
    0.68%  [igb]                         [k] igb_poll
    0.58%  [kernel]                      [k] hpet_legacy_next_event
    0.54%  [kernel]                      [k] __schedule


What jumps right out is "PGLog::undirty()", which doesn't show up with
0.61.7 at all, but is an extra drag right near the top in 0.67.  Note
that I didn't manage to fully load the test-cluster CPU-wise because of
network-constraints, and I don't want to take any extra risks by testing
this on the production-cluster, but it seems we have found a possible
culprit.

Any ideas?  Thanks again!



     Regards,

        Oliver



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



