On Wed, Feb 18, 2015 at 9:32 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> On 02/18/2015 02:19 PM, Florian Haas wrote:
>>
>> Hey everyone,
>>
>> I must confess I'm still not fully understanding this problem and
>> don't exactly know where to start digging deeper, but perhaps other
>> users have seen this and/or it rings a bell.
>>
>> System info: Ceph giant on CentOS 7; approx. 240 OSDs, 6 pools using 2
>> different rulesets, where the problem applies to hosts and PGs using a
>> bog-standard default crushmap.
>>
>> Symptom: out of the blue, ceph-osd processes on a single OSD node
>> start going to 100% CPU utilization. The problem turns so bad that
>> the machine effectively becomes CPU bound and can't cope with any
>> client requests anymore. Stopping and restarting all OSDs brings the
>> problem right back, as does rebooting the machine — right after
>> the ceph-osd processes start, CPU utilization shoots up again. Stopping
>> and marking out several OSDs on the machine makes the problem go away,
>> but obviously causes massive backfilling. All the logs show while CPU
>> utilization is implausibly high is slow requests (which would be
>> expected on a system that can barely do anything).
>>
>> Now I've seen issues like this before on dumpling and firefly, but
>> besides the fact that they have all been addressed and should now be
>> fixed, they always involved the prior mass removal of RBD snapshots.
>> This system only used a handful of snapshots in testing, and is
>> presently not using any snapshots at all.
>>
>> I'll be spending some time looking for clues in the log files of the
>> OSDs that were shut down (which made the problem go away), but if
>> this sounds familiar to anyone willing to offer clues, I'd be more
>> than interested. :) Thanks!
>
> Hi Florian,
>
> Does a quick perf top tell you anything useful?

Hi Mark,

Unfortunately, quite the contrary -- but this might actually provide a clue to the underlying issue.
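(For reference, a profiling session of the kind Mark suggests might look like the sketch below. The PID lookup via pgrep is an assumption for illustration, not something from this thread; on a many-OSD node you would pick the specific busy PID instead.)

```shell
# Sketch: profile one ceph-osd process (PID lookup is illustrative only).
# Live view of the hottest functions in that process:
perf top -p "$(pgrep -o ceph-osd)"

# Or record 30 seconds with call graphs for offline analysis:
perf record -g -p "$(pgrep -o ceph-osd)" -- sleep 30
perf report --stdio | head -40
```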
The CPU pegging issue isn't currently present, so perf top data wouldn't be conclusive until the issue is reproduced. But: merely running perf top on this host, which currently only has 2 active OSDs, renders the host unresponsive. Corresponding dmesg snippet:

[Wed Feb 18 20:53:42 2015] hrtimer: interrupt took 2243820 ns
[Wed Feb 18 20:53:49 2015] ------------[ cut here ]------------
[Wed Feb 18 20:53:49 2015] WARNING: at arch/x86/kernel/cpu/perf_event.c:1074 x86_pmu_start+0xc6/0x100()
[Wed Feb 18 20:53:49 2015] Modules linked in: ipmi_si binfmt_misc mpt3sas mptctl mptbase dell_rbu 8021q garp stp mrp llc sg ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables xfs vfat fat iTCO_wdt iTCO_vendor_support dcdbas coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core lpc_ich mfd_core mei_me mei ipmi_devintf shpchp wmi ipmi_msghandler acpi_power_meter acpi_cpufreq mperf nfsd auth_rpcgss nfs_acl lockd sunrpc ext4 mbcache jbd2 raid1 sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm bnx2x drm mpt2sas i2c_core raid_class mdio scsi_transport_sas libcrc32c dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_si]
[Wed Feb 18 20:53:49 2015] CPU: 0 PID: 12381 Comm: dsm_sa_datamgrd Not tainted 3.10.0-123.20.1.el7.x86_64 #1
[Wed Feb 18 20:53:49 2015] Hardware name: Dell Inc. PowerEdge R720xd/0020HJ, BIOS 2.2.2 01/16/2014
[Wed Feb 18 20:53:49 2015] 0000000000000000 0000000050de8931 ffff880fef003d40 ffffffff815e2b0c
[Wed Feb 18 20:53:49 2015] ffff880fef003d78 ffffffff8105dee1 ffff880c316a7400 ffff880fef00b9e0
[Wed Feb 18 20:53:49 2015] 0000000000000000 ffff880fef016db0 ffff880dbaa896c0 ffff880fef003d88
[Wed Feb 18 20:53:49 2015] Call Trace:
[Wed Feb 18 20:53:49 2015] <IRQ> [<ffffffff815e2b0c>] dump_stack+0x19/0x1b
[Wed Feb 18 20:53:49 2015] [<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
[Wed Feb 18 20:53:49 2015] [<ffffffff8105e00a>] warn_slowpath_null+0x1a/0x20
[Wed Feb 18 20:53:49 2015] [<ffffffff81023706>] x86_pmu_start+0xc6/0x100
[Wed Feb 18 20:53:49 2015] [<ffffffff81136128>] perf_adjust_freq_unthr_context.part.79+0x198/0x1b0
[Wed Feb 18 20:53:49 2015] [<ffffffff811363d6>] perf_event_task_tick+0xb6/0xf0
[Wed Feb 18 20:53:49 2015] [<ffffffff810967e5>] scheduler_tick+0xd5/0x150
[Wed Feb 18 20:53:49 2015] [<ffffffff8106fe86>] update_process_times+0x66/0x80
[Wed Feb 18 20:53:49 2015] [<ffffffff810be055>] tick_sched_handle.isra.16+0x25/0x60
[Wed Feb 18 20:53:49 2015] [<ffffffff810be0d1>] tick_sched_timer+0x41/0x60
[Wed Feb 18 20:53:49 2015] [<ffffffff81089a57>] __run_hrtimer+0x77/0x1d0
[Wed Feb 18 20:53:49 2015] [<ffffffff810be090>] ? tick_sched_handle.isra.16+0x60/0x60
[Wed Feb 18 20:53:49 2015] [<ffffffff8108a297>] hrtimer_interrupt+0xf7/0x240
[Wed Feb 18 20:53:49 2015] [<ffffffff81039717>] local_apic_timer_interrupt+0x37/0x60
[Wed Feb 18 20:53:49 2015] [<ffffffff815f552f>] smp_apic_timer_interrupt+0x3f/0x60
[Wed Feb 18 20:53:49 2015] [<ffffffff815f3e9d>] apic_timer_interrupt+0x6d/0x80
[Wed Feb 18 20:53:49 2015] <EOI> [<ffffffff81092d46>] ? finish_task_switch+0x56/0x170
[Wed Feb 18 20:53:49 2015] [<ffffffff815e7db5>] __schedule+0x2c5/0x790
[Wed Feb 18 20:53:49 2015] [<ffffffff815e82a9>] schedule+0x29/0x70
[Wed Feb 18 20:53:49 2015] [<ffffffff815e73b5>] do_nanosleep+0xc5/0x130
[Wed Feb 18 20:53:49 2015] [<ffffffff8108a78d>] hrtimer_nanosleep+0xad/0x170
[Wed Feb 18 20:53:49 2015] [<ffffffff81089720>] ? hrtimer_get_res+0x50/0x50
[Wed Feb 18 20:53:49 2015] [<ffffffff815e7361>] ? do_nanosleep+0x71/0x130
[Wed Feb 18 20:53:49 2015] [<ffffffff8108a8ce>] SyS_nanosleep+0x7e/0x90
[Wed Feb 18 20:53:49 2015] [<ffffffff815f3219>] system_call_fastpath+0x16/0x1b
[Wed Feb 18 20:53:49 2015] ---[ end trace bbcc155358ec2b47 ]---
[Wed Feb 18 21:00:35 2015] ------------[ cut here ]------------
[Wed Feb 18 21:00:35 2015] WARNING: at arch/x86/kernel/cpu/perf_event_intel.c:1205 intel_pmu_handle_irq+0x3f7/0x410()
[Wed Feb 18 21:00:35 2015] perfevents: irq loop stuck!
[Wed Feb 18 21:00:35 2015] Modules linked in: ipmi_si binfmt_misc mpt3sas mptctl mptbase dell_rbu 8021q garp stp mrp llc sg ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables xfs vfat fat iTCO_wdt iTCO_vendor_support dcdbas coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core lpc_ich mfd_core mei_me mei ipmi_devintf shpchp wmi ipmi_msghandler acpi_power_meter acpi_cpufreq mperf nfsd auth_rpcgss nfs_acl lockd sunrpc ext4 mbcache jbd2 raid1 sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm bnx2x drm mpt2sas i2c_core raid_class mdio scsi_transport_sas libcrc32c dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_si]
[Wed Feb 18 21:00:35 2015] CPU: 9 PID: 4608 Comm: ceph-osd Tainted: G        W    --------------   3.10.0-123.20.1.el7.x86_64 #1
[Wed Feb 18 21:00:35 2015] Hardware name: Dell Inc. PowerEdge R720xd/0020HJ, BIOS 2.2.2 01/16/2014
[Wed Feb 18 21:00:35 2015] ffff880fef125d10 000000000bc39f12 ffff880fef125cc8 ffffffff815e2b0c
[Wed Feb 18 21:00:35 2015] ffff880fef125d00 ffffffff8105dee1 0000000000000064 ffff880fef12b9e0
[Wed Feb 18 21:00:35 2015] ffff880c316a4400 0000000000000040 ffff880fef12bbe0 ffff880fef125d68
[Wed Feb 18 21:00:35 2015] Call Trace:
[Wed Feb 18 21:00:35 2015] <NMI> [<ffffffff815e2b0c>] dump_stack+0x19/0x1b
[Wed Feb 18 21:00:35 2015] [<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
[Wed Feb 18 21:00:35 2015] [<ffffffff8105df5c>] warn_slowpath_fmt+0x5c/0x80
[Wed Feb 18 21:00:35 2015] [<ffffffff810294c7>] intel_pmu_handle_irq+0x3f7/0x410
[Wed Feb 18 21:00:35 2015] [<ffffffff81356ff4>] ? ghes_copy_tofrom_phys+0x124/0x210
[Wed Feb 18 21:00:35 2015] [<ffffffff8101a603>] ? native_sched_clock+0x13/0x80
[Wed Feb 18 21:00:35 2015] [<ffffffff8101a679>] ? sched_clock+0x9/0x10
[Wed Feb 18 21:00:35 2015] [<ffffffff8109a045>] ? sched_clock_cpu+0xb5/0x100
[Wed Feb 18 21:00:35 2015] [<ffffffff815ec08b>] perf_event_nmi_handler+0x2b/0x50
[Wed Feb 18 21:00:35 2015] [<ffffffff815eb7d9>] nmi_handle.isra.0+0x69/0xb0
[Wed Feb 18 21:00:35 2015] [<ffffffff815eb989>] do_nmi+0x169/0x340
[Wed Feb 18 21:00:35 2015] [<ffffffff815eac31>] end_repeat_nmi+0x1e/0x2e
[Wed Feb 18 21:00:35 2015] [<ffffffff8104620a>] ? native_write_msr_safe+0xa/0x10
[Wed Feb 18 21:00:35 2015] [<ffffffff8104620a>] ? native_write_msr_safe+0xa/0x10
[Wed Feb 18 21:00:35 2015] [<ffffffff8104620a>] ? native_write_msr_safe+0xa/0x10
[Wed Feb 18 21:00:35 2015] <<EOE>> <IRQ> [<ffffffff81029bc1>] intel_pmu_disable_event+0xa1/0x160
[Wed Feb 18 21:00:35 2015] [<ffffffff81022956>] x86_pmu_stop+0x76/0xd0
[Wed Feb 18 21:00:35 2015] [<ffffffff81136094>] perf_adjust_freq_unthr_context.part.79+0x104/0x1b0
[Wed Feb 18 21:00:35 2015] [<ffffffff811363d6>] perf_event_task_tick+0xb6/0xf0
[Wed Feb 18 21:00:35 2015] [<ffffffff810967e5>] scheduler_tick+0xd5/0x150
[Wed Feb 18 21:00:35 2015] [<ffffffff8106fe86>] update_process_times+0x66/0x80
[Wed Feb 18 21:00:35 2015] [<ffffffff810be055>] tick_sched_handle.isra.16+0x25/0x60
[Wed Feb 18 21:00:35 2015] [<ffffffff810be0d1>] tick_sched_timer+0x41/0x60
[Wed Feb 18 21:00:35 2015] [<ffffffff81089a57>] __run_hrtimer+0x77/0x1d0
[Wed Feb 18 21:00:35 2015] [<ffffffff810be090>] ? tick_sched_handle.isra.16+0x60/0x60
[Wed Feb 18 21:00:35 2015] [<ffffffff8108a297>] hrtimer_interrupt+0xf7/0x240
[Wed Feb 18 21:00:35 2015] [<ffffffff81039717>] local_apic_timer_interrupt+0x37/0x60
[Wed Feb 18 21:00:35 2015] [<ffffffff815f552f>] smp_apic_timer_interrupt+0x3f/0x60
[Wed Feb 18 21:00:35 2015] [<ffffffff815f3e9d>] apic_timer_interrupt+0x6d/0x80
[Wed Feb 18 21:00:35 2015] <EOI> [<ffffffff811ca796>] ? file_update_time+0xe6/0xf0
[Wed Feb 18 21:00:35 2015] [<ffffffff811ca70f>] ? file_update_time+0x5f/0xf0
[Wed Feb 18 21:00:35 2015] [<ffffffff811437a8>] __generic_file_aio_write+0x198/0x3e0
[Wed Feb 18 21:00:35 2015] [<ffffffff81143a4d>] generic_file_aio_write+0x5d/0xc0
[Wed Feb 18 21:00:35 2015] [<ffffffffa02a51a9>] ext4_file_write+0xa9/0x450 [ext4]
[Wed Feb 18 21:00:35 2015] [<ffffffff810976b6>] ? try_to_wake_up+0x1b6/0x280
[Wed Feb 18 21:00:35 2015] [<ffffffff810bfd68>] ? get_futex_key+0x1c8/0x2a0
[Wed Feb 18 21:00:35 2015] [<ffffffff811af0cd>] do_sync_write+0x8d/0xd0
[Wed Feb 18 21:00:35 2015] [<ffffffff811af86d>] vfs_write+0xbd/0x1e0
[Wed Feb 18 21:00:35 2015] [<ffffffff811b02b8>] SyS_write+0x58/0xb0
[Wed Feb 18 21:00:35 2015] [<ffffffff815f3219>] system_call_fastpath+0x16/0x1b
[Wed Feb 18 21:00:35 2015] ---[ end trace bbcc155358ec2b48 ]---
[Wed Feb 18 21:00:35 2015] CPU#9: ctrl:       0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: status:     0000000000000001
[Wed Feb 18 21:00:35 2015] CPU#9: overflow:   0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: fixed:      00000000000000b0
[Wed Feb 18 21:00:35 2015] CPU#9: pebs:       0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: active:     0000000200000000
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC0 ctrl:  000000000013003c
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC0 count: 0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC0 left:  0000000000007a1c
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC1 ctrl:  0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC1 count: 0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC1 left:  0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC2 ctrl:  0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC2 count: 0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC2 left:  0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC3 ctrl:  0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC3 count: 0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC3 left:  0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: fixed-PMC0 count: 0000000000000000
[Wed Feb 18 21:00:35 2015] CPU#9: fixed-PMC1 count: 0000fffbe795d19b
[Wed Feb 18 21:00:35 2015] CPU#9: fixed-PMC2 count: 0000000000000000
[Wed Feb 18 21:00:35 2015] perf_event_intel: clearing PMU state on CPU#9
[Wed Feb 18 21:00:35 2015] perf samples too long (64102 > 62500), lowering kernel.perf_event_max_sample_rate to 2000

In case you're wondering: no, the ext4 file write does not come from an OSD.
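(As an aside: since the kernel itself ended up throttling kernel.perf_event_max_sample_rate in that last line, one possible mitigation sketch, untested on this hardware and the specific values being illustrative only, is to lower the sampling rate up front before profiling:)

```shell
# Lower the perf sampling rate before profiling, so the PMU interrupt
# handler is less likely to fall behind (2000 mirrors the value the
# kernel auto-throttled to above; it is not a tuned recommendation):
sysctl kernel.perf_event_max_sample_rate=2000

# Or pass an explicitly low sampling frequency to perf itself:
perf top -F 99
```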
All OSDs use XFS, and the node does not host a mon either. The kernel build is the latest currently available for CentOS 7 (3.10.0-123.20.1.el7), with a matching perf build, obviously.

Does that help at all?

Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com