Crash and question

Hello everybody,

For the past 4 months I have been running a Ceph cluster configured with two monitors:

Host 1: 16GB RAM - 12x 4TB disks - 12 OSDs - 1 monitor - RAID-1 for the system
Host 2: 16GB RAM - 12x 4TB disks - 12 OSDs - 1 monitor - RAID-1 for the system

Last night the first host crashed.

My first question: why, with only one host down, did the whole cluster go down? The ceph status command hung, and all my RBD images were stuck with no possibility to read or write.
I rebooted the first host, and 2 hours later the second host went down with the same symptoms (all RBD images stuck and ceph commands hanging).
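
For context on the first question: Ceph monitors form a quorum with a Paxos-based protocol, and commands such as ceph status are only answered while a strict majority of the monitors in the monmap can reach each other. The minimal sketch below only does that majority arithmetic; the function names are hypothetical helpers, not part of any Ceph API.

```python
def quorum_size(num_mons: int) -> int:
    """Smallest strict majority of num_mons monitors."""
    if num_mons < 1:
        raise ValueError("need at least one monitor")
    return num_mons // 2 + 1


def tolerated_failures(num_mons: int) -> int:
    """How many monitors can be lost while a majority is still reachable."""
    return num_mons - quorum_size(num_mons)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} mon(s): quorum needs {quorum_size(n)}, "
              f"survives {tolerated_failures(n)} failure(s)")
```

With the two monitors in this cluster (drt-becks and drt-marco), the majority is 2, so losing either host also loses quorum. That may be enough to explain the hang, although it does not rule out other causes; an odd number of monitors (for example 3) is the usual way to survive one monitor failure.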

After the reboot, here is the ceph status output:

# ceph status
    cluster 9c29f469-7bad-4b64-97bf-3fbb1bbc0c5f
     health HEALTH_ERR
            3 pgs inconsistent
            1 pgs peering
            1 pgs stuck inactive
            1 pgs stuck unclean
            36 requests are blocked > 32 sec
            928 scrub errors
            clock skew detected on mon.drt-becks
     monmap e1: 2 mons at {drt-becks=172.16.21.6:6789/0,drt-marco=172.16.21.4:6789/0}
            election epoch 26, quorum 0,1 drt-marco,drt-becks
     osdmap e961: 24 osds: 24 up, 24 in
      pgmap v2532968: 400 pgs, 1 pools, 512 GB data, 130 kobjects
            1039 GB used, 88092 GB / 89177 GB avail
                 393 active+clean
                   3 active+clean+scrubbing+deep
                   3 active+clean+inconsistent
                   1 peering
  client io 57290 B/s wr, 7 op/s
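
A quick consistency check on the status above: the PG state counts should add up to the 400 PGs of the single pool, and, assuming the pool was created with osd_pool_default_size = 2 from the ceph.conf further down, raw usage should be roughly twice the stored data (512 GB × 2 = 1024 GB, against 1039 GB reported). The sketch below only tallies the state lines pasted here and picks out those that are inconsistent or not active; the helper names are hypothetical. On a live cluster, ceph health detail is the usual way to get the actual list of affected PGs.

```python
from __future__ import annotations

# PG state lines copied verbatim from the ceph status output above.
PGMAP_STATES = """\
393 active+clean
3 active+clean+scrubbing+deep
3 active+clean+inconsistent
1 peering
"""


def tally_states(text: str) -> dict[str, int]:
    """Map each PG state string to the number of PGs reported in it."""
    counts: dict[str, int] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        count, state = line.split(maxsplit=1)
        counts[state] = counts.get(state, 0) + int(count)
    return counts


def needs_attention(counts: dict[str, int]) -> dict[str, int]:
    """Keep states that are inconsistent or not active (e.g. peering)."""
    return {
        state: n
        for state, n in counts.items()
        if "inconsistent" in state or "active" not in state.split("+")
    }


if __name__ == "__main__":
    counts = tally_states(PGMAP_STATES)
    print("total PGs:", sum(counts.values()))  # 400 for the output above
    print("needs attention:", needs_attention(counts))
```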

I also found this error in dmesg related to the crash:

Message from syslogd@drt-marco at Jul 30 04:03:57 ...
 kernel:[4876519.657178] BUG: soft lockup - CPU#7 stuck for 22s! [btrfs-cleaner:32713]
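
To see whether this lockup is a one-off or recurs (and whether it is always btrfs-cleaner), the kernel log can be scanned for the same message. A minimal sketch, assuming the message keeps the "CPU#N stuck for Ns! [comm:pid]" shape shown above; where the lines come from (dmesg output, a saved kernel log) is left to the caller, and the function name is hypothetical.

```python
import re

# Matches e.g. "BUG: soft lockup - CPU#7 stuck for 22s! [btrfs-cleaner:32713]".
# The comm group backtracks to the last ":<digits>]", so names containing ':' still parse.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return (cpu, seconds, command, pid) for every soft-lockup line found."""
    hits = []
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            hits.append((int(m["cpu"]), int(m["secs"]), m["comm"], int(m["pid"])))
    return hits


if __name__ == "__main__":
    sample = [
        "kernel:[4876519.657178] BUG: soft lockup - CPU#7 stuck for 22s! "
        "[btrfs-cleaner:32713]",
    ]
    print(find_soft_lockups(sample))  # [(7, 22, 'btrfs-cleaner', 32713)]
```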

All my volumes are on Btrfs; maybe that was not a good idea?

Thanks a lot for your help. More hardware information is below.

K

# cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           E5506  @ 2.13GHz
stepping	: 5
microcode	: 0x19
cpu MHz		: 2133.433
cache size	: 4096 KB
physical id	: 1
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 16
initial apicid	: 16
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4266.86
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           E5506  @ 2.13GHz
stepping	: 5
microcode	: 0x19
cpu MHz		: 2133.433
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4266.74
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           E5506  @ 2.13GHz
stepping	: 5
microcode	: 0x19
cpu MHz		: 2133.433
cache size	: 4096 KB
physical id	: 1
siblings	: 4
core id		: 1
cpu cores	: 4
apicid		: 18
initial apicid	: 18
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4266.86
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           E5506  @ 2.13GHz
stepping	: 5
microcode	: 0x19
cpu MHz		: 2133.433
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 4
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4266.74
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 4
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           E5506  @ 2.13GHz
stepping	: 5
microcode	: 0x19
cpu MHz		: 2133.433
cache size	: 4096 KB
physical id	: 1
siblings	: 4
core id		: 2
cpu cores	: 4
apicid		: 20
initial apicid	: 20
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4266.86
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 5
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           E5506  @ 2.13GHz
stepping	: 5
microcode	: 0x19
cpu MHz		: 2133.433
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 2
cpu cores	: 4
apicid		: 4
initial apicid	: 4
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4266.74
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 6
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           E5506  @ 2.13GHz
stepping	: 5
microcode	: 0x19
cpu MHz		: 2133.433
cache size	: 4096 KB
physical id	: 1
siblings	: 4
core id		: 3
cpu cores	: 4
apicid		: 22
initial apicid	: 22
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4266.86
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 7
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           E5506  @ 2.13GHz
stepping	: 5
microcode	: 0x19
cpu MHz		: 2133.433
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 3
cpu cores	: 4
apicid		: 6
initial apicid	: 6
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4266.74
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:
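
The listing above is eight near-identical blocks; the topology they describe can be summarised by counting distinct physical id values and (physical id, core id) pairs. A minimal sketch, assuming the standard blank-line-separated /proc/cpuinfo layout; for this dump it should report 8 logical CPUs on 2 sockets with 8 physical cores, consistent with siblings equal to cpu cores (no Hyper-Threading active). The function names are hypothetical.

```python
from __future__ import annotations

import os


def parse_cpuinfo(text: str) -> list[dict[str, str]]:
    """Split /proc/cpuinfo text into one field dict per logical CPU."""
    procs = []
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" in fields:
            procs.append(fields)
    return procs


def summarize(procs: list[dict[str, str]]) -> dict[str, int]:
    """Count logical CPUs, sockets and physical cores."""
    sockets = {p.get("physical id", "0") for p in procs}
    cores = {(p.get("physical id", "0"), p.get("core id", p["processor"])) for p in procs}
    return {
        "logical_cpus": len(procs),
        "sockets": len(sockets),
        "physical_cores": len(cores),
    }


if __name__ == "__main__":
    if os.path.exists("/proc/cpuinfo"):  # Linux only
        with open("/proc/cpuinfo") as f:
            print(summarize(parse_cpuinfo(f.read())))
```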


# cat /etc/ceph/ceph.conf 
[global]
fsid = 9c29f469-7bad-4b64-97bf-3fbb1bbc0c5f
mon_initial_members = drt-becks, drt-marco
mon_host = 172.16.21.6, 172.16.21.4
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 172.16.21.0/24
cluster_network = 172.16.32.0/24

[osd]
osd_journal_size = 10000
filestore_xattr_use_omap = true
osd_mkfs_type = btrfs
osd_mkfs_options_btrfs = -f -l 16k -n 16k
osd_mount_options_btrfs = noatime
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 400
osd_pool_default_pgp_num = 400
osd_crush_chooseleaf_type = 1
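
For reference, the [osd] section asks for 2 replicas and 400 PGs per new pool, and ceph status shows 24 OSDs. A minimal sketch of the resulting average PG load per OSD, next to the rule of thumb given in the Ceph placement-group documentation of this era (about 100 PGs per OSD: (OSDs × 100) / pool size, rounded up to a power of two). Treat that target as a guideline rather than a requirement; the function names are hypothetical.

```python
def pgs_per_osd(pg_num: int, pool_size: int, num_osds: int) -> float:
    """Average number of PG replicas each OSD carries for one pool."""
    return pg_num * pool_size / num_osds


def rule_of_thumb_pg_num(num_osds: int, pool_size: int, target_per_osd: int = 100) -> int:
    """(OSDs * target) / replicas, rounded up to the next power of two."""
    raw = num_osds * target_per_osd / pool_size
    power = 1
    while power < raw:
        power *= 2
    return power


if __name__ == "__main__":
    # 24 OSDs from ceph status; size 2 and pg_num 400 from ceph.conf.
    print(f"PGs per OSD today: {pgs_per_osd(400, 2, 24):.1f}")    # 33.3
    print(f"rule-of-thumb pg_num: {rule_of_thumb_pg_num(24, 2)}")  # 2048
```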

# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        11G  2.8G  7.6G  28% /
udev             10M     0   10M   0% /dev
tmpfs           3.2G  9.2M  3.2G   1% /run
tmpfs           7.9G  8.0K  7.9G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/sda3       205G   52M  203G   1% /opt
/dev/sdm1       3.7T   64G  3.6T   2% /var/lib/ceph/osd/ceph-23
/dev/sdj1       3.7T   41G  3.6T   2% /var/lib/ceph/osd/ceph-17
/dev/sdc1       3.7T   37G  3.6T   1% /var/lib/ceph/osd/ceph-4
/dev/sdb1       3.7T   45G  3.6T   2% /var/lib/ceph/osd/ceph-1
tmpfs           1.6G     0  1.6G   0% /run/user/0
/dev/sdh1       3.7T   43G  3.6T   2% /var/lib/ceph/osd/ceph-13
/dev/sdd1       3.7T   39G  3.6T   2% /var/lib/ceph/osd/ceph-5
/dev/sde1       3.7T   42G  3.6T   2% /var/lib/ceph/osd/ceph-7
/dev/sdl1       3.7T   53G  3.6T   2% /var/lib/ceph/osd/ceph-21
/dev/sdf1       3.7T   36G  3.6T   1% /var/lib/ceph/osd/ceph-9
/dev/sdk1       3.7T   41G  3.6T   2% /var/lib/ceph/osd/ceph-19
/dev/sdi1       3.7T   44G  3.6T   2% /var/lib/ceph/osd/ceph-15
/dev/sdg1       3.7T   41G  3.6T   2% /var/lib/ceph/osd/ceph-11
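
df -h reports rounded GiB values, so this is only approximate, but it gives a quick picture of how evenly data is spread over the 12 OSDs on this host. The dictionary is copied from the Used column above; the helper name is hypothetical.

```python
# "Used" column from the df -h output above, keyed by OSD id (GiB, rounded by df).
USED_GIB = {23: 64, 17: 41, 4: 37, 1: 45, 13: 43, 5: 39,
            7: 42, 21: 53, 9: 36, 19: 41, 15: 44, 11: 41}


def usage_spread(used):
    """Total, mean, emptiest/fullest OSD and fullest-to-mean ratio."""
    total = sum(used.values())
    mean = total / len(used)
    emptiest = min(used, key=used.get)
    fullest = max(used, key=used.get)
    return {
        "total": total,
        "mean": mean,
        "emptiest": (emptiest, used[emptiest]),
        "fullest": (fullest, used[fullest]),
        "fullest_over_mean": used[fullest] / mean,
    }


if __name__ == "__main__":
    for key, value in usage_spread(USED_GIB).items():
        print(key, value)
```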

# cat /etc/sysctl.conf
#
# /etc/sysctl.conf - Configuration file for setting system variables
# See /etc/sysctl.d/ for additional system variables
# See sysctl.conf (5) for information.
#

net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

net.core.somaxconn = 1024
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 10

net.ipv4.tcp_slow_start_after_idle = 0

net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 8192

net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.accept_source_route = 0

kernel.pid_max = 4194303





_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



