Re: CEPH All OSD got segmentation fault after CRUSH edit

Can you attach the OSDMap (ceph osd getmap -o <mapfile>)?
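(For reference, a minimal way to capture and sanity-check the map; the output path below is just an example:)

————————————————
# dump the cluster's current OSDMap to a file
ceph osd getmap -o /tmp/osdmap

# print a human-readable summary of it (epoch, pools, OSD up/in state)
osdmaptool /tmp/osdmap --print
————————————————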
-Sam

On Tue, Apr 26, 2016 at 2:07 AM, Henrik Svensson <henrik.svensson@xxxxxxxxxx> wrote:
Hi!

We have a three-node Ceph cluster with 10 OSDs per node.

We bought 3 new machines with an additional 30 disks, which are to reside in another location.
Before adding these machines, we modified the default CRUSH map.

After modifying the (default) CRUSH map with these commands, the cluster went down:

————————————————
ceph osd crush add-bucket dc1 datacenter
ceph osd crush add-bucket dc2 datacenter
ceph osd crush add-bucket availo datacenter
ceph osd crush move dc1 root=default
ceph osd crush move lkpsx0120 root=default datacenter=dc1
ceph osd crush move lkpsx0130 root=default datacenter=dc1
ceph osd crush move lkpsx0140 root=default datacenter=dc1
ceph osd crush move dc2 root=default
ceph osd crush move availo root=default
ceph osd crush add-bucket sectra root
ceph osd crush move dc1 root=sectra
ceph osd crush move dc2 root=sectra
ceph osd crush move dc3 root=sectra
ceph osd crush move availo root=sectra
ceph osd crush remove default
————————————————

I tried to revert the CRUSH map, but with no luck (a note on restoring a saved map follows the commands):

————————————————
ceph osd crush add-bucket default root
ceph osd crush move lkpsx0120 root=default 
ceph osd crush move lkpsx0130 root=default 
ceph osd crush move lkpsx0140 root=default 
ceph osd crush remove sectra
————————————————
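(Aside: a CRUSH map can also be swapped in wholesale with setcrushmap, assuming a known-good compiled copy was saved before the edit; the file names below are only placeholders:)

————————————————
# keep a copy of the current (broken) map for reference
ceph osd getcrushmap -o crush.current

# if a compiled copy of the pre-edit map exists, it can be re-injected in one step
ceph osd setcrushmap -i crush.before-edit
————————————————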

After trying to restart the cluster (and even the machines), no OSD started up again.
But ceph osd tree gives the output below, claiming that certain OSDs are up (even though the processes are not running):

————————————————
# id weight type name up/down reweight
-1 163.8 root default
-2 54.6 host lkpsx0120
0 5.46 osd.0 down 0
1 5.46 osd.1 down 0
2 5.46 osd.2 down 0
3 5.46 osd.3 down 0
4 5.46 osd.4 down 0
5 5.46 osd.5 down 0
6 5.46 osd.6 down 0
7 5.46 osd.7 down 0
8 5.46 osd.8 down 0
9 5.46 osd.9 down 0
-3 54.6 host lkpsx0130
10 5.46 osd.10 down 0
11 5.46 osd.11 down 0
12 5.46 osd.12 down 0
13 5.46 osd.13 down 0
14 5.46 osd.14 down 0
15 5.46 osd.15 down 0
16 5.46 osd.16 down 0
17 5.46 osd.17 down 0
18 5.46 osd.18 up 1
19 5.46 osd.19 up 1
-4 54.6 host lkpsx0140
20 5.46 osd.20 up 1
21 5.46 osd.21 down 0
22 5.46 osd.22 down 0
23 5.46 osd.23 down 0
24 5.46 osd.24 down 0
25 5.46 osd.25 up 1
26 5.46 osd.26 up 1
27 5.46 osd.27 up 1
28 5.46 osd.28 up 1
29 5.46 osd.29 up 1
————————————————

The monitor starts/restarts fine (only one monitor exists).
But when starting a single OSD, nothing shows up in ceph -w.

Here is the ceph mon_status:

————————————————
{ "name": "lkpsx0120",
  "rank": 0,
  "state": "leader",
  "election_epoch": 1,
  "quorum": [
        0],
  "outside_quorum": [],
  "extra_probe_peers": [],
  "sync_provider": [],
  "monmap": { "epoch": 4,
      "fsid": "9244194a-5e10-47ae-9287-507944612f95",
      "modified": "0.000000",
      "created": "0.000000",
      "mons": [
            { "rank": 0,
              "name": "lkpsx0120",
              "addr": "10.15.2.120:6789\/0"}]}}
————————————————
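(The status above can be obtained either with the cluster command or directly via the monitor's admin socket; the socket path below is the default one and may differ on this system:)

————————————————
ceph mon_status

# or, on the monitor host itself:
ceph --admin-daemon /var/run/ceph/ceph-mon.lkpsx0120.asok mon_status
————————————————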

Here is the ceph.conf file:

————————————————
[global]
fsid = 9244194a-5e10-47ae-9287-507944612f95
mon_initial_members = lkpsx0120
mon_host = 10.15.2.120
#debug osd = 20
#debug ms = 1
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_crush_chooseleaf_type = 1
osd_pool_default_size = 2
public_network = 10.15.2.0/24
cluster_network = 10.15.4.0/24
rbd_cache = true
rbd_cache_size = 67108864
rbd_cache_max_dirty = 50331648
rbd_cache_target_dirty = 33554432
rbd_cache_max_dirty_age = 2
rbd_cache_writethrough_until_flush = true
————————————————

Here is the decompiled crush map:

————————————————
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
device 26 osd.26
device 27 osd.27
device 28 osd.28
device 29 osd.29

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host lkpsx0120 {
	id -2 # do not change unnecessarily
	# weight 54.600
	alg straw
	hash 0 # rjenkins1
	item osd.0 weight 5.460
	item osd.1 weight 5.460
	item osd.2 weight 5.460
	item osd.3 weight 5.460
	item osd.4 weight 5.460
	item osd.5 weight 5.460
	item osd.6 weight 5.460
	item osd.7 weight 5.460
	item osd.8 weight 5.460
	item osd.9 weight 5.460
}
host lkpsx0130 {
	id -3 # do not change unnecessarily
	# weight 54.600
	alg straw
	hash 0 # rjenkins1
	item osd.10 weight 5.460
	item osd.11 weight 5.460
	item osd.12 weight 5.460
	item osd.13 weight 5.460
	item osd.14 weight 5.460
	item osd.15 weight 5.460
	item osd.16 weight 5.460
	item osd.17 weight 5.460
	item osd.18 weight 5.460
	item osd.19 weight 5.460
}
host lkpsx0140 {
	id -4 # do not change unnecessarily
	# weight 54.600
	alg straw
	hash 0 # rjenkins1
	item osd.20 weight 5.460
	item osd.21 weight 5.460
	item osd.22 weight 5.460
	item osd.23 weight 5.460
	item osd.24 weight 5.460
	item osd.25 weight 5.460
	item osd.26 weight 5.460
	item osd.27 weight 5.460
	item osd.28 weight 5.460
	item osd.29 weight 5.460
}
root default {
	id -1 # do not change unnecessarily
	# weight 163.800
	alg straw
	hash 0 # rjenkins1
	item lkpsx0120 weight 54.600
	item lkpsx0130 weight 54.600
	item lkpsx0140 weight 54.600
}

# rules
rule replicated_ruleset {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}

# end crush map
————————————————
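(For completeness: the text above is presumably the output of the usual crushtool round trip; the same tool can also dry-run a rule to check where replicas would be placed before injecting an edited map. File names are examples only:)

————————————————
# fetch and decompile the current CRUSH map
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt

# after editing, recompile and dry-run rule 0 with 2 replicas (pool size is 2 here)
crushtool -c crush.txt -o crush.new
crushtool -i crush.new --test --rule 0 --num-rep 2 --show-statistics
————————————————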

The operating system is Debian 8.0 and the Ceph version is 0.80.7, as stated in the crash log.

We increased the log level and tried to start osd.1 as an example. Every OSD we tried to start hits the same problem and dies.
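(A single OSD can be run in the foreground with elevated logging to capture the crash; the flags below are the standard ceph-osd debug options, shown only as a sketch of this kind of invocation:)

————————————————
# run osd.1 in the foreground, logging to stderr, with verbose OSD and messenger debugging
ceph-osd -i 1 -d --debug_osd 20 --debug_ms 1
————————————————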

The log file from OSD 1 (ceph-osd.1.log) can be found here: https://www.dropbox.com/s/dqunlufh0qtked5/ceph-osd.1.log.zip?dl=0

As of now, all systems are down, including the KVM cluster that depends on Ceph.

Best regards,

Henrik

Henrik Svensson
OpIT
Sectra AB
 
Teknikringen 20, 58330 Linköping, Sweden
E-mail: henrik.svensson@xxxxxxxxxx
 
Phone: +46 (0)13 352 884
Cellular: +46 (0)70 395141
Web: www.sectra.com



This message is intended only for the addressee and may contain information that is
confidential or privileged. Unauthorized use is strictly prohibited and may be unlawful.

If you are not the addressee, you should not read, copy, disclose or otherwise use this
message, except for the purpose of delivery to the addressee. If you have received
this in error, please delete and advise us immediately.




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


