Disk Down Emergency

Dear cephers,

I have an emergency on a rather small ceph cluster.

My cluster consists of 2 OSD nodes, each with 10 x 4 TB disks, and 3 monitor nodes.

The version of ceph running is Firefly v.0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)

The cluster was originally built with "replicated size=2" and "min_size=1" using the attached crush map,
which, as I understand it, replicates data across hosts.
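
For reference, the replication settings can be confirmed per pool with something like the following (the pool name "rbd" is only an example here; substitute the actual data pool):

$ ceph osd pool get rbd size          # "rbd" is an example pool name
$ ceph osd pool get rbd min_size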

The emergency stems from violating the golden rule: "Never use 2 replicas on a production cluster."

Unfortunately the customers never really understood the risk, and now that one disk is down I am caught in the middle and must do everything in my power not to lose any data, so I am requesting your assistance.

Here is the output of

$ ceph osd tree
# id	weight	type name	up/down	reweight
-1	72.6	root default
-2	36.3		host store1
0	3.63			osd.0	down	0	---> DISK DOWN
1	3.63			osd.1	up	1
2	3.63			osd.2	up	1
3	3.63			osd.3	up	1
4	3.63			osd.4	up	1
5	3.63			osd.5	up	1
6	3.63			osd.6	up	1
7	3.63			osd.7	up	1
8	3.63			osd.8	up	1
9	3.63			osd.9	up	1
-3	36.3		host store2
10	3.63			osd.10	up	1
11	3.63			osd.11	up	1
12	3.63			osd.12	up	1
13	3.63			osd.13	up	1
14	3.63			osd.14	up	1
15	3.63			osd.15	up	1
16	3.63			osd.16	up	1
17	3.63			osd.17	up	1
18	3.63			osd.18	up	1
19	3.63			osd.19	up	1

and here is the status of the cluster


# ceph health
HEALTH_WARN 497 pgs degraded; 549 pgs stuck unclean; recovery 51916/2552684 objects degraded (2.034%)
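
If more detail is needed, I believe the affected placement groups can be listed with something like the following (output omitted here for brevity):

# ceph health detail
# ceph pg dump_stuck unclean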


Although osd.0 is shown as mounted, it cannot be started (probably a failed disk or controller problem).

# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       251G  4.1G  235G   2% /
tmpfs            24G     0   24G   0% /dev/shm
/dev/sda1       239M  100M  127M  44% /boot
/dev/sdj1       3.7T  223G  3.5T   6% /var/lib/ceph/osd/ceph-8
/dev/sdh1       3.7T  205G  3.5T   6% /var/lib/ceph/osd/ceph-6
/dev/sdg1       3.7T  199G  3.5T   6% /var/lib/ceph/osd/ceph-5
/dev/sde1       3.7T  180G  3.5T   5% /var/lib/ceph/osd/ceph-3
/dev/sdi1       3.7T  187G  3.5T   6% /var/lib/ceph/osd/ceph-7
/dev/sdf1       3.7T  193G  3.5T   6% /var/lib/ceph/osd/ceph-4
/dev/sdd1       3.7T  212G  3.5T   6% /var/lib/ceph/osd/ceph-2
/dev/sdk1       3.7T  210G  3.5T   6% /var/lib/ceph/osd/ceph-9
/dev/sdb1       3.7T  164G  3.5T   5% /var/lib/ceph/osd/ceph-0   ---> This is the problematic OSD
/dev/sdc1       3.7T  183G  3.5T   5% /var/lib/ceph/osd/ceph-1



# service ceph start osd.0
find: `/var/lib/ceph/osd/ceph-0': Input/output error
/etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines mon.store1 osd.6 osd.9 osd.1 osd.4 osd.3 osd.2 osd.8 osd.5 osd.7 mds.store1 mon.store3, /var/lib/ceph defines mon.store1 osd.6 osd.9 osd.1 osd.4 osd.3 osd.2 osd.8 osd.5 osd.7 mds.store1)
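
I have not yet been able to confirm whether it is the disk itself or the controller that has failed; before touching the OSD I plan to check along these lines (device names as in the df output above):

# dmesg | grep -i sdb                 # look for I/O errors on the device
# smartctl -a /dev/sdb                # SMART status, if the disk still responds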


I have found this: http://ceph.com/geen-categorie/admin-guide-replacing-a-failed-disk-in-a-ceph-cluster/

and I am looking for your guidance on performing all the steps properly, so that I do not lose any data and the second copy is preserved.
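
Based on that article, this is the rough sequence I am planning, assuming the disk is really dead and the copies on store2 stay healthy; please tell me if any step is unsafe in my situation:

# ceph osd out 0                      # let the cluster re-replicate the PGs that lived on osd.0
# (wait until recovery finishes and the cluster reports HEALTH_OK again)
# service ceph stop osd.0             # it is already down, but just to be sure
# ceph osd crush remove osd.0         # remove it from the crush map
# ceph auth del osd.0                 # remove its authentication key
# ceph osd rm 0                       # remove the OSD id from the cluster
# (physically replace the disk, then recreate the OSD; e.g. with ceph-deploy, if that is how the cluster was deployed)
# ceph-deploy osd create store1:sdb   # hypothetical; depends on the deployment method and journal layout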


Best regards,

G.

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host store1 {
	id -2		# do not change unnecessarily
	# weight 36.300
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 3.630
	item osd.1 weight 3.630
	item osd.2 weight 3.630
	item osd.3 weight 3.630
	item osd.4 weight 3.630
	item osd.5 weight 3.630
	item osd.6 weight 3.630
	item osd.7 weight 3.630
	item osd.8 weight 3.630
	item osd.9 weight 3.630
}
host store2 {
	id -3		# do not change unnecessarily
	# weight 36.300
	alg straw
	hash 0	# rjenkins1
	item osd.10 weight 3.630
	item osd.11 weight 3.630
	item osd.12 weight 3.630
	item osd.13 weight 3.630
	item osd.14 weight 3.630
	item osd.15 weight 3.630
	item osd.16 weight 3.630
	item osd.17 weight 3.630
	item osd.18 weight 3.630
	item osd.19 weight 3.630
}
root default {
	id -1		# do not change unnecessarily
	# weight 72.600
	alg straw
	hash 0	# rjenkins1
	item store1 weight 36.300
	item store2 weight 36.300
}

# rules
rule replicated_ruleset {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}

# end crush map
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
