Dear cephers,
I have an emergency on a rather small ceph cluster.
My cluster consists of 2 OSD nodes with 10 x 4 TB disks each, and 3
monitor nodes.
The version of Ceph running is Firefly v0.80.9
(b5a67f0e1d15385bc0d60a6da6e7fc810bde6047).
The cluster was originally built with "Replicated size=2" and "Min
size=1" using the attached crush map, which, to my understanding,
replicates data across hosts.
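To double-check those settings on the live cluster I assume they can be
read back like this (the pool name "rbd" is only an example, I would
check every pool):

# ceph osd dump | grep 'replicated size'
# ceph osd pool get rbd size
# ceph osd pool get rbd min_size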
The emergency comes from the violation of the golden rule: "Never use 2
replicas on a production cluster"
Unfortunately the customers never really understood the risk, and now
that one disk is down I am caught in the middle and must do everything
in my power not to lose any data, so I am requesting your assistance.
Here is the output of
$ ceph osd tree
# id   weight  type name        up/down  reweight
-1     72.6    root default
-2     36.3      host store1
0      3.63        osd.0        down     0         ---> DISK DOWN
1      3.63        osd.1        up       1
2      3.63        osd.2        up       1
3      3.63        osd.3        up       1
4      3.63        osd.4        up       1
5      3.63        osd.5        up       1
6      3.63        osd.6        up       1
7      3.63        osd.7        up       1
8      3.63        osd.8        up       1
9      3.63        osd.9        up       1
-3     36.3      host store2
10     3.63        osd.10       up       1
11     3.63        osd.11       up       1
12     3.63        osd.12       up       1
13     3.63        osd.13       up       1
14     3.63        osd.14       up       1
15     3.63        osd.15       up       1
16     3.63        osd.16       up       1
17     3.63        osd.17       up       1
18     3.63        osd.18       up       1
19     3.63        osd.19       up       1
and here is the status of the cluster
# ceph health
HEALTH_WARN 497 pgs degraded; 549 pgs stuck unclean; recovery 51916/2552684 objects degraded (2.034%)
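To see exactly which PGs are affected, I assume the right commands are
something like the following (please correct me if there is a better
way to confirm that the surviving copy of every degraded PG is on
store2):

# ceph health detail
# ceph pg dump_stuck unclean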
Although OSD.0 is shown as mounted, it cannot be started (probably a
failed disk or controller problem).
# df -h
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sda3   251G  4.1G  235G     2%  /
tmpfs        24G     0   24G     0%  /dev/shm
/dev/sda1   239M  100M  127M    44%  /boot
/dev/sdj1   3.7T  223G  3.5T     6%  /var/lib/ceph/osd/ceph-8
/dev/sdh1   3.7T  205G  3.5T     6%  /var/lib/ceph/osd/ceph-6
/dev/sdg1   3.7T  199G  3.5T     6%  /var/lib/ceph/osd/ceph-5
/dev/sde1   3.7T  180G  3.5T     5%  /var/lib/ceph/osd/ceph-3
/dev/sdi1   3.7T  187G  3.5T     6%  /var/lib/ceph/osd/ceph-7
/dev/sdf1   3.7T  193G  3.5T     6%  /var/lib/ceph/osd/ceph-4
/dev/sdd1   3.7T  212G  3.5T     6%  /var/lib/ceph/osd/ceph-2
/dev/sdk1   3.7T  210G  3.5T     6%  /var/lib/ceph/osd/ceph-9
/dev/sdb1   3.7T  164G  3.5T     5%  /var/lib/ceph/osd/ceph-0  ---> This is the problematic OSD
/dev/sdc1   3.7T  183G  3.5T     5%  /var/lib/ceph/osd/ceph-1
# service ceph start osd.0
find: `/var/lib/ceph/osd/ceph-0': Input/output error
/etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines
mon.store1 osd.6 osd.9 osd.1 osd.4 osd.3 osd.2 osd.8 osd.5 osd.7
mds.store1 mon.store3, /var/lib/ceph defines mon.store1 osd.6 osd.9
osd.1 osd.4 osd.3 osd.2 osd.8 osd.5 osd.7 mds.store1)
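To confirm that it is really the drive (or its controller) and not just
a mount problem, my plan is to look for I/O errors in the kernel log and
to read the SMART data with smartmontools, along these lines (assuming
/dev/sdb really is the failed device):

# dmesg | grep -i sdb
# smartctl -a /dev/sdb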
I have found this guide:
http://ceph.com/geen-categorie/admin-guide-replacing-a-failed-disk-in-a-ceph-cluster/
and I am looking for your guidance so that I perform every step
properly, do not lose any data, and keep the surviving second copy.
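From that guide, my understanding of the safe order of operations is
roughly the following (osd.0 and /dev/sdb are my values; the ceph-deploy
step is only an assumption, I do not know yet how the OSDs were
originally created). Please confirm that this will not touch the
surviving replicas on store2:

1) Mark the dead OSD out so its PGs backfill from the copies on store2:
   # ceph osd out 0
2) Wait until "ceph health" is back to HEALTH_OK (recovery finished).
3) Only then remove the OSD from the crush map, auth and the osd map:
   # ceph osd crush remove osd.0
   # ceph auth del osd.0
   # ceph osd rm 0
4) Replace the physical disk and prepare a new OSD on it, e.g.:
   # ceph-deploy osd prepare store1:/dev/sdb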
Best regards,
G.
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host store1 {
    id -2        # do not change unnecessarily
    # weight 36.300
    alg straw
    hash 0       # rjenkins1
    item osd.0 weight 3.630
    item osd.1 weight 3.630
    item osd.2 weight 3.630
    item osd.3 weight 3.630
    item osd.4 weight 3.630
    item osd.5 weight 3.630
    item osd.6 weight 3.630
    item osd.7 weight 3.630
    item osd.8 weight 3.630
    item osd.9 weight 3.630
}
host store2 {
    id -3        # do not change unnecessarily
    # weight 36.300
    alg straw
    hash 0       # rjenkins1
    item osd.10 weight 3.630
    item osd.11 weight 3.630
    item osd.12 weight 3.630
    item osd.13 weight 3.630
    item osd.14 weight 3.630
    item osd.15 weight 3.630
    item osd.16 weight 3.630
    item osd.17 weight 3.630
    item osd.18 weight 3.630
    item osd.19 weight 3.630
}
root default {
    id -1        # do not change unnecessarily
    # weight 72.600
    alg straw
    hash 0       # rjenkins1
    item store1 weight 36.300
    item store2 weight 36.300
}
# rules
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
# end crush map
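P.S. My understanding is that the rule above can be sanity-checked with
crushtool, roughly like this (the file names are mine):

# crushtool -c crushmap.txt -o crushmap.bin
# crushtool -i crushmap.bin --test --rule 0 --num-rep 2 --show-mappings

If I read it correctly, every mapping should contain one OSD from store1
(0-9) and one from store2 (10-19), which is why I hope the data that was
on osd.0 still has a copy on store2.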