(Sorry, sometimes I hit the wrong shortcuts too quickly.)
Hi experts,
I need your help. I have a running cluster with 19 OSDs and 3 MONs. I
created a separate LVM volume for /var/lib/ceph on one of the nodes. I
stopped the mon service on that node, rsynced the content to the newly
created volume and restarted the monitor, but obviously I didn't do
that correctly, as I'm now stuck in an ERROR state and can't repair
the affected PGs.
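For reference, this is roughly what I did (the mon ID, device path
and mount point below are placeholders, not the exact values I used):
---cut here---
# stop the monitor on this node (unit name depends on distro/init)
systemctl stop ceph-mon@mon1
# copy the existing data to the new LV, temporarily mounted elsewhere
rsync -a /var/lib/ceph/ /mnt/ceph-new/
# swap the mounts so the LV now backs /var/lib/ceph
umount /mnt/ceph-new
mount /dev/vg0/lv_ceph /var/lib/ceph
# restart the monitor
systemctl start ceph-mon@mon1
---cut here---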
How would I do that correctly? I want to do the same on the remaining
nodes, but without bringing the cluster into an error state again.
One thing I've already learned is to set the noout flag before
stopping services, but what else do I have to do to get this right?
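For the remaining nodes I was planning something like the following,
but I don't know if that alone is sufficient:
---cut here---
ceph osd set noout
# stop services, move /var/lib/ceph to the new LV, restart services
ceph osd unset noout
---cut here---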
But now that the cluster is in an error state, how can I repair it?
The current status is:
---cut here---
ceph@node01:~/ceph-deploy> ceph -s
cluster 655cb05a-435a-41ba-83d9-8549f7c36167
health HEALTH_ERR
16 pgs inconsistent
261 scrub errors
monmap e7: 3 mons at
{mon1=192.168.160.15:6789/0,mon2=192.168.160.17:6789/0,mon3=192.168.160.16:6789/0}
election epoch 356, quorum 0,1,2 mon1,mon2,mon3
osdmap e3394: 19 osds: 19 up, 19 in
pgmap v7105355: 8432 pgs, 15 pools, 1003 GB data, 205 kobjects
2114 GB used, 6038 GB / 8153 GB avail
8413 active+clean
16 active+clean+inconsistent
3 active+clean+scrubbing+deep
client io 0 B/s rd, 136 kB/s wr, 34 op/s
ceph@ndesan01:~/ceph-deploy> ceph health detail
HEALTH_ERR 16 pgs inconsistent; 261 scrub errors
pg 1.ffa is active+clean+inconsistent, acting [16,5]
pg 1.cc9 is active+clean+inconsistent, acting [5,18]
pg 1.bb1 is active+clean+inconsistent, acting [15,5]
pg 1.ac4 is active+clean+inconsistent, acting [0,5]
pg 1.a46 is active+clean+inconsistent, acting [13,4]
pg 1.a16 is active+clean+inconsistent, acting [5,18]
pg 1.9e4 is active+clean+inconsistent, acting [13,9]
pg 1.9b7 is active+clean+inconsistent, acting [5,6]
pg 1.950 is active+clean+inconsistent, acting [0,9]
pg 1.6db is active+clean+inconsistent, acting [15,5]
pg 1.5f6 is active+clean+inconsistent, acting [17,5]
pg 1.5c2 is active+clean+inconsistent, acting [8,4]
pg 1.5bc is active+clean+inconsistent, acting [9,6]
pg 1.505 is active+clean+inconsistent, acting [16,9]
pg 1.3e6 is active+clean+inconsistent, acting [2,4]
pg 1.32 is active+clean+inconsistent, acting [18,5]
261 scrub errors
---cut here---
And the number of scrub errors is increasing again, although I
started with more than 400 and had already gotten it down to the 261
shown above.
What I have tried is to manually repair single PGs as described in
[1]. But some of the broken PGs have no entries in the log files, so
I have nothing to look at.
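For the PGs that do show up in the logs, I did roughly this (pg 1.ffa
and its primary osd.16 are just examples taken from the output above):
---cut here---
# look for the broken object in the primary OSD's log
grep ERR /var/log/ceph/ceph-osd.16.log
# then tell Ceph to repair the PG
ceph pg repair 1.ffa
---cut here---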
If an object exists on one OSD but is missing from its peer, how do I
get it copied back there? Everything I've tried so far hasn't really
accomplished anything: the number of scrub errors went down for a
while, but it is increasing again, so no success at all.
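I assume I could at least locate where such an object is supposed to
live, and verify the PG afterwards, with something like this (pool,
object and PG names are placeholders):
---cut here---
# show which PG and OSDs an object maps to
ceph osd map <poolname> <objectname>
# re-run a deep scrub on the PG afterwards to verify it
ceph pg deep-scrub <pgid>
---cut here---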
I'd be really grateful for your advice!
Regards,
Eugen
[1] http://ceph.com/planet/ceph-manually-repair-object/
--
Eugen Block voice : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG fax : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg e-mail : eblock@xxxxxx
Vorsitzende des Aufsichtsrates: Angelika Mozdzen
Sitz und Registergericht: Hamburg, HRB 90934
Vorstand: Jens-U. Mozdzen
USt-IdNr. DE 814 013 983
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com