Dear Robert,
Yes, you're right. The two removed OSDs of those PGs were on the same
host, which contradicts my rules (that's the reason I removed them).
Unfortunately the partitions of those disks have all been formatted, so I
cannot recover the data.
However, the command "ceph pg force_create_pg <pg ID>" plus
restarting the OSD daemons worked to clean up the stale PGs. Now my Ceph
health is OK and the RBD service works normally.
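For reference, the fix was roughly the following on my side (the PG ID and
the restart command below are only examples from my setup; adjust per node):

    # recreate one of the stale PGs (one ID from the list further down)
    ceph pg force_create_pg 17.c6
    # then restart the OSD daemons on each node, e.g. on a sysvinit setup
    service ceph restart osd.0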
Many thanks for your help,
FaHui
On 2015/4/24 at 10:08 AM, Robert LeBlanc wrote:
What hosts were those OSDs on? I'm concerned that two
OSDs for some of the PGs were adjacent, and if that placed them
on the same host, it would be contrary to your rules and
something deeper would be wrong.
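A quick way to check is the CRUSH tree, which lists each OSD
nested under its host bucket:

    ceph osd tree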
Did you format the disks that were taken out of the
cluster? Can you mount the partitions and see the files and
directories? If so, you can probably recover the data using the
recovery/dev tools.
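As a rough sketch (assuming the old OSDs were FileStore, and with
placeholder device/mount names), something like:

    # mount the old OSD partition read-only on another box
    mount -o ro /dev/sdb1 /mnt/old-osd
    # on a FileStore OSD each PG lives under current/<pgid>_head
    ls /mnt/old-osd/current | grep '^17\.'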
You may be able to force create the missing PGs using
"ceph pg force_create_pg <pg.id>". This may or may not work, I
don't remember.
If you just don't care about losing the data, you can
delete the pool and create a new one. This should work for sure,
but it loses any data you might still have had. If this pool
was full of RBD, then there is a high possibility that all of
your RBD images had chunks in the missing PGs. If you choose not
to try to restore the PGs using the tools, I'd be inclined to
delete the pool and restore from backup so as not to be surprised
by data corruption in the images. Neither option is ideal or
quick.
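A hedged sketch of that last option (the pool name and PG counts here
are only guesses; check yours with "ceph osd lspools"):

    # delete the pool -- irreversible, the name must be given twice
    ceph osd pool delete rbd rbd --yes-i-really-really-mean-it
    # recreate it; 512 PGs is only an example value
    ceph osd pool create rbd 512 512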
Robert LeBlanc
Sent from a mobile device; please excuse any typos.
On Apr 23, 2015 6:42 PM, "FaHui Lin" <fahui.lin@xxxxxxxxxx>
wrote:
Hi, thank you for your response.
Well, I've not only taken out but also completely removed
both OSDs of that PG (with "ceph osd rm" and by deleting
everything in /var/lib/ceph/osd/<related OSDs>), and did the
same for all the other stale PGs.
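For context, the removal on my side was roughly the usual per-OSD sequence
(osd.15 below is just one example, and the data path is the default one):

    ceph osd crush rm osd.15          # drop it from the CRUSH map
    ceph auth del osd.15              # remove its auth key
    ceph osd rm 15                    # remove the OSD id from the cluster
    rm -rf /var/lib/ceph/osd/ceph-15  # wipe the local data directory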
The main problem I have is that those stale PGs (which are missing all
the OSDs I've removed) not only trigger a Ceph health warning, but
also prevent other machines from mounting the Ceph RBD.
Here's the full crush map. The OSDs I removed were
osd.5~19.
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 500
# devices
device 0 osd.0
device 1 device1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 device5
device 6 device6
device 7 device7
device 8 device8
device 9 device9
device 10 device10
device 11 device11
device 12 device12
device 13 device13
device 14 device14
device 15 device15
device 16 device16
device 17 device17
device 18 device18
device 19 device19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
device 26 osd.26
device 27 osd.27
# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root
# buckets
host XX-ceph01 {
id -2 # do not change unnecessarily
# weight 160.040
alg straw
hash 0 # rjenkins1
item osd.0 weight 40.010
item osd.2 weight 40.010
item osd.3 weight 40.010
item osd.4 weight 40.010
}
host XX-ceph02 {
id -3 # do not change unnecessarily
# weight 320.160
alg straw
hash 0 # rjenkins1
item osd.20 weight 40.020
item osd.21 weight 40.020
item osd.22 weight 40.020
item osd.23 weight 40.020
item osd.24 weight 40.020
item osd.25 weight 40.020
item osd.26 weight 40.020
item osd.27 weight 40.020
}
root default {
id -1 # do not change unnecessarily
# weight 480.200
alg straw
hash 0 # rjenkins1
item XX-ceph01 weight 160.040
item XX-ceph02 weight 320.160
}
# rules
rule data {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule metadata {
ruleset 1
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule rbd {
ruleset 2
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
# end crush map
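The text map above was dumped with something like:

    ceph osd getcrushmap -o /tmp/crush.bin          # grab the compiled map
    crushtool -d /tmp/crush.bin -o /tmp/crush.txt   # decompile it to text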
List of some stale pgs:
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
17.c6 0 0 0 0 0 0 0 0 stale+active+clean 2015-04-20 09:16:09.358613 0'0 2706:216 [19,13] 19 [19,13] 19 0'0 2015-04-16 02:29:34.882038 0'0 2015-04-16 02:29:34.882038
17.c7 0 0 0 0 0 0 0 0 stale+active+clean 2015-04-20 09:16:28.304621 0'0 2718:262 [15,18] 15 [15,18] 15 0'0 2015-04-20 09:15:39.363310 0'0 2015-04-20 09:15:39.363310
17.c1 0 0 0 0 0 0 0 0 stale+active+clean 2015-04-20 09:16:01.073681 0'0 2706:199 [19,16] 19 [19,16] 19 0'0 2015-04-15 12:37:11.741251 0'0 2015-04-15 12:37:11.741251
17.de 0 0 0 0 0 0 0 0 stale+active+undersized+degraded 2015-04-20 23:41:29.436796 0'0 2718:267 [15] 15 [15] 15 0'0 2015-04-13 07:56:01.760824 0'0 2015-04-13 07:56:01.760824
17.da 0 0 0 0 0 0 0 0 stale+active+undersized+degraded 2015-04-20 23:41:50.001087 0'0 2718:232 [14] 14 [14] 14 0'0 2015-04-19 15:45:53.304596 0'0 2015-04-19 15:45:53.304596
17.d9 0 0 0 0 0 0 0 0 stale+active+undersized+degraded 2015-04-20 23:41:29.472983 0'0 2718:270 [14] 14 [14] 14 0'0 2015-04-16 01:55:44.183550 0'0 2015-04-16 01:55:44.183550
17.d7 0 0 0 0 0 0 0 0 stale+active+undersized+degraded 2015-04-20 23:41:53.839134 0'0 2718:68 [17] 17 [17] 17 0'0 2015-04-16 00:06:27.998210 0'0 2015-04-16 00:06:27.998210
17.d5 0 0 0 0 0 0 0 0 stale+active+clean 2015-04-20 09:16:28.311352 0'0 2718:226 [18,17] 18 [18,17] 18 0'0 2015-04-15 20:52:33.372369 0'0 2015-04-15 20:52:33.372369
17.d0 0 0 0 0 0 0 0 0 stale+active+clean 2015-04-20 09:16:24.850188 0'0 2718:213 [15,12] 15 [15,12] 15 0'0 2015-04-19 15:40:32.215234 0'0 2015-04-19 15:40:32.215234
17.d1 0 0 0 0 0 0 0 0 stale+active+clean 2015-04-20 09:16:24.849996 0'0 2718:227 [15,12] 15 [15,12] 15 0'0 2015-04-15 19:03:38.137147 0'0 2015-04-15 19:03:38.137147
17.ae 0 0 0 0 0 0 0 0 stale+active+clean 2015-04-20 09:16:28.310506 0'0 2718:231 [18,12] 18 [18,12] 18 0'0 2015-04-16 02:23:35.031329 0'0 2015-04-16 02:23:35.031329
17.ac 0 0 0 0 0 0 0 0 stale+active+undersized+degraded 2015-04-20 23:41:50.002406 0'0 2718:66 [12] 12 [12] 12 0'0 2015-04-16 02:23:33.023476 0'0 2015-04-16 02:23:33.023476
17.aa 0 0 0 0 0 0 0 0 stale+active+clean 2015-04-20 09:16:25.983034 0'0 2718:213 [15,14] 15 [15,14] 15 0'0 2015-04-19 15:32:38.896039 0'0 2015-04-19 15:32:38.896039
17.ab 0 0 0 0 0 0 0 0 stale+active+clean 2015-04-20 09:16:24.836133 0'0 2718:260 [12,17] 12 [12,17] 12 0'0 2015-04-19 15:32:44.905707 0'0 2015-04-19 15:32:44.905707
17.a8 0 0 0 0 0 0 0 0 stale+active+clean 2015-04-20 09:16:09.361319 0'0 2706:212 [19,13] 19 [19,13] 19 0'0 2015-04-16 02:23:32.026015 0'0 2015-04-16 02:23:32.026015
17.a6 0 0 0 0 0 0 0 0 stale+active+undersized+degraded 2015-04-20 23:41:50.002804 0'0 2718:96 [18] 18 [18] 18 0'0 2015-04-20 14:02:29.334181 0'0 2015-04-20 14:02:29.334181
17.a4 0 0 0 0 0 0 0 0 stale+active+clean 2015-04-20 09:16:28.310707 0'0 2718:232 [18,17] 18 [18,17] 18 0'0 2015-04-16 02:22:12.018136 0'0 2015-04-16 02:22:12.018136
17.a2 0 0 0 0 0 0 0 0 stale+active+clean 2015-04-20 09:16:11.624952 0'0 2718:200 [15,17] 15 [15,17] 15 0'0 2015-04-15 10:42:37.880699 0'0 2015-04-15 10:42:37.880699
17.a0 0 0 0 0 0 0 0 0 stale+active+undersized+degraded 2015-04-20 23:41:29.469600 0'0 2718:66 [18] 18 [18] 18 0'0 2015-04-16 02:22:08.992748 0'0 2015-04-16 02:22:08.992748
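The listing above is from the PG dump output; the stale ones can also be
pulled out directly with, e.g.:

    ceph pg dump_stuck stale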
The OSDs of those PGs (both primary and secondary) are totally
gone, and I cannot find a way to repair them.
I've set up another machine with new drive partitions and
tried to re-create the OSDs I had removed on it, but those would
come up as osd.28, 29, etc. That's why I wondered how to change
the ID number of an OSD.
Regardless of the data loss (which I think has already
happened), I'd like to get the Ceph service back to normal asap.
Is there any way to deal with those stale PGs? (Such as
re-creating the OSDs they need, injecting existing OSDs into
those PGs, or even killing those PGs?)
And since I'm not experienced, I may need more concrete
guidance (i.e. the approach with actual ceph commands). Many thanks for
your help.
Best Regards,
FaHui
On 2015/4/23 at 10:53 PM, Robert LeBlanc wrote:
A full CRUSH dump would be helpful, as well
as knowing which OSDs you took out. If you didn't take
17 out as well as 15, then you might be OK. If the OSDs
still show up in your CRUSH map, then try to remove them
from the CRUSH map with 'ceph osd crush rm osd.15'.
If you took out both OSDs, you will need to use
some of the recovery tools. I believe the procedure is
roughly: mount the drive in another box, extract the
PGs needed, then shut down the primary OSD for that
PG, inject the PG into the OSD, then start it up and
it should replicate. I haven't done it myself
(probably something I should do in case I ever run
into the problem).
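A rough, unverified sketch of that with ceph-objectstore-tool (the paths,
OSD id, and PG id are all placeholders):

    # on the box holding the old disk: export the PG from its data dir
    ceph-objectstore-tool --data-path /mnt/old-osd \
        --journal-path /mnt/old-osd/journal \
        --op export --pgid 17.c6 --file /tmp/17.c6.export
    # on the current primary for that PG: stop that OSD, import, start it again
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
        --journal-path /var/lib/ceph/osd/ceph-0/journal \
        --op import --file /tmp/17.c6.export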