Thanks Ronny.

I decided to try to tar everything under the current directory. Is this the correct command for it? Is there any directory we do not want on the new drive? commit_op_seq, meta, nosnap, omap?

tar --xattrs --preserve-permissions -zcvf osd.4.tar.gz .
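If any of those turn out to be unwanted, I assume I could just exclude them at archive time, something like this (omap is only an example here, not something I have confirmed is safe to skip):

# same archive, but skipping a directory that may not be wanted (unverified guess)
tar --xattrs --preserve-permissions \
    --exclude=./omap \
    -zcvf osd.4.tar.gz .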
As far as the inconsistent PGs... I am running into these errors. I tried moving one copy of the pg to another location, but it just says the moved shard is missing. Tried setting 'noout' and taking one of them down; that seems to work on something, but then it is back to the same error. Currently trying to move to a different osd... making sure the drive is not faulty, got a few of them... but it still persists. I've been kicking off ceph pg repair PG#, hoping it would fix them. =P Any other suggestions?
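(I have also been poking at the inconsistencies with the commands below; I believe this is the usual way to see which shards disagree on a Jewel-era cluster, using pg 0.29 from the log as an example, but so far it has not told me much more than the repair output.)

ceph health detail
rados list-inconsistent-obj 0.29 --format=json-pretty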
2017-09-20 09:39:48.481400 7f163c5fa700 0 log_channel(cluster) log [INF] : 0.29 repair starts
2017-09-20 09:47:37.384921 7f163c5fa700 -1 log_channel(cluster) log [ERR] : 0.29 shard 6: soid 0:97126ead:::200014ce4c3.0000028f:head data_digest 0x8f679a50 != data_digest 0x979f2ed4 from auth oi 0:97126ead:::200014ce4c3.0000028f:head(19366'539375 client.535319.1:2361163 dirty|data_digest|omap_digest s 4194304 uv 539375 dd 979f2ed4 od ffffffff alloc_hint [0 0])
2017-09-20 09:47:37.384931 7f163c5fa700 -1 log_channel(cluster) log [ERR] : 0.29 shard 7: soid 0:97126ead:::200014ce4c3.0000028f:head data_digest 0x8f679a50 != data_digest 0x979f2ed4 from auth oi 0:97126ead:::200014ce4c3.0000028f:head(19366'539375 client.535319.1:2361163 dirty|data_digest|omap_digest s 4194304 uv 539375 dd 979f2ed4 od ffffffff alloc_hint [0 0])
2017-09-20 09:47:37.384936 7f163c5fa700 -1 log_channel(cluster) log [ERR] : 0.29 soid 0:97126ead:::200014ce4c3.0000028f:head: failed to pick suitable auth object
2017-09-20 09:48:11.138566 7f1639df5700 -1 log_channel(cluster) log [ERR] : 0.29 shard 6: soid 0:97d5c15a:::100000101b4.00006892:head data_digest 0xd65b4014 != data_digest 0xf41cfab8 from auth oi 0:97d5c15a:::100000101b4.00006892:head(12962'65557 osd.4.0:42234 dirty|data_digest|omap_digest s 4194304 uv 776 dd f41cfab8 od ffffffff alloc_hint [0 0])
2017-09-20 09:48:11.138575 7f1639df5700 -1 log_channel(cluster) log [ERR] : 0.29 shard 7: soid 0:97d5c15a:::100000101b4.00006892:head data_digest 0xd65b4014 != data_digest 0xf41cfab8 from auth oi 0:97d5c15a:::100000101b4.00006892:head(12962'65557 osd.4.0:42234 dirty|data_digest|omap_digest s 4194304 uv 776 dd f41cfab8 od ffffffff alloc_hint [0 0])
2017-09-20 09:48:11.138581 7f1639df5700 -1 log_channel(cluster) log [ERR] : 0.29 soid 0:97d5c15a:::100000101b4.00006892:head: failed to pick suitable auth object
2017-09-20 09:48:55.584022 7f1639df5700 -1 log_channel(cluster) log [ERR] : 0.29 repair 4 errors, 0 fixed
Latest health...

HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs down; 1 pgs incomplete; 9 pgs inconsistent; 1 pgs repair; 1 pgs stuck inactive; 1 pgs stuck unclean; 68 scrub errors; mds rank 0 has failed; mds cluster is degraded; no legacy OSD present but 'sortbitwise' flag is not set
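(I am guessing that last warning could be cleared with the flag below, assuming all my OSDs are new enough for sortbitwise, but I have not set it yet.)

ceph osd set sortbitwise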
Regards,
Hong
On 20.09.2017 16:49, hjcho616 wrote:
Anyone? Can this PG be saved? If not, what are my options?

Regards,
Hong
Looking better... working on scrubbing...

HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs incomplete; 12 pgs inconsistent; 2 pgs repair; 1 pgs stuck inactive; 1 pgs stuck unclean; 109 scrub errors; too few PGs per OSD (29 < min 30); mds rank 0 has failed; mds cluster is degraded; noout flag(s) set; no legacy OSD present but 'sortbitwise' flag is not set
Now PG1.28... I looked at all the old osds, dead or alive. The only one with a DIR_* directory for it is osd.4. This appears to be the metadata pool! 21M of metadata can be quite a bit of stuff... so I would like to rescue this! But I am not able to start this OSD, and exporting through ceph-objectstore-tool appears to crash, even with --skip-journal-replay and --skip-mount-omap (different failure). As I mentioned in an earlier email, that 'exception thrown' message is bogus...
# ceph-objectstore-tool --op export --pgid 1.28 --data-path /var/lib/ceph/osd/ceph-4 --journal-path /var/lib/ceph/osd/ceph-4/journal --file ~/1.28.export
terminate called after throwing an instance of 'std::domain_error'
[SNIP]
What can I do to save that PG1.28? Please let me know if you need more information. So close!... =)
12 inconsistent PGs and 109 scrub errors are something you should fix first of all. You can also consider using the paid services of one of the many ceph support companies that specialize in this kind of situation.
--
That being said, here are some suggestions...

When it comes to lost-object recovery you have come about as far as I have ever experienced, so everything after this is just assumptions and wild guesswork as to what you can try. I hope others shout out if I tell you wildly wrong things.
If you have found data for pg 1.28 on the broken osd, and have checked all the other working and non-working drives for that pg, then you need to try and extract the pg from the broken drive. As always in recovery cases, take a dd clone of the drive and work from the cloned image, to avoid more damage to the drive and to allow you to try multiple times.
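Something like this, assuming you clone the osd's data partition (device names and paths are just examples, and ddrescue is an alternative if the drive throws read errors):

# clone the failing data partition to an image, continuing past read errors
dd if=/dev/sdX1 of=/mnt/backup/osd4-broken.img bs=4M conv=noerror,sync status=progress
# work from a read-only loop mount of that image
mount -o loop,ro /mnt/backup/osd4-broken.img /mnt/osd4-clone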
You should add a temporary injection drive large enough for that pg, and set its crush weight to 0 so it always drains. Make sure it is up and registered properly in ceph.
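For the weight part I would expect something like this (osd.10 is just a placeholder id for the injection osd):

# pin the injection osd's crush weight to 0 so data only drains off it
ceph osd crush reweight osd.10 0
# confirm it is up, in, and weighted 0
ceph osd tree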
The idea is to copy the pg manually from the broken osd to the injection drive, since the export/import fails, making sure you get all xattrs included. One can either copy the whole pg or just the "missing" objects; if there are few objects I would go for that, if there are many I would take the whole pg. You won't get data from leveldb, so I am not at all sure this would work, but it is worth a shot.
- Stop your injection osd, verify it is down and the process is not running.
- From the mountpoint of your broken osd go into the current directory and tar up pg 1.28; make sure you use -p and --xattrs when you create the archive (see the command sketch after this list).
- If tar errors out on unreadable files, just rm those (since you are working on a copy of your rescue image, you can always try again).
- Copy the tar file to the injection drive and extract it while sitting in that drive's current directory (remember --xattrs).
- Set debug options for the injection drive in ceph.conf.
- Start the injection drive and follow along in the log file. Hopefully it should scan, locate the pg, and replicate the pg 1.28 objects off to the current primary drive for pg 1.28; and since it has crush weight 0 it should drain out.
- If that works, verify the injection drive is drained, stop it and remove it from ceph, then zap the drive.
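Roughly along these lines; the paths and osd id are only examples, and I am assuming the filestore pg directory is named 1.28_head, so double-check against what you actually see on disk:

# on the cloned image of the broken osd.4
cd /mnt/osd4-clone/current
tar --xattrs -p -zcvf /root/pg1.28.tar.gz 1.28_head

# on the stopped injection osd (osd.10 as a placeholder)
cd /var/lib/ceph/osd/ceph-10/current
tar --xattrs -p -zxvf /root/pg1.28.tar.gz
chown -R ceph:ceph 1.28_head   # if your osds run as the ceph user

# start it and follow the log
systemctl start ceph-osd@10
tail -f /var/log/ceph/ceph-osd.10.log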
This is all, as I said, guesstimates, so your mileage may vary.
good luck
Ronny Aasen