DR practice: "uuid != super.uuid" and csum error at blob offset 0x0

My main question is this - is there a way to stop any replay or
journaling during OSD startup and bring up the pool/fs in read-only
mode?

Here is a description of what I'm seeing.  I have a Luminous cluster
with CephFS and 16 8TB SSDs, using size=3.

I had a problem with one of my SAS controllers, and now I have at
least 3 OSDs that refuse to start.  The hardware appears to be fine
now.

I have my essential data backed up, but there are a few files that I
wouldn't mind saving so I want to use this as disaster recovery
practice.

The two problems I am seeing are:

1) On two of the OSDs, there is a replay error at startup after quite
a few blocks have been replayed successfully:

2019-07-06 16:08:05.281063 7f6baec66e40 10 bluefs _replay 0x1543000:
stop: uuid c366a2d6-e221-98b3-59fe-0f324c9dac8e != super.uuid
263428d5-8963-4339-8815-92ab6067e7a4
2019-07-06 16:08:05.281064 7f6baec66e40 10 bluefs _replay log file
size was 0x1543000
2019-07-06 16:08:05.281085 7f6baec66e40 -1 bluefs _replay file with
link count 0: file(ino 1485 size 0x15f4c43 mtime 2019-07-04
20:39:39.387601 bdev 1 allocated 1600000 extents
[1:0x35771500000+100000,1:0x35771600000+100000,1:0x35771700000+100000,1:0x35771c00000+100000,1:0x35771d00000+100000,1:0x35772200000+100000,1:0x35772300000+100000,1:0x35772800000+100000,1:0x35772900000+100000,1:0x35772a00000+100000,1:0x35772b00000+100000,1:0x35772c00000+100000,1:0x35772d00000+100000,1:0x35772e00000+100000,1:0x35773300000+100000,1:0x35773400000+100000,1:0x35773500000+100000,1:0x35773600000+100000,1:0x35773700000+100000,1:0x35773800000+100000,1:0x35773900000+100000,1:0x35773a00000+100000])
2019-07-06 16:08:05.281093 7f6baec66e40 -1 bluefs mount failed to
replay log: (5) Input/output error


2) The following error happens on at least two OSDs:

2019-07-06 15:58:46.621008 7fdcee030e40 -1
bluestore(/var/lib/ceph/osd/ceph-74) _verify_csum bad crc32c/0x1000
checksum at blob offset 0x0, got 0x147db0c5, expected 0x8f052c9,
device location [0x10000~1000], logical extent 0x0~1000, object
#-1:7b3f43c4:::osd_superblock:0#


The system was archiving some unimportant files when it went down, so
I really don't care about any of the recent writes.

What are my recovery options here?  I was thinking that disabling
replay and bringing things up read-only would be feasible, but maybe
there are better options?

Thanks,
Mark
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


