Hi Patrick,
this was all on version 17.2.7. The mon store had to be rebuilt from
the OSDs, so the MDS map was lost. After recovering the Ceph cluster
itself, we inspected the journal before continuing with the disaster
recovery, and it reported missing objects. This was the output (I had
posted it in a different thread on the ceph-users mailing list before
creating this thread):
---snip---
cephfs-journal-tool --rank=storage:0 --journal=mdlog journal inspect
2023-12-08T15:35:22.922+0200 7f834d0320c0 -1 Missing object 200.000527c4
2023-12-08T15:35:22.938+0200 7f834d0320c0 -1 Bad entry start ptr
(0x149f140067f) at 0x149f1174595
2023-12-08T15:35:22.942+0200 7f834d0320c0 -1 Bad entry start ptr
(0x149f1400e66) at 0x149f1174d7c
2023-12-08T15:35:22.954+0200 7f834d0320c0 -1 Bad entry start ptr
(0x149f1401642) at 0x149f1175558
2023-12-08T15:35:22.970+0200 7f834d0320c0 -1 Bad entry start ptr
(0x149f1401e29) at 0x149f1175d3f
2023-12-08T15:35:22.974+0200 7f834d0320c0 -1 Bad entry start ptr
(0x149f1402610) at 0x149f1176526
2023-12-08T15:35:22.978+0200 7f834d0320c0 -1 Missing object 200.000527ca
2023-12-08T15:35:22.978+0200 7f834d0320c0 -1 Missing object 200.000527cb
2023-12-08T15:35:22.994+0200 7f834d0320c0 -1 Bad entry start ptr
(0x149f30008f4) at 0x149f2d7480a
2023-12-08T15:35:22.998+0200 7f834d0320c0 -1 Bad entry start ptr
(0x149f3000ced) at 0x149f2d74c03
Overall journal integrity: DAMAGED
Objects missing:
0x527c4
0x527ca
0x527cb
Corrupt regions:
0x149f0d73f16-149f1174595
0x149f1174595-149f1174d7c
0x149f1174d7c-149f1175558
0x149f1175558-149f1175d3f
0x149f1175d3f-149f1176526
0x149f1176526-149f2d7480a
0x149f2d7480a-149f2d74c03
0x149f2d74c03-ffffffffffffffff
cephfs-journal-tool --rank=storage:0 --journal=purge_queue journal inspect
2023-12-08T15:35:57.691+0200 7f331621e0c0 -1 Missing object 500.00000dc6
Overall journal integrity: DAMAGED
Objects missing:
0xdc6
Corrupt regions:
0x3718522e9-ffffffffffffffff
---snip---
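For completeness, the disaster recovery steps I mentioned are the ones
from the upstream disaster-recovery-experts documentation; roughly this
sequence, with the rank adapted to this cluster, so a sketch rather
than our exact shell history:
---snip---
# back up the MDS journal before modifying anything
cephfs-journal-tool --rank=storage:0 journal export backup.bin
# salvage whatever dentries can still be read from the damaged journal
cephfs-journal-tool --rank=storage:0 event recover_dentries summary
# truncate the damaged journals (mdlog and purge_queue)
cephfs-journal-tool --rank=storage:0 --journal=mdlog journal reset
cephfs-journal-tool --rank=storage:0 --journal=purge_queue journal reset
# wipe the session table
cephfs-table-tool all reset session
# only if the MDS map itself needs to be reset to a single rank
ceph fs reset storage --yes-i-really-mean-it
---snip---
All of the above operate on rank 0 of the "storage" filesystem,
matching the inspect commands above.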
Quoting Patrick Donnelly <pdonnell@xxxxxxxxxx>:
On Mon, Dec 11, 2023 at 6:38 AM Eugen Block <eblock@xxxxxx> wrote:
Hi,
I'm trying to help someone with a broken CephFS. We managed to recover
basic Ceph functionality, but the CephFS is still inaccessible
(currently read-only). We went through the disaster recovery steps,
but to no avail. Here's a snippet from the startup logs:
---snip---
mds.0.41 Booting: 2: waiting for purge queue recovered
mds.0.journaler.pq(ro) _finish_probe_end write_pos = 14797504512
(header had 14789452521). recovered.
mds.0.purge_queue operator(): open complete
mds.0.purge_queue operator(): recovering write_pos
monclient: get_auth_request con 0x55c280bc5c00 auth_method 0
monclient: get_auth_request con 0x55c280ee0c00 auth_method 0
mds.0.journaler.pq(ro) _finish_read got error -2
mds.0.purge_queue _recover: Error -2 recovering write_pos
mds.0.purge_queue _go_readonly: going readonly because internal IO
failed: No such file or directory
mds.0.journaler.pq(ro) set_readonly
mds.0.41 unhandled write error (2) No such file or directory, force
readonly...
mds.0.cache force file system read-only
force file system read-only
---snip---
I've added the dev mailing list; maybe someone can give some advice on
how to continue from here (we could try to recover with an empty
metadata pool). Or is this FS lost?
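(By recovering with an empty metadata pool I mean, roughly, the
cephfs-data-scan based rebuild from the data pool as described in the
docs; <data pool> is a placeholder here:
---snip---
# rebuild metadata from the contents of the data pool
cephfs-data-scan init
# scan_extents/scan_inodes can be parallelized with --worker_n/--worker_m
cephfs-data-scan scan_extents <data pool>
cephfs-data-scan scan_inodes <data pool>
cephfs-data-scan scan_links
---snip---
)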
Looks like one of the purge queue journal objects was lost? Were other
objects lost? It would be helpful to know more about the circumstances
of this "broken CephFS". What Ceph version?
--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D