Hi Patrick,
this was all on version 17.2.7. The mon store had to be rebuilt from
the OSDs, so the MDS map was lost. After recovering the Ceph cluster
itself, we inspected the journal before continuing with the disaster
recovery, and it reported missing objects. This was the output (I had
posted it in a different thread on the ceph-users mailing list before
creating this thread):
---snip---
cephfs-journal-tool --rank=storage:0 --journal=mdlog journal inspect
2023-12-08T15:35:22.922+0200 7f834d0320c0 -1 Missing object 200.000527c4
2023-12-08T15:35:22.938+0200 7f834d0320c0 -1 Bad entry start ptr
(0x149f140067f) at 0x149f1174595
2023-12-08T15:35:22.942+0200 7f834d0320c0 -1 Bad entry start ptr
(0x149f1400e66) at 0x149f1174d7c
2023-12-08T15:35:22.954+0200 7f834d0320c0 -1 Bad entry start ptr
(0x149f1401642) at 0x149f1175558
2023-12-08T15:35:22.970+0200 7f834d0320c0 -1 Bad entry start ptr
(0x149f1401e29) at 0x149f1175d3f
2023-12-08T15:35:22.974+0200 7f834d0320c0 -1 Bad entry start ptr
(0x149f1402610) at 0x149f1176526
2023-12-08T15:35:22.978+0200 7f834d0320c0 -1 Missing object 200.000527ca
2023-12-08T15:35:22.978+0200 7f834d0320c0 -1 Missing object 200.000527cb
2023-12-08T15:35:22.994+0200 7f834d0320c0 -1 Bad entry start ptr
(0x149f30008f4) at 0x149f2d7480a
2023-12-08T15:35:22.998+0200 7f834d0320c0 -1 Bad entry start ptr
(0x149f3000ced) at 0x149f2d74c03
Overall journal integrity: DAMAGED
Objects missing:
0x527c4
0x527ca
0x527cb
Corrupt regions:
0x149f0d73f16-149f1174595
0x149f1174595-149f1174d7c
0x149f1174d7c-149f1175558
0x149f1175558-149f1175d3f
0x149f1175d3f-149f1176526
0x149f1176526-149f2d7480a
0x149f2d7480a-149f2d74c03
0x149f2d74c03-ffffffffffffffff
cephfs-journal-tool --rank=storage:0 --journal=purge_queue journal inspect
2023-12-08T15:35:57.691+0200 7f331621e0c0 -1 Missing object 500.00000dc6
Overall journal integrity: DAMAGED
Objects missing:
0xdc6
Corrupt regions:
0x3718522e9-ffffffffffffffff
---snip---
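For completeness, the disaster recovery steps I mentioned are the ones
from the upstream disaster-recovery-experts documentation; roughly this
sequence, with the rank adapted to this cluster, so a sketch rather
than our exact shell history:
---snip---
# back up the MDS journal before modifying anything
cephfs-journal-tool --rank=storage:0 journal export backup.bin
# salvage whatever dentries can still be read from the damaged journal
cephfs-journal-tool --rank=storage:0 event recover_dentries summary
# truncate the damaged journals (mdlog and purge_queue)
cephfs-journal-tool --rank=storage:0 --journal=mdlog journal reset
cephfs-journal-tool --rank=storage:0 --journal=purge_queue journal reset
# wipe the session table
cephfs-table-tool all reset session
# only if the MDS map itself needs to be reset to a single rank
ceph fs reset storage --yes-i-really-mean-it
---snip---
All of the above operate on rank 0 of the "storage" filesystem,
matching the inspect commands above.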
Quoting Patrick Donnelly <pdonnell@xxxxxxxxxx>:
On Mon, Dec 11, 2023 at 6:38 AM Eugen Block <eblock@xxxxxx> wrote:
Hi,
I'm trying to help someone with a broken CephFS. We managed to recover
basic Ceph functionality, but the CephFS is still inaccessible
(currently read-only). We went through the disaster recovery steps,
but to no avail. Here's a snippet from the startup logs:
---snip---
mds.0.41 Booting: 2: waiting for purge queue recovered
mds.0.journaler.pq(ro) _finish_probe_end write_pos = 14797504512
(header had 14789452521). recovered.
mds.0.purge_queue operator(): open complete
mds.0.purge_queue operator(): recovering write_pos
monclient: get_auth_request con 0x55c280bc5c00 auth_method 0
monclient: get_auth_request con 0x55c280ee0c00 auth_method 0
mds.0.journaler.pq(ro) _finish_read got error -2
mds.0.purge_queue _recover: Error -2 recovering write_pos
mds.0.purge_queue _go_readonly: going readonly because internal IO
failed: No such file or directory
mds.0.journaler.pq(ro) set_readonly
mds.0.41 unhandled write error (2) No such file or directory, force
readonly...
mds.0.cache force file system read-only
force file system read-only
---snip---
I've added the dev mailing list; maybe someone can give some advice on
how to continue from here (we could try to recover with an empty
metadata pool). Or is this FS lost?
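(By recovering with an empty metadata pool I mean, roughly, the
cephfs-data-scan based rebuild from the data pool as described in the
docs; <data pool> is a placeholder here:
---snip---
# rebuild metadata from the contents of the data pool
cephfs-data-scan init
# scan_extents/scan_inodes can be parallelized with --worker_n/--worker_m
cephfs-data-scan scan_extents <data pool>
cephfs-data-scan scan_inodes <data pool>
cephfs-data-scan scan_links
---snip---
)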
Looks like one of the purge queue journal objects was lost? Were other
objects lost? It would be helpful to know more about the circumstances
of this "broken CephFS". What Ceph version?
--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D