Command for trying the export was:

[rook@rook-ceph-tools-recovery-77495958d9-plfch ~]$ rados export -p cephfs-replicated /mnt/recovery/backup-rados-cephfs-replicated

We made sure we had enough space for this operation, and mounted the /mnt/recovery path using a hostPath volume in the modified rook "toolbox" deployment; a sketch of that patch, and of the recovery commands discussed further down the thread, follows.
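For the archives, a minimal sketch of adding such a hostPath mount to the toolbox deployment. This is illustrative rather than our exact diff: the volume name, and the assumption that the stock manifest already has volumes/volumeMounts arrays with the tools container at index 0, are mine:

  kubectl -n rook-ceph patch deployment rook-ceph-tools --type=json -p='[
    {"op": "add", "path": "/spec/template/spec/volumes/-",
     "value": {"name": "recovery", "hostPath": {"path": "/mnt/recovery"}}},
    {"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts/-",
     "value": {"name": "recovery", "mountPath": "/mnt/recovery"}}
  ]'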
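To make the ceph-objectstore-tool idea from my earlier mail (quoted below) concrete, my understanding is that it would go roughly as follows. The OSD id (osd.2), PG id (4.1f), and data path are placeholders, and the OSD must be down while its store is touched, which in a rook cluster means scaling its deployment to zero first:

  # stop the OSD that holds the incomplete PG
  kubectl -n rook-ceph scale deployment rook-ceph-osd-2 --replicas=0

  # mark the PG complete on that OSD's store (dangerous: accepts whatever
  # data is there as the authoritative copy)
  ceph-objectstore-tool --data-path /var/lib/rook/rook-ceph/<osd-2-dir> \
      --pgid 4.1f --op mark-complete

  # or the equally dangerous peering override, set per affected OSD and
  # reverted as soon as the PGs have peered
  ceph config set osd.2 osd_find_best_info_ignore_history_les true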
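And for anyone who lands on this thread later with dead but still readable OSDs, the PG export/import route Matthias describes below would, as far as I understand it, look roughly like this (placeholder paths and PG id again, with the respective OSD stopped while its store is open):

  # export the PG from the old OSD's store...
  ceph-objectstore-tool --data-path <old-osd-data-path> \
      --pgid 4.1f --op export --file /mnt/recovery/pg-4.1f.export

  # ...and import it into a surviving OSD's store
  ceph-objectstore-tool --data-path <dest-osd-data-path> \
      --op import --file /mnt/recovery/pg-4.1f.export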
Regards

On Mon, Jun 17, 2024 at 11:56 AM cellosofia1@xxxxxxxxx <cellosofia1@xxxxxxxxx> wrote:

> Hi,
>
> I understand.
>
> We had to re-create the OSDs because of a backing-storage hardware failure,
> so recovering from the old OSDs is not possible.
>
> From your current understanding, is there a possibility to recover at least
> some of the information, i.e. the fragments that are not missing?
>
> I ask this because I tried to export the pool contents, but it gets stuck
> (I/O blocked) on the "incomplete" PGs. Maybe marking the PGs as complete
> with ceph-objectstore-tool would be an option? Or using the *dangerous*
> osd_find_best_info_ignore_history_les option for the affected OSDs?
>
> Regards
> Pablo
>
> On Mon, Jun 17, 2024 at 11:46 AM Matthias Grandl <matthias.grandl@xxxxxxxx> wrote:
>
>> We are missing info here. Ceph status claims all OSDs are up. Did an OSD
>> die, and was it already removed from the CRUSH map? If so, the only chance
>> I see of preventing data loss is exporting the PGs off that OSD and
>> importing them into another OSD. But yeah, like David, I am not too
>> optimistic.
>>
>> Matthias Grandl
>> Head Storage Engineer
>> matthias.grandl@xxxxxxxx
>>
>> On 17. Jun 2024, at 17:26, cellosofia1@xxxxxxxxx wrote:
>>
>> Hi everyone,
>>
>> Thanks for your kind responses.
>>
>> I know the following is not the best scenario, but sadly I didn't have the
>> opportunity of installing this cluster.
>>
>> More information about the problem:
>>
>> * We use replicated pools
>> * Replica 2, min replicas 1
>> * Ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable)
>> * Virtual machine setup: 2 MGR nodes, 2 OSD nodes, 4 VMs in total
>> * 27 OSDs right now
>> * Rook environment: rook: v1.9.5
>> * Kubernetes Server Version: v1.24.1
>>
>> I attach a .txt with the result of some diagnostic commands for reference.
>>
>> What do you think?
>>
>> Regards
>> Pablo
>>
>> On Mon, Jun 17, 2024 at 11:01 AM Matthias Grandl <matthias.grandl@xxxxxxxx> wrote:
>>
>>> Ah, scratch that: my first paragraph about replicated pools is actually
>>> incorrect. If it's a replicated pool and it shows incomplete, it means the
>>> most recent copy of the PG is missing. So the ideal would be to recover
>>> the PG from dead OSDs in any case, if possible.
>>>
>>> Matthias Grandl
>>> Head Storage Engineer
>>> matthias.grandl@xxxxxxxx
>>>
>>> > On 17. Jun 2024, at 16:56, Matthias Grandl <matthias.grandl@xxxxxxxx> wrote:
>>> >
>>> > Hi Pablo,
>>> >
>>> > It depends. If it's a replicated setup, it might be as easy as marking
>>> > dead OSDs as lost to get the PGs to recover. In that case it basically
>>> > just means that you are below the pool's min_size.
>>> >
>>> > If it is an EC setup, it might be quite a bit more painful, depending
>>> > on what happened to the dead OSDs and whether they are at all
>>> > recoverable.
>>> >
>>> > Matthias Grandl
>>> > Head Storage Engineer
>>> > matthias.grandl@xxxxxxxx
>>> >
>>> >> On 17. Jun 2024, at 16:46, David C. <david.casier@xxxxxxxx> wrote:
>>> >>
>>> >> Hi Pablo,
>>> >>
>>> >> Could you tell us a little more about how that happened?
>>> >>
>>> >> Do you have min_size >= 2 (or the EC equivalent)?
>>> >> ________________________________________________________
>>> >>
>>> >> Regards,
>>> >>
>>> >> *David CASIER*
>>> >> ________________________________________________________
>>> >>
>>> >> On Mon, Jun 17, 2024 at 4:26 PM, cellosofia1@xxxxxxxxx
>>> >> <cellosofia1@xxxxxxxxx> wrote:
>>> >>
>>> >>> Hi community!
>>> >>>
>>> >>> Recently we had a major outage in production, and after running the
>>> >>> automated ceph recovery, some PGs remain in the "incomplete" state
>>> >>> and I/O operations are blocked.
>>> >>>
>>> >>> Searching the documentation, forums, and this mailing list archive,
>>> >>> I haven't yet found whether this means the data is recoverable or
>>> >>> not. We don't have any "unknown" objects or PGs, so I believe this
>>> >>> is somehow an intermediate stage where we have to tell ceph which
>>> >>> version of the objects to recover from.
>>> >>>
>>> >>> We are willing to work with a Ceph consultant specialist, because
>>> >>> the data at stake is very critical, so if you're interested please
>>> >>> let me know off-list to discuss the details.
>>> >>>
>>> >>> Thanks in advance
>>> >>>
>>> >>> Best Regards
>>> >>> Pablo
>>
>> <diagnostic-commands-ceph.txt>

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx