Hi all,

I'm investigating a problem with a degenerated PG on an Octopus 15.2.16 test cluster. It has 3 hosts x 3 OSDs and a 4+2 EC pool with failure domain OSD. After simulating a disk failure by removing an OSD and letting the cluster recover (all under load), I end up with a PG that has the same OSD allocated twice:

    PG 4.1c, UP: [6,1,4,5,3,NONE]  ACTING: [6,1,4,5,3,1]

OSD 1 is allocated twice. How is this even possible?

Here is the OSD tree:

    ID  CLASS  WEIGHT   TYPE NAME          STATUS     REWEIGHT  PRI-AFF
    -1         2.44798  root default
    -7         0.81599      host tceph-01
     0    hdd  0.27199          osd.0      up          0.87999  1.00000
     3    hdd  0.27199          osd.3      up          0.98000  1.00000
     6    hdd  0.27199          osd.6      up          0.92999  1.00000
    -3         0.81599      host tceph-02
     2    hdd  0.27199          osd.2      up          0.95999  1.00000
     4    hdd  0.27199          osd.4      up          0.89999  1.00000
     8    hdd  0.27199          osd.8      up          0.89999  1.00000
    -5         0.81599      host tceph-03
     1    hdd  0.27199          osd.1      up          0.89999  1.00000
     5    hdd  0.27199          osd.5      up          1.00000  1.00000
     7    hdd  0.27199          osd.7      destroyed         0  1.00000

I have already tried changing some tunables with https://docs.ceph.com/en/octopus/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon in mind (the crushtool steps are in the P.S. below), but "giving up too soon" is obviously not the problem here: CRUSH is accepting a wrong mapping. Is there a way out of this? Clearly this is asking for trouble, if not data loss, and should not happen at all.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
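
P.S. For reference, this is roughly the procedure from the linked troubleshooting page that I used when experimenting with the tunables. The rule id and the --min-x/--max-x test range below are placeholders, not values from this cluster; --num-rep 6 corresponds to k+m = 4+2:

    # export the binary CRUSH map and decompile it for inspection
    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt

    # simulate the EC rule offline for a range of inputs;
    # --show-bad-mappings reports every input that does not
    # map to the requested 6 OSDs
    crushtool -i crush.bin --test --show-bad-mappings \
        --rule 1 --num-rep 6 --min-x 0 --max-x 1023

    # raise choose_total_tries offline and re-run the test;
    # only inject the new map if the test comes back clean
    crushtool -i crush.bin --set-choose-total-tries 100 -o crush.new.bin
    crushtool -i crush.new.bin --test --show-bad-mappings \
        --rule 1 --num-rep 6 --min-x 0 --max-x 1023
    ceph osd setcrushmap -i crush.new.bin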