On Fri, 30 July 2021 at 15:22, Thierry MARTIN <thierrymartin1942@xxxxxxxxxx> wrote:
> Hi all!
> We are facing strange behavior from two clusters we have at work (both v15.2.9 / CentOS 7.9):
> * In the 1st cluster we are getting errors about multiple degraded PGs, and all of them are linked to a "rogue" OSD whose ID is very large ("osd.2147483647"). This OSD doesn't show up in "ceph osd tree", and what is even weirder is that it doesn't always appear (about every 5-10 minutes)... but when it does, a lot of PGs get degraded.
>

The large OSD number (2147483647, the largest value of a signed 32-bit int) just means the cluster has no info about the OSD that held this part, so it is Ceph's way of saying "unknown OSD". As to why you see it in a normally running cluster without long-running outages, I don't know.

I would run "ceph pg dump" repeatedly and watch one of the affected PGs until you see how its OSD list looks with and without this rogue OSD, to see which OSD is acting up. The list is the numbers inside the []s, so when [73,12,45,33] turns into [73,2147483647,45,33] you know that osd.12 is doing something fishy.

--
May the most significant bit of your life be positive.
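
The list comparison described above can be sketched in a few lines of Python. This is only an illustration, assuming you have copied the two OSD lists by hand from the "ceph pg dump" output. The names NO_OSD and missing_osds are made up for this example and are not part of any Ceph tool.

```python
# Placeholder printed in a PG's OSD list when a slot has no known OSD.
# It equals 2**31 - 1, the largest signed 32-bit integer.
NO_OSD = 2**31 - 1  # 2147483647


def missing_osds(before, after):
    """Return the OSD IDs from `before` whose slot shows NO_OSD in `after`.

    Both arguments are OSD lists for the same PG, copied by hand from
    two runs of "ceph pg dump" (hypothetical input, not parsed here).
    """
    return [
        old
        for old, new in zip(before, after)
        if new == NO_OSD and old != NO_OSD
    ]


healthy = [73, 12, 45, 33]
degraded = [73, 2147483647, 45, 33]
print(missing_osds(healthy, degraded))  # [12]
```

If the same OSD ID keeps coming back from this comparison across several affected PGs, that is the OSD to look at first.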