Re: Rogue osd / CephFS / Adding osd

Janne Johansson <icepic.dz@xxxxxxxxx> · Fri, 30 Jul 2021 15:39:24 +0200

On Fri, 30 Jul 2021 at 15:22, Thierry MARTIN
<thierrymartin1942@xxxxxxxxxx> wrote:
> Hi all!
> We are seeing strange behavior on two clusters we have at work (both v15.2.9 / CentOS 7.9):
>   *   In the first cluster we are getting errors about multiple degraded PGs, and all of them are linked to a "rogue" OSD whose ID is very large ("osd.2147483647"). This OSD doesn't show up in "ceph osd tree", and what is even weirder is that it only appears intermittently (about every 5 to 10 minutes)... but when it does, a lot of PGs become degraded.
>

The large OSD number (2147483647, the largest value a signed 32-bit int
can hold) just means the cluster has no info about the OSD that held this
part, so it is Ceph's way of saying "unknown OSD".
As to why you see it in a normally running cluster without long-running
outages, I don't know. I would run "ceph pg dump" repeatedly and watch one
of the affected PGs until you see how its OSD list looks with and without
this rogue OSD, to see which OSD is acting up. The list is the numbers
inside []s, so when [73,12,45,33] turns into [73,2147483647,45,33] you
know that osd.12 is doing something fishy.
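
The comparison described above can be sketched in a few lines of Python.
This is only an illustration: it assumes you have copied the two OSD lists
by hand from the dump output, and the names NONE_OSD and missing_osds are
made up for this example, not part of any Ceph tool.

```python
# Placeholder ID shown when no OSD is known for a slot (assumed constant
# name; the value is the largest signed 32-bit integer).
NONE_OSD = 2**31 - 1  # 2147483647


def missing_osds(before, after):
    """Return the OSD IDs from `before` whose slot holds the placeholder in `after`.

    Both arguments are OSD lists for the same PG, compared position by position.
    """
    return [
        old
        for old, new in zip(before, after)
        if new == NONE_OSD and old != NONE_OSD
    ]


if __name__ == "__main__":
    healthy = [73, 12, 45, 33]
    degraded = [73, 2147483647, 45, 33]
    print(missing_osds(healthy, degraded))
```

If the same OSD ID keeps coming back for several affected PGs, that is the
one to look at first.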

-- 
May the most significant bit of your life be positive.