Hi,

Has anybody seen PGs stuck inactive like this before? We got our first pool outage:

PG_AVAILABILITY Reduced data availability: 2 pgs inactive
    pg 4.1f1 is stuck inactive for 8637.783533, current state clean+premerge+peered, last acting [312,358,331]
    pg 4.9f1 is stuck inactive for 8637.783331, current state remapped+premerge+backfilling+peered, last acting [312,331,374]

Then we added alerts for PGs sitting in the premerge state for a long time, and got a second outage:

PG_AVAILABILITY Reduced data availability: 2 pgs inactive
    pg 4.1d9 is stuck inactive for 1000.400328, current state remapped+premerge+backfilling+peered, last acting [328,315,352]
    pg 4.9d9 is stuck inactive for 1000.400333, current state remapped+premerge+backfill_wait+peered, last acting [328,315,352]

We have actually reduced pg_num many times since Nautilus; this is the first time the problem has occurred, and only on this one cluster.

Before this, the PG reduction had been running for about a week on this cluster, but it merged only one PG at a time and completely ignored the max_misplaced option - I haven't debugged why yet.

About three hours before the first outage, osd.362 was added to this pool.

Tracker for this: https://tracker.ceph.com/issues/52509

Thanks,
k
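
P.S. For anyone looking into this, a minimal sketch of the commands I would use to inspect the state. The pool name below is a placeholder, and I am assuming "max_misplaced" refers to the mgr option target_max_misplaced_ratio:

    # List all PGs currently stuck inactive (the premerge PGs above show up here)
    ceph pg dump_stuck inactive

    # Query one of the affected PGs for its peering and merge details
    ceph pg 4.1f1 query

    # Throttle that is supposed to limit how much data is misplaced at once
    # while pg_num is being changed (default 0.05)
    ceph config get mgr target_max_misplaced_ratio

    # Current and target pg_num for the pool being merged (pool id 4)
    ceph osd pool get <pool-name> pg_num
    ceph osd pool ls detail | grep "pool 4 "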