Sometimes, but just for a short time ... but yeah.

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------

-----Original Message-----
From: Eugen Block <eblock@xxxxxx>
Sent: Friday, October 1, 2021 2:45 PM
To: ceph-users@xxxxxxx
Subject: Re: dealing with unfound pg in 4:2 ec pool

Hi,

I'm not sure if setting min_size to 4 would also fix the PGs, but client IO would probably be restored. Marking objects as lost is the last resort according to this list; luckily I haven't been in such a situation yet. So give it a try with min_size = 4, but don't forget to increase it again after the PGs have recovered. Keep in mind that if you decrease min_size and lose another OSD, you could face data loss. Are your OSDs still crashing unexpectedly?

Quoting "Szabo, Istvan (Agoda)" <Istvan.Szabo@xxxxxxxxx>:

> Hi,
>
> If I set the min_size of the pool to 4, will this PG be recovered?
> Or how else can I get the cluster out of HEALTH_ERR? Marking the
> objects as lost seems risky based on some mailing list experience
> (even after marking them lost you can still have issues), so I'm
> curious what the right way is to get the cluster out of this state
> and let it recover:
>
> Example problematic PG:
>
> dumped pgs_brief
> PG_STAT  STATE                                                 UP                 UP_PRIMARY  ACTING                               ACTING_PRIMARY
> 28.5b    active+recovery_unfound+undersized+degraded+remapped  [18,33,10,0,48,1]  18          [2147483647,2147483647,29,21,4,47]  29
>
> Cluster state:
>
>   cluster:
>     id:     5a07ec50-4eee-4336-aa11-46ca76edcc24
>     health: HEALTH_ERR
>             10 OSD(s) experiencing BlueFS spillover
>             4/1055070542 objects unfound (0.000%)
>             noout flag(s) set
>             Possible data damage: 2 pgs recovery_unfound
>             Degraded data redundancy: 64150765/6329079237 objects degraded (1.014%), 10 pgs degraded, 26 pgs undersized
>             4 pgs not deep-scrubbed in time
>
>   services:
>     mon: 3 daemons, quorum mon-2s01,mon-2s02,mon-2s03 (age 2M)
>     mgr: mon-2s01(active, since 2M), standbys: mon-2s03, mon-2s02
>     osd: 49 osds: 49 up (since 36m), 49 in (since 4d); 28 remapped pgs
>          flags noout
>     rgw: 3 daemons active (mon-2s01.rgw0, mon-2s02.rgw0, mon-2s03.rgw0)
>
>   task status:
>
>   data:
>     pools:   9 pools, 425 pgs
>     objects: 1.06G objects, 66 TiB
>     usage:   158 TiB used, 465 TiB / 623 TiB avail
>     pgs:     64150765/6329079237 objects degraded (1.014%)
>              38922319/6329079237 objects misplaced (0.615%)
>              4/1055070542 objects unfound (0.000%)
>              393 active+clean
>              13  active+undersized+remapped+backfill_wait
>              8   active+undersized+degraded+remapped+backfill_wait
>              3   active+clean+scrubbing
>              3   active+undersized+remapped+backfilling
>              2   active+recovery_unfound+undersized+degraded+remapped
>              2   active+remapped+backfill_wait
>              1   active+clean+scrubbing+deep
>
>   io:
>     client:   181 MiB/s rd, 9.4 MiB/s wr, 5.38k op/s rd, 2.42k op/s wr
>     recovery: 23 MiB/s, 389 objects/s
>
> Thank you.
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
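
For reference, Eugen's min_size workaround maps onto roughly the commands below. This is a minimal sketch, not verified against this cluster: the thread never names the affected EC pool, so <pool-name> is a placeholder, and the restore value assumes the usual k+1 = 5 default min_size for a 4:2 profile (confirm the pool's actual previous value before changing anything):

    # See what the problematic PG is blocked on and which objects are unfound
    ceph pg 28.5b query
    ceph pg 28.5b list_unfound

    # Note the pool's current min_size before touching it
    ceph osd pool ls detail

    # Temporarily allow IO/recovery with only k=4 shards available.
    # Risky: losing one more OSD while min_size=4 can mean data loss.
    ceph osd pool set <pool-name> min_size 4

    # Once the PGs have recovered, restore the safer value
    # (typically k+1 = 5 for a 4:2 EC profile)
    ceph osd pool set <pool-name> min_size 5

The last resort the thread warns about would be `ceph pg 28.5b mark_unfound_lost delete`; per the Ceph documentation the revert mode is not available for erasure-coded pools, so delete is the only option there, and it permanently discards the unfound objects.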