Re: EC 8+3 Pool PGs stuck in remapped+incomplete

The problem is that some of your OSDs have too many PGs, so the pool cannot
recover: it is not allowed to create more PGs on them.

[osd.214,osd.223,osd.548,osd.584] have slow ops.
            too many PGs per OSD (330 > max 250)

My guess is that the safest fix is to add more storage, permanently or
temporarily, so that the PG count per OSD drops below 250. Another option is
to reduce the total number of PGs, but I would not do that before the pool
is healthy again!
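
As a rough sketch (the pool name and target value below are only
illustrations, adjust them to your own cluster), you could first check how
the PGs are spread across the OSDs and then, once the pool is healthy again,
lower pg_num on the biggest pool, which judging by your lspools output is
probably default.rgw.buckets.data:

    # ceph osd utilization                                   # avg / min / max PGs per OSD
    # ceph osd df                                            # per-OSD view, see the PGS column
    # ceph osd pool get default.rgw.buckets.data pg_num
    # ceph osd pool set default.rgw.buckets.data pg_num <new lower value>

Keep in mind that Nautilus merges PGs gradually when you lower pg_num and
that causes extra data movement, which is why I would only do it on a
healthy cluster.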

If only one OSD carries that many PGs while all the other OSDs are below
100-150, you can simply reweight the problematic OSD so that its excess
"too many PGs" get rebalanced elsewhere.
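
For example (osd.214 is just a placeholder for whichever OSD turns out to be
overloaded, and 0.9 is an arbitrary first step), something along these lines:

    # ceph osd df                      # find the outliers in the PGS column
    # ceph osd reweight 214 0.9        # lower the override weight so some PGs move off osd.214

There are also ceph osd test-reweight-by-pg and ceph osd reweight-by-pg if
you prefer to let the cluster pick the weights, the test- variant being a
dry run.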

But it looks to me like you have far too many PGs overall, and that on its
own is hurting performance badly.

Another option is to raise the maximum allowed PGs per OSD to, say, 350,
which should also let the cluster rebuild. Honestly, even though this may be
the easiest option, I would never do it: the performance cost of running
more than roughly 150 PGs per OSD is severe.
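
If you decide to do it anyway, on Nautilus it is a single config change
(350 is just the example value from above, and the override should be
removed once the PG count per OSD is back down):

    # ceph config set global mon_max_pg_per_osd 350
    # ceph config get mon mon_max_pg_per_osd        # verify
    # ceph config rm global mon_max_pg_per_osd      # revert later

Note that the warning threshold and the point where OSDs actually refuse to
instantiate new PGs are related but not the same (the hard limit is this
value multiplied by osd_max_pg_per_osd_hard_ratio), so raise it deliberately
rather than blindly.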


kind regards,
Nino


On Sat, Jun 17, 2023 at 8:23 AM Jayanth Reddy <jayanthreddy5666@xxxxxxxxx>
wrote:

> Hello Users,
> Greetings. We have a Ceph cluster running
> ceph version 14.2.5-382-g8881d33957
> (8881d33957b54b101eae9c7627b351af10e87ee8) nautilus (stable)
>
> 5 PGs belonging to our RGW 8+3 EC pool are stuck in incomplete and
> remapped+incomplete states. Below are the PGs:
>
> # ceph pg dump_stuck inactive
> ok
> PG_STAT  STATE                UP / UP_PRIMARY  --  ACTING / ACTING_PRIMARY
> 15.251e  incomplete
>          up:     [151,464,146,503,166,41,555,542,9,565,268] / 151
>          acting: [151,464,146,503,166,41,555,542,9,565,268] / 151
> 15.3f3   incomplete
>          up:     [584,281,672,699,199,224,239,430,355,504,196] / 584
>          acting: [584,281,672,699,199,224,239,430,355,504,196] / 584
> 15.985   remapped+incomplete
>          up:     [396,690,493,214,319,209,546,91,599,237,352] / 396
>          acting: [2147483647,2147483647,2147483647,214,319,2147483647,546,91,599,2147483647,352] / 214
> 15.39d3  remapped+incomplete
>          up:     [404,221,223,585,38,102,533,471,568,451,195] / 404
>          acting: [2147483647,2147483647,223,585,38,102,533,2147483647,231,451,2147483647] / 223
> 15.d46   remapped+incomplete
>          up:     [297,646,212,254,110,169,500,372,623,470,678] / 297
>          acting: [2147483647,548,2147483647,2147483647,110,169,500,372,2147483647,470,678] / 548
>
> Some of the OSDs had gone down on the cluster. Below is the # ceph status
>
> # ceph -s
>   cluster:
>     id:     30d6f7ee-fa02-4ab3-8a09-9321c8002794
>     health: HEALTH_WARN
>             noscrub,nodeep-scrub flag(s) set
>             1 pools have many more objects per pg than average
>             Reduced data availability: 5 pgs inactive, 5 pgs incomplete
>             Degraded data redundancy: 44798/8718528059 objects degraded
> (0.001%), 1 pg degraded, 1 pg undersized
>             22726 pgs not deep-scrubbed in time
>             23552 pgs not scrubbed in time
>             77 slow ops, oldest one blocked for 56400 sec, daemons
> [osd.214,osd.223,osd.548,osd.584] have slow ops.
>             too many PGs per OSD (330 > max 250)
>
>   services:
>     mon: 3 daemons, quorum brc1mon2,brc1mon3,brc1mon1 (age 2y)
>     mgr: brc1mon2(active, since 8d), standbys: brc1mon1, brc1mon3
>     mds: cephfs:1 {0=brc1mds2=up:active} 1 up:standby
>     osd: 1012 osds: 698 up (since 14h), 698 in (since 2d); 3 remapped pgs
>          flags noscrub,nodeep-scrub
>     rgw: 2 daemons active (brc1rgw1, brc1rgw2)
>
>   data:
>     pools:   17 pools, 23552 pgs
>     objects: 863.74M objects, 1.2 PiB
>     usage:   2.4 PiB used, 6.2 PiB / 8.6 PiB avail
>     pgs:     0.021% pgs not active
>              44798/8718528059 objects degraded (0.001%)
>              23546 active+clean
>              3     remapped+incomplete
>              2     incomplete
>              1     active+undersized+degraded
>
>   io:
>     client:   24 MiB/s rd, 3.2 KiB/s wr, 56 op/s rd, 4 op/s wr
>
> And the health detail shows:
>
> # ceph health detail
> HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 1 pools have many more
> objects per pg than average; Reduced data availability: 5 pgs inactive, 5
> pgs incomplete; Degraded data redundancy: 44798/8718528081 objects degraded
> (0.001%), 1 pg degraded, 1 pg undersized; 22726 pgs not deep-scrubbed in
> time; 23552 pgs not scrubbed in time; 77 slow ops, oldest one blocked for
> 56440 sec, daemons [osd.214,osd.223,osd.548,osd.584] have slow ops.; too
> many PGs per OSD (330 > max 250)
> OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
> MANY_OBJECTS_PER_PG 1 pools have many more objects per pg than average
>     pool iscsi-images objects per pg (540004) is more than 14.7248 times
> cluster average (36673)
> PG_AVAILABILITY Reduced data availability: 5 pgs inactive, 5 pgs incomplete
>     pg 15.3f3 is incomplete, acting
> [584,281,672,699,199,224,239,430,355,504,196] (reducing pool
> default.rgw.buckets.data min_size from 9 may help; search ceph.com/docs
> for
> 'incomplete')
>     pg 15.985 is remapped+incomplete, acting
>
> [2147483647,2147483647,2147483647,214,319,2147483647,546,91,599,2147483647,352]
> (reducing pool default.rgw.buckets.data min_size from 9 may help; search
> ceph.com/docs for 'incomplete')
>     pg 15.d46 is remapped+incomplete, acting
> [2147483647,548,2147483647,2147483647,110,169,500,372,2147483647,470,678]
> (reducing pool default.rgw.buckets.data min_size from 9 may help; search
> ceph.com/docs for 'incomplete')
>     pg 15.251e is incomplete, acting
> [151,464,146,503,166,41,555,542,9,565,268] (reducing pool
> default.rgw.buckets.data min_size from 9 may help; search ceph.com/docs
> for
> 'incomplete')
>     pg 15.39d3 is remapped+incomplete, acting
> [2147483647,2147483647,223,585,38,102,533,2147483647,231,451,2147483647]
> (reducing pool default.rgw.buckets.data min_size from 9 may help; search
> ceph.com/docs for 'incomplete')
> PG_DEGRADED Degraded data redundancy: 44798/8718528081 objects degraded
> (0.001%), 1 pg degraded, 1 pg undersized
>     pg 15.28f0 is stuck undersized for 67359238.592403, current state
> active+undersized+degraded, last acting
> [2147483647,343,355,415,426,640,302,392,78,202,607]
> PG_NOT_DEEP_SCRUBBED 22726 pgs not deep-scrubbed in time
>
> We have the following pools:
>
> # ceph osd lspools
> 1 iscsi-images
> 2 cephfs_data
> 3 cephfs_metadata
> 4 .rgw.root
> 5 default.rgw.control
> 6 default.rgw.meta
> 7 default.rgw.log
> 8 default.rgw.buckets.index
> 13 rbd
> 15 default.rgw.buckets.data
> 16 default.rgw.buckets.non-ec
> 19 cephfs_data-ec
> 22 rbd-ec
> 23 iscsi-images-ec
> 24 hpecpool
> 25 hpec.rgw.buckets.index
> 26 hpec.rgw.buckets.non-ec
>
>
> We've been struggling for a long time to fix this, with no luck. Our RGW
> daemons, hosted on dedicated machines behind a load balancer, continuously
> fail to respond, so the LB returns 504 Gateway Timeout when the daemons do
> not answer within the expected time. The active health checks we run from
> the LB (HTTP HEAD on '/') are also failing very frequently. Currently we're
> surviving with a script that restarts the RGW daemons whenever the LB
> responds with HTTP status code 504. Any help is highly appreciated!
>
> Regards,
> Jayanth Reddy
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



