Thanks, Nino. Will give these initial suggestions a try and let you know at the earliest.

Regards,
Jayanth Reddy

________________________________
From: Nino Kotur <ninokotur@xxxxxxxxx>
Sent: Saturday, June 17, 2023 12:16:09 PM
To: Jayanth Reddy <jayanthreddy5666@xxxxxxxxx>
Cc: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: Re: EC 8+3 Pool PGs stuck in remapped+incomplete

The problem is simply that some of your OSDs hold too many PGs, and the pool cannot recover because it cannot create more PGs:

    [osd.214,osd.223,osd.548,osd.584] have slow ops.
    too many PGs per OSD (330 > max 250)

My guess is that the safest option is to add more storage, permanently or temporarily, so that the PG count per OSD drops below 250. Another option is to reduce the total number of PGs, but I would not perform that action before the pool is healthy!

If only one OSD carries this many PGs while all the other OSDs hold fewer than 100-150, you can simply reweight the problematic OSD so its "too many" PGs rebalance elsewhere. But it looks to me like you have far too many PGs overall, which also hurts performance badly.

Another option is to increase the maximum allowed PGs per OSD to, say, 350; this should also let the cluster rebuild. Honestly, even though this may be the easiest option, I would never do it: performance suffers greatly with more than 150 PGs per OSD.

kind regards,
Nino
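[For reference, some of the options Nino describes map roughly to the commands below. This is only a sketch; the OSD id and weight are placeholders (osd.214 is taken from the slow-ops list above), not a recommendation for this particular cluster.]

To see how many PGs each OSD currently holds (PGS column):

# ceph osd df tree

To reweight a single overloaded OSD so part of its PGs move elsewhere:

# ceph osd reweight 214 0.90

To raise the per-OSD PG limit to 350 as a last resort, reverting once the pool is healthy:

# ceph config set global mon_max_pg_per_osd 350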
On Sat, Jun 17, 2023 at 8:23 AM Jayanth Reddy <jayanthreddy5666@xxxxxxxxx> wrote:

Hello Users,

Greetings. We have a Ceph cluster running ceph version 14.2.5-382-g8881d33957 (8881d33957b54b101eae9c7627b351af10e87ee8) nautilus (stable).

5 PGs belonging to our RGW 8+3 EC pool are stuck in the incomplete and remapped+incomplete states. Below are the PGs:

# ceph pg dump_stuck inactive
ok
PG_STAT STATE               UP                                             UP_PRIMARY ACTING                                                                           ACTING_PRIMARY
15.251e incomplete          [151,464,146,503,166,41,555,542,9,565,268]     151        [151,464,146,503,166,41,555,542,9,565,268]                                       151
15.3f3  incomplete          [584,281,672,699,199,224,239,430,355,504,196]  584        [584,281,672,699,199,224,239,430,355,504,196]                                    584
15.985  remapped+incomplete [396,690,493,214,319,209,546,91,599,237,352]   396        [2147483647,2147483647,2147483647,214,319,2147483647,546,91,599,2147483647,352] 214
15.39d3 remapped+incomplete [404,221,223,585,38,102,533,471,568,451,195]   404        [2147483647,2147483647,223,585,38,102,533,2147483647,231,451,2147483647]        223
15.d46  remapped+incomplete [297,646,212,254,110,169,500,372,623,470,678]  297        [2147483647,548,2147483647,2147483647,110,169,500,372,2147483647,470,678]       548

Some of the OSDs had gone down on the cluster. Below is the output of ceph status:

# ceph -s
  cluster:
    id:     30d6f7ee-fa02-4ab3-8a09-9321c8002794
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set
            1 pools have many more objects per pg than average
            Reduced data availability: 5 pgs inactive, 5 pgs incomplete
            Degraded data redundancy: 44798/8718528059 objects degraded (0.001%), 1 pg degraded, 1 pg undersized
            22726 pgs not deep-scrubbed in time
            23552 pgs not scrubbed in time
            77 slow ops, oldest one blocked for 56400 sec, daemons [osd.214,osd.223,osd.548,osd.584] have slow ops.
            too many PGs per OSD (330 > max 250)

  services:
    mon: 3 daemons, quorum brc1mon2,brc1mon3,brc1mon1 (age 2y)
    mgr: brc1mon2(active, since 8d), standbys: brc1mon1, brc1mon3
    mds: cephfs:1 {0=brc1mds2=up:active} 1 up:standby
    osd: 1012 osds: 698 up (since 14h), 698 in (since 2d); 3 remapped pgs
         flags noscrub,nodeep-scrub
    rgw: 2 daemons active (brc1rgw1, brc1rgw2)

  data:
    pools:   17 pools, 23552 pgs
    objects: 863.74M objects, 1.2 PiB
    usage:   2.4 PiB used, 6.2 PiB / 8.6 PiB avail
    pgs:     0.021% pgs not active
             44798/8718528059 objects degraded (0.001%)
             23546 active+clean
             3     remapped+incomplete
             2     incomplete
             1     active+undersized+degraded

  io:
    client:  24 MiB/s rd, 3.2 KiB/s wr, 56 op/s rd, 4 op/s wr

And the health detail shows:

# ceph health detail
HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 1 pools have many more objects per pg than average; Reduced data availability: 5 pgs inactive, 5 pgs incomplete; Degraded data redundancy: 44798/8718528081 objects degraded (0.001%), 1 pg degraded, 1 pg undersized; 22726 pgs not deep-scrubbed in time; 23552 pgs not scrubbed in time; 77 slow ops, oldest one blocked for 56440 sec, daemons [osd.214,osd.223,osd.548,osd.584] have slow ops.; too many PGs per OSD (330 > max 250)
OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
MANY_OBJECTS_PER_PG 1 pools have many more objects per pg than average
    pool iscsi-images objects per pg (540004) is more than 14.7248 times cluster average (36673)
PG_AVAILABILITY Reduced data availability: 5 pgs inactive, 5 pgs incomplete
    pg 15.3f3 is incomplete, acting [584,281,672,699,199,224,239,430,355,504,196] (reducing pool default.rgw.buckets.data min_size from 9 may help; search ceph.com/docs for 'incomplete')
    pg 15.985 is remapped+incomplete, acting [2147483647,2147483647,2147483647,214,319,2147483647,546,91,599,2147483647,352] (reducing pool default.rgw.buckets.data min_size from 9 may help; search ceph.com/docs for 'incomplete')
    pg 15.d46 is remapped+incomplete, acting [2147483647,548,2147483647,2147483647,110,169,500,372,2147483647,470,678] (reducing pool default.rgw.buckets.data min_size from 9 may help; search ceph.com/docs for 'incomplete')
    pg 15.251e is incomplete, acting [151,464,146,503,166,41,555,542,9,565,268] (reducing pool default.rgw.buckets.data min_size from 9 may help; search ceph.com/docs for 'incomplete')
    pg 15.39d3 is remapped+incomplete, acting [2147483647,2147483647,223,585,38,102,533,2147483647,231,451,2147483647] (reducing pool default.rgw.buckets.data min_size from 9 may help; search ceph.com/docs for 'incomplete')
PG_DEGRADED Degraded data redundancy: 44798/8718528081 objects degraded (0.001%), 1 pg degraded, 1 pg undersized
    pg 15.28f0 is stuck undersized for 67359238.592403, current state active+undersized+degraded, last acting [2147483647,343,355,415,426,640,302,392,78,202,607]
PG_NOT_DEEP_SCRUBBED 22726 pgs not deep-scrubbed in time

We have the pools as below:

# ceph osd lspools
1 iscsi-images
2 cephfs_data
3 cephfs_metadata
4 .rgw.root
5 default.rgw.control
6 default.rgw.meta
7 default.rgw.log
8 default.rgw.buckets.index
13 rbd
15 default.rgw.buckets.data
16 default.rgw.buckets.non-ec
19 cephfs_data-ec
22 rbd-ec
23 iscsi-images-ec
24 hpecpool
25 hpec.rgw.buckets.index
26 hpec.rgw.buckets.non-ec

We have been struggling for a long time to fix this, but without luck!
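[For reference, the "reducing pool default.rgw.buckets.data min_size from 9 may help" hint in the health detail above would look roughly like the commands below. On an 8+3 EC pool, min_size 9 is k+1; dropping it to 8 (= k) lets PGs go active with no spare shards, so it should be treated as a temporary, last-resort step and reverted as soon as the PGs are active+clean again. Whether it helps here depends on how many shards of each incomplete PG actually survive.]

# ceph osd pool get default.rgw.buckets.data min_size
# ceph osd pool set default.rgw.buckets.data min_size 8

Once recovery completes:

# ceph osd pool set default.rgw.buckets.data min_size 9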
Our RGW daemons, hosted on dedicated machines, continuously fail to respond. They sit behind a load balancer, and the LB returns 504 Gateway Timeout because the daemons do not respond within the expected time. We perform active health checks from the LB on '/' via HTTP HEAD, but these fail as well, very frequently. Currently we are surviving with a script that restarts the RGW daemons whenever the LB responds with HTTP status code 504.

Any help is highly appreciated!

Regards,
Jayanth Reddy
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx