Hi,
looks like you are running into the PG overdose protection of Luminous (you have more than 200 PGs per OSD): try increasing mon_max_pg_per_osd on the monitors to 300 or so to temporarily resolve this.
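For example, something along these lines (untested against your cluster; 300 is just an example value, and some monitor options only take full effect after a mon restart):

# apply at runtime on all monitors
ceph tell mon.\* injectargs '--mon_max_pg_per_osd=300'

# and persist it in ceph.conf on the monitor hosts
[global]
mon_max_pg_per_osd = 300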
Paul
2018-06-05 9:40 GMT+02:00 Olivier Bonvalet <ceph.list@xxxxxxxxx>:
Some more information: the cluster was just upgraded from Jewel to
Luminous.
# ceph pg dump | egrep '(stale|creating)'
dumped all
15.32 10947 0 0 0 0 45870301184 3067 3067 stale+active+clean 2018-06-04 09:20:42.594317 387644'251008 437722:754803 [48,31,45] 48 [48,31,45] 48 213014'224196 2018-04-22 02:01:09.148152 200181'219150 2018-04-14 14:40:13.116285 0
19.77 4131 0 0 0 0 17326669824 3076 3076 stale+down 2018-06-05 07:28:33.968860 394478'58307 438699:736881 [NONE,20,76] 20 [NONE,20,76] 20 273736'49495 2018-05-17 01:05:35.523735 273736'49495 2018-05-17 01:05:35.523735 0
13.76 10730 0 0 0 0 44127133696 3011 3011 stale+down 2018-06-05 07:30:27.578512 397231'457143 438813:4600135 [NONE,21,76] 21 [NONE,21,76] 21 286462'438402 2018-05-20 18:06:12.443141 286462'438402 2018-05-20 18:06:12.443141 0
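For reference, the current CRUSH mapping of those PGs can also be checked directly (PG ids taken from the dump above), e.g.:

# where do these PGs currently map to?
ceph pg map 19.77
ceph pg map 13.76

# list only the stuck/stale PGs instead of grepping the full dump
ceph pg dump_stuck stale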
On Tuesday, June 5, 2018 at 09:25 +0200, Olivier Bonvalet wrote:
> Hi,
>
> I have a cluster in a "stale" state: a lot of RBDs have been blocked for
> ~10 hours. In the status I see PGs in stale or down state, but those
> PGs don't seem to exist anymore:
>
> root! stor00-sbg:~# ceph health detail | egrep '(stale|down)'
> HEALTH_ERR noout,noscrub,nodeep-scrub flag(s) set; 1 nearfull osd(s);
> 16 pool(s) nearfull; 4645278/103969515 objects misplaced (4.468%);
> Reduced data availability: 643 pgs inactive, 12 pgs down, 2 pgs
> peering, 3 pgs stale; Degraded data redundancy: 2723173/103969515
> objects degraded (2.619%), 387 pgs degraded, 297 pgs undersized; 229
> slow requests are blocked > 32 sec; 4074 stuck requests are blocked >
> 4096 sec; too many PGs per OSD (202 > max 200); mons hyp01-sbg,hyp02-
> sbg,hyp03-sbg are using a lot of disk space
> PG_AVAILABILITY Reduced data availability: 643 pgs inactive, 12 pgs
> down, 2 pgs peering, 3 pgs stale
> pg 31.8b is down, acting [2147483647,16,36]
> pg 31.8e is down, acting [2147483647,29,19]
> pg 46.b8 is down, acting [2147483647,2147483647,13,17,47,28]
>
> root! stor00-sbg:~# ceph pg 31.8b query
> Error ENOENT: i don't have pgid 31.8b
>
> root! stor00-sbg:~# ceph pg 31.8e query
> Error ENOENT: i don't have pgid 31.8e
>
> root! stor00-sbg:~# ceph pg 46.b8 query
> Error ENOENT: i don't have pgid 46.b8
>
>
> We just lost an HDD, and marked the corresponding OSD as "lost".
>
> Any idea what I should do?
>
> Thanks,
>
> Olivier
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90