Hi,

Good point! Changing this value, *and* restarting ceph-mgr, fixes the issue.
Now we have to find a way to reduce the PG count.
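For the archives, here is roughly the procedure involved (a sketch, assuming a systemd-based deployment and the Luminous injectargs syntax; not a verbatim record of what we ran):

# ceph tell mon.* injectargs '--mon_max_pg_per_osd 300'
# systemctl restart ceph-mgr.target
# ceph osd df tree

The first command raises the limit at runtime on all monitors; to keep it across restarts the same option also has to go into ceph.conf (mon_max_pg_per_osd = 300 under [global]). The second restarts the manager daemons on the host so they pick up the new value, and the last one shows how many PGs each OSD currently carries (the PGS column). As far as I know pg_num cannot be decreased on Luminous, so really reducing the PG count will mean creating new pools with fewer PGs and migrating data into them.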
Thanks Paul!

Olivier

On Tuesday, 5 June 2018 at 10:39 +0200, Paul Emmerich wrote:
> Hi,
> 
> looks like you are running into the PG overdose protection of
> Luminous (you got > 200 PGs per OSD): try to increase
> mon_max_pg_per_osd on the monitors to 300 or so to temporarily
> resolve this.
> 
> Paul
> 
> 2018-06-05 9:40 GMT+02:00 Olivier Bonvalet <ceph.list@xxxxxxxxx>:
> > Some more information: the cluster was just upgraded from Jewel to
> > Luminous.
> > 
> > # ceph pg dump | egrep '(stale|creating)'
> > dumped all
> > 15.32  10947  0  0  0  0  45870301184  3067  3067  stale+active+clean  2018-06-04 09:20:42.594317  387644'251008  437722:754803  [48,31,45]  48  [48,31,45]  48  213014'224196  2018-04-22 02:01:09.148152  200181'219150  2018-04-14 14:40:13.116285  0
> > 19.77  4131  0  0  0  0  17326669824  3076  3076  stale+down  2018-06-05 07:28:33.968860  394478'58307  438699:736881  [NONE,20,76]  20  [NONE,20,76]  20  273736'49495  2018-05-17 01:05:35.523735  273736'49495  2018-05-17 01:05:35.523735  0
> > 13.76  10730  0  0  0  0  44127133696  3011  3011  stale+down  2018-06-05 07:30:27.578512  397231'457143  438813:4600135  [NONE,21,76]  21  [NONE,21,76]  21  286462'438402  2018-05-20 18:06:12.443141  286462'438402  2018-05-20 18:06:12.443141  0
> > 
> > On Tuesday, 5 June 2018 at 09:25 +0200, Olivier Bonvalet wrote:
> > > Hi,
> > > 
> > > I have a cluster in "stale" state: a lot of RBD images have been
> > > blocked for ~10 hours. In the status I see PGs in stale or down
> > > state, but those PGs don't seem to exist anymore:
> > > 
> > > root! stor00-sbg:~# ceph health detail | egrep '(stale|down)'
> > > HEALTH_ERR noout,noscrub,nodeep-scrub flag(s) set; 1 nearfull osd(s); 16 pool(s) nearfull; 4645278/103969515 objects misplaced (4.468%); Reduced data availability: 643 pgs inactive, 12 pgs down, 2 pgs peering, 3 pgs stale; Degraded data redundancy: 2723173/103969515 objects degraded (2.619%), 387 pgs degraded, 297 pgs undersized; 229 slow requests are blocked > 32 sec; 4074 stuck requests are blocked > 4096 sec; too many PGs per OSD (202 > max 200); mons hyp01-sbg,hyp02-sbg,hyp03-sbg are using a lot of disk space
> > > PG_AVAILABILITY Reduced data availability: 643 pgs inactive, 12 pgs down, 2 pgs peering, 3 pgs stale
> > >     pg 31.8b is down, acting [2147483647,16,36]
> > >     pg 31.8e is down, acting [2147483647,29,19]
> > >     pg 46.b8 is down, acting [2147483647,2147483647,13,17,47,28]
> > > 
> > > root! stor00-sbg:~# ceph pg 31.8b query
> > > Error ENOENT: i don't have pgid 31.8b
> > > 
> > > root! stor00-sbg:~# ceph pg 31.8e query
> > > Error ENOENT: i don't have pgid 31.8e
> > > 
> > > root! stor00-sbg:~# ceph pg 46.b8 query
> > > Error ENOENT: i don't have pgid 46.b8
> > > 
> > > We just lost an HDD, and marked the corresponding OSD as "lost".
> > > 
> > > Any idea what I should do?
> > > 
> > > Thanks,
> > > 
> > > Olivier

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com