Hi Gregory,
thanks for the answer!

I have looked at which storage nodes are missing, and it's two different ones:

pg 22.240 is stuck undersized for 24437.862139, current state
active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
pg 22.240 is stuck undersized for 24437.862139, current state
active+undersized+degraded, last acting
[ceph-04,ceph-07,ceph-02,ceph-06,2147483647,ceph-01,ceph-05]
ceph-03 is missing

pg 22.3e5 is stuck undersized for 24437.860025, current state
active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
pg 22.3e5 is stuck undersized for 24437.860025, current state
active+undersized+degraded, last acting
[ceph-06,ceph-02,ceph-07,ceph-01,ceph-05,ceph-03,2147483647]
ceph-04 is missing

Perhaps I hit a PGs-per-OSD maximum?! I looked with the script from
http://cephnotes.ksperis.com/blog/2015/02/23/get-the-number-of-placement-groups-per-osd

pool :     17    18    19     9    10    20    21    13    22    23    16 | SUM
--------------------------------------------------------------------------------
...
host ceph-03:
osd.24      0    12     2     2     4    76    16     5    74     0    66 | 257
osd.25      0    17     3     4     4    89    16     4    82     0    60 | 279
osd.26      0    20     2     5     3    71    12     5    81     0    61 | 260
osd.27      0    18     2     4     3    73    21     3    76     0    61 | 261
osd.28      0    14     2     9     4    73    23     9    94     0    64 | 292
osd.29      0    19     3     3     4    54    25     4    89     0    62 | 263
osd.30      0    22     2     6     3    80    15     6    92     0    47 | 273
osd.31      0    25     4     2     3    87    20     3    76     0    62 | 282
osd.32      0    13     4     2     2    64    14     1    82     0    69 | 251
osd.33      0    12     2     5     5    89    25     7    83     0    68 | 296
osd.34      0    28     0     8     5    81    18     3    99     0    65 | 307
osd.35      0    17     3     2     4    74    21     3    95     0    58 | 277
host ceph-04:
osd.36      0    13     1     9     6    72    17     5    93     0    56 | 272
osd.37      0    21     2     5     6    83    20     4    78     0    71 | 290
osd.38      0    17     3     2     5    64    22     7    76     0    57 | 253
osd.39      0    21     3     7     6    79    27     4    80     0    68 | 295
osd.40      0    15     1     5     7    71    17     6    93     0    74 | 289
osd.41      0    16     5     5     6    76    18     6    95     0    70 | 297
osd.42      0    13     0     6     1    71    25     4    83     0    56 | 259
osd.43      0    20     2     2     6    81    23     4    89     0    59 | 286
osd.44      0    21     2     5     6    77     9     5    76     0    52 | 253
osd.45      0    11     4     8     3    76    24     6    82     0    49 | 263
osd.46      0    17     2     5     6    57    15     4    84     0    62 | 252
osd.47      0    19     3     2     3    84    19     5    94     0    48 | 277
...
--------------------------------------------------------------------------------
SUM :     768  1536   192   384   384  6144  1536   384  7168    24  5120 |

Pool 22 is the new ec7archiv, but on ceph-04 there is no OSD with more
than 300 PGs...
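To rule that out I can ask CRUSH directly whether it is able to fill all
seven slots for the ec7archiv rule. Only a rough sketch of such a check - the
rule id 1 below is just my assumption, the real id would come from
"ceph osd crush rule dump":

ceph osd getcrushmap -o crushmap.bin
# simulate the rule for 1024 PG inputs and list every mapping that comes
# back with an empty slot (shown as -1 / 2147483647)
crushtool -i crushmap.bin --test --rule 1 --num-rep 7 \
    --min-x 0 --max-x 1023 --show-bad-mappings

If that prints bad mappings, CRUSH is giving up before it finds a seventh
host and I am not running into a PGs-per-OSD limit.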
Udo

On 25.03.2015 at 14:52, Gregory Farnum wrote:
> On Wed, Mar 25, 2015 at 1:20 AM, Udo Lembke <ulembke@xxxxxxxxxxxx> wrote:
>> Hi,
>> due to two more hosts (now 7 storage nodes) I want to create a new
>> ec-pool and get a strange effect:
>>
>> ceph@admin:~$ ceph health detail
>> HEALTH_WARN 2 pgs degraded; 2 pgs stuck degraded; 2 pgs stuck unclean; 2
>> pgs stuck undersized; 2 pgs undersized
>
> This is the big clue: you have two undersized PGs!
>
>> pg 22.3e5 is stuck unclean since forever, current state
>> active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
>
> 2147483647 is the largest number you can represent in a signed 32-bit
> integer. There's an output error of some kind which is fixed
> elsewhere; this should be "-1".
>
> So for whatever reason (in general it's hard on CRUSH trying to select
> N entries out of N choices), CRUSH hasn't been able to map an OSD to
> this slot for you. You'll want to figure out why that is and fix it.
> -Greg
>
>> pg 22.240 is stuck unclean since forever, current state
>> active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
>> pg 22.3e5 is stuck undersized for 406.614447, current state
>> active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
>> pg 22.240 is stuck undersized for 406.616563, current state
>> active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
>> pg 22.3e5 is stuck degraded for 406.614566, current state
>> active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
>> pg 22.240 is stuck degraded for 406.616679, current state
>> active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
>> pg 22.3e5 is active+undersized+degraded, acting
>> [76,15,82,11,57,29,2147483647]
>> pg 22.240 is active+undersized+degraded, acting
>> [38,85,17,74,2147483647,10,58]
>>
>> But I have only 91 OSDs (84 SATA + 7 SSDs), not 2147483647!
>> Where the heck did the 2147483647 come from?
>>
>> I ran the following commands:
>> ceph osd erasure-code-profile set 7hostprofile k=5 m=2
>> ruleset-failure-domain=host
>> ceph osd pool create ec7archiv 1024 1024 erasure 7hostprofile
>>
>> my version:
>> ceph -v
>> ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
>>
>> I found an issue in my crush map - one SSD was in the map twice:
>> host ceph-061-ssd {
>>         id -16          # do not change unnecessarily
>>         # weight 0.000
>>         alg straw
>>         hash 0  # rjenkins1
>> }
>> root ssd {
>>         id -13          # do not change unnecessarily
>>         # weight 0.780
>>         alg straw
>>         hash 0  # rjenkins1
>>         item ceph-01-ssd weight 0.170
>>         item ceph-02-ssd weight 0.170
>>         item ceph-03-ssd weight 0.000
>>         item ceph-04-ssd weight 0.170
>>         item ceph-05-ssd weight 0.170
>>         item ceph-06-ssd weight 0.050
>>         item ceph-07-ssd weight 0.050
>>         item ceph-061-ssd weight 0.000
>> }
>>
>> Host ceph-061-ssd doesn't exist and osd.61 is the SSD from ceph-03-ssd,
>> but after fixing the crushmap the issue with the osd 2147483647 still exists.
>>
>> Any idea how to fix that?
>>
>> regards
>>
>> Udo
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
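A note on the likely fix: with k+m = 7 chunks spread over exactly 7 hosts,
CRUSH can give up before it finds a distinct host for the last slot. The
commonly suggested remedy is to raise the retry count in the erasure-code
rule. This is only a sketch, assuming the pool's rule is the one named
ec7archiv and has rule id 1 - both should be confirmed with
"ceph osd crush rule dump":

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# in the ec7archiv rule (assumed name), directly after the line
#   step set_chooseleaf_tries 5
# add a larger overall retry budget:
#   step set_choose_tries 100
crushtool -c crushmap.txt -o crushmap.new
crushtool -i crushmap.new --test --rule 1 --num-rep 7 --show-bad-mappings
ceph osd setcrushmap -i crushmap.new

If the crushtool test no longer reports bad mappings, the two PGs should peer
with a full acting set once the new map is injected.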