I'm dealing with a situation in which the placement groups in an EC pool are stuck. The EC pool is configured as 6+2 (pool 15) with a host failure domain. One of the nodes in the cluster was torn down and recreated, with its OSDs marked as lost and then rebuilt from scratch. After the node had its OSDs rebuilt, 4 PGs are stuck, with the infamous NONE (2147483647, i.e. CRUSH_ITEM_NONE - CRUSH failed to fill that slot) for 2 of the 8 acting OSDs.

As it currently stands, there are 9 nodes containing 11 OSDs each. One of the nodes has all of its OSDs marked as out, and one additional OSD (osd.93) is also marked as out of the cluster (which is confusing, because for 15.7fb this OSD is the acting primary).

The cluster was on firefly when all of this started, but was upgraded to hammer (v0.94.10) in the hope that the hammer crush tunables might bring some improvement. The choose_tries for the EC ruleset within the crushmap was increased to 100, and crushtool testing didn't show any issues with mapping the OSDs into placement groups. The latest osdmap was extracted from one of the monitors, and osdmaptool --test-map-pg 15.7fb osdmap.bin shows holes in the acting set as well. Testing with all OSDs marked as up and in (--mark-up-in) still produced the NONE holes. The only thing that removed the holes was also removing the pg_temp entries from the mapping. For concreteness, sketches of the pool setup and of the crushtool/osdmaptool test workflows are below.

It's not really clear how objects became unfound, as the loss of a single node with this configuration shouldn't lose objects. Any help and feedback is appreciated.
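For context, the pool was created from a 6+2 erasure code profile along roughly these lines (the profile/pool names and pg counts here are illustrative placeholders, not the actual ones, and hammer-era syntax with ruleset-failure-domain rather than the later crush-failure-domain is assumed):

    # 6 data chunks + 2 coding chunks, one chunk per host
    ceph osd erasure-code-profile set ec-6-2 k=6 m=2 ruleset-failure-domain=host
    ceph osd pool create ecpool 2048 2048 erasure ec-6-2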
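The choose_tries bump and the crushtool mapping test were done roughly like this (the ruleset id passed to --rule is a placeholder; substitute the EC pool's actual ruleset):

    # pull the current crushmap out of the cluster and decompile it
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # in the EC rule in crushmap.txt, the retry budget was raised:
    #     step set_choose_tries 100

    # recompile, then look for PGs that fail to map a full set of 8 OSDs
    crushtool -c crushmap.txt -o crushmap.new
    crushtool -i crushmap.new --test --rule 1 --num-rep 8 --show-bad-mappings

    # inject the updated map into the cluster
    ceph osd setcrushmap -i crushmap.new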
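And the osdmaptool runs that show the holes, using 15.7fb as the example:

    # osdmap.bin is the latest map, pulled from one of the mons
    ceph osd getmap -o osdmap.bin

    # acting set comes back with 2147483647 (NONE) in two slots
    osdmaptool --test-map-pg 15.7fb osdmap.bin

    # same holes even when every osd is treated as up and in
    osdmaptool --test-map-pg 15.7fb --mark-up-in osdmap.bin

    # only dropping pg_temp as well makes the holes go away
    # (--clear-temp strips pg_temp/primary_temp before the test, if your
    # osdmaptool build has it)
    osdmaptool --test-map-pg 15.7fb --mark-up-in --clear-temp osdmap.bin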
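For the unfound objects, the full pg query output for each stuck PG is linked below; the individual missing objects can also be enumerated per PG if that helps, e.g.:

    # list the unfound/missing objects for one of the stuck PGs
    ceph pg 15.7fb list_missing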
Here's the ceph health detail output; additional output for some states is included in the pastebins below:

HEALTH_ERR 4 pgs degraded; 3 pgs inconsistent; 3 pgs recovering; 3 pgs stuck degraded; 4 pgs stuck unclean; 3 pgs stuck undersized; 4 pgs undersized; 100 requests are blocked > 32 sec; 1 osds have slow requests; recovery 352034/722671544 objects degraded (0.049%); recovery 1404947/722671544 objects misplaced (0.194%); recovery 192/90258551 unfound (0.000%); 7 scrub errors; too many PGs per OSD (753 > max 300); noout flag(s) set
pg 15.7fb is stuck unclean for 1745750.479304, current state active+recovering+undersized+degraded+remapped, last acting [93,12,2147483647,7,39,80,75,2147483647]
pg 15.38a is stuck unclean for 1747276.253098, current state active+undersized+degraded+remapped, last acting [2147483647,95,39,80,29,8,73,2147483647]
pg 15.ee is stuck unclean for 1745723.882213, current state active+recovering+undersized+degraded+remapped, last acting [2147483647,20,93,80,2147483647,39,15,69]
pg 15.33c is stuck unclean for 1613257.331259, current state active+recovering+undersized+degraded+remapped, last acting [38,80,2147483647,2147483647,92,69,26,39]
pg 15.7fb is stuck undersized for 48918.444257, current state active+recovering+undersized+degraded+remapped, last acting [93,12,2147483647,7,39,80,75,2147483647]
pg 15.ee is stuck undersized for 48933.042271, current state active+recovering+undersized+degraded+remapped, last acting [2147483647,20,93,80,2147483647,39,15,69]
pg 15.33c is stuck undersized for 48990.546803, current state active+recovering+undersized+degraded+remapped, last acting [38,80,2147483647,2147483647,92,69,26,39]
pg 15.7fb is stuck degraded for 48918.445037, current state active+recovering+undersized+degraded+remapped, last acting [93,12,2147483647,7,39,80,75,2147483647]
pg 15.ee is stuck degraded for 48933.043052, current state active+recovering+undersized+degraded+remapped, last acting [2147483647,20,93,80,2147483647,39,15,69]
pg 15.33c is stuck degraded for 48990.547584, current state active+recovering+undersized+degraded+remapped, last acting [38,80,2147483647,2147483647,92,69,26,39]
pg 15.7fb is active+recovering+undersized+degraded+remapped, acting [93,12,2147483647,7,39,80,75,2147483647], 88 unfound
pg 15.7dd is active+clean+inconsistent, acting [94,83,78,25,6,55,51,9]
pg 15.639 is active+clean+inconsistent, acting [50,10,77,95,57,80,23,29]
pg 15.38a is active+undersized+degraded+remapped, acting [2147483647,95,39,80,29,8,73,2147483647], 27 unfound
pg 15.33c is active+recovering+undersized+degraded+remapped, acting [38,80,2147483647,2147483647,92,69,26,39], 53 unfound
pg 15.2c0 is active+clean+inconsistent, acting [14,98,36,70,53,65,88,42]
pg 15.ee is active+recovering+undersized+degraded+remapped, acting [2147483647,20,93,80,2147483647,39,15,69], 24 unfound
100 ops are blocked > 262.144 sec
100 ops are blocked > 262.144 sec on osd.95
1 osds have slow requests
recovery 352034/722671544 objects degraded (0.049%)
recovery 1404947/722671544 objects misplaced (0.194%)
recovery 192/90258551 unfound (0.000%)
7 scrub errors
too many PGs per OSD (753 > max 300)
noout flag(s) set

pg query for 15.7fb - http://paste.ubuntu.com/25223721/
pg query for 15.ee - http://paste.ubuntu.com/25223724/
pg query for 15.33c - http://paste.ubuntu.com/25223728/
pg query for 15.38a - http://paste.ubuntu.com/25223731/
osd tree, ceph -s, and ceph osd dump output - http://paste.ubuntu.com/25223759/

The OSD debug level was cranked up to 30 for a few of the OSDs, and their logs were uploaded as a gzipped tarball (large download, ~1.1 GB):
http://people.canonical.com/~wolsen/ceph-stuck-ec-pool/ceph-osd-logs.2017-07-31.tgz

Thanks,

--
Billy Olsen