Boom!! Fixed it. Not sure if the behaviour I stumbled across is correct, but this has the potential to break a few things for people moving from Jewel to Luminous if they happened to have a few too many PGs.

Firstly, how I stumbled across it. I whacked the logging up to max on OSD 68 and saw this mentioned in the logs:

  osd.68 106454 maybe_wait_for_max_pg withhold creation of pg 0.1cf: 403 >= 400

That made me search through the code for the warning string:

  https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L4221

which jogged my memory about the changes in Luminous regarding the max-PGs warning, and in particular these two config options:

  mon_max_pg_per_osd
  osd_max_pg_per_osd_hard_ratio

In my cluster I have just over 200 PGs per OSD on average, but the node containing OSD.68 has 8TB disks instead of the 3TB disks in the rest of the cluster, which means its OSDs take a lot more PGs than the average would suggest. In Luminous, 200 x 2 gives a hard limit of 400 PGs per OSD, which is exactly the limit that log message complains about. I set the osd_max_pg_per_osd_hard_ratio option to 3, restarted the OSD, and hey presto, everything fell into line (rough commands below).

Now a question. I get the idea behind these settings: to stop people creating too many pools, or pools with too many PGs. But is it correct that they can break an existing pool which just happens to be creating a new PG instance on an OSD because the CRUSH layout has been modified?
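For reference, this is roughly what the fix looked like at my end. Treat it as a sketch from memory rather than gospel: it assumes a systemd-based install, that you are on the host running osd.68, and that you make the change in ceph.conf before restarting; double-check option and unit names against your own setup.

  # How many PGs each OSD is actually carrying (PGS column)
  ceph osd df tree

  # What the OSD currently thinks the hard ratio is (via its admin socket)
  ceph daemon osd.68 config get osd_max_pg_per_osd_hard_ratio

  # Raise the hard ratio in /etc/ceph/ceph.conf, e.g.
  #   [osd]
  #   osd max pg per osd hard ratio = 3
  # then restart the affected OSD so it peers again
  systemctl restart ceph-osd@68

With the defaults of mon_max_pg_per_osd = 200 and osd_max_pg_per_osd_hard_ratio = 2, the OSD withholds new PG creation once it goes past 400, which is exactly the "403 >= 400" in the log line above.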
Nick

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Nick Fisk

On Tue, Dec 12, 2017 at 12:33 PM Nick Fisk <nick@xxxxxxxxxx> wrote:

Did that fix anything? I don't see anything immediately obvious, but I'm not practiced in quickly reading that pg state output. What's the output of "ceph -s"?

Hi Greg,

No, restarting OSDs didn't seem to help. But I did make some progress late last night. By stopping OSD.68 the cluster unlocks itself and IO can progress. However, as soon as it starts back up, 0.1cf and a couple of other PGs again get stuck in an activating state. If I out the OSD, either with it up or down, then some other PGs get hit by the same problem as CRUSH moves PG mappings around to other OSDs. So there definitely seems to be some sort of weird peering issue somewhere.

I have seen a very similar issue on this cluster before, where after running the crush reweight script to balance OSD utilization the weight got set too low and PGs were unable to peer. I'm not convinced that's what's happening here, as none of the weights have changed, but I'm intending to explore it further just in case.

With 68 down:

  pgs: 1071783/48650631 objects degraded (2.203%)
       5923 active+clean
       399  active+undersized+degraded
       7    active+clean+scrubbing+deep
       7    active+clean+remapped

With it up:

  pgs: 0.047% pgs not active
       67271/48651279 objects degraded (0.138%)
       15602/48651279 objects misplaced (0.032%)
       6051 active+clean
       273  active+recovery_wait+degraded
       4    active+clean+scrubbing+deep
       4    active+remapped+backfill_wait
       3    activating+remapped
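For anyone following along, these are the sort of commands I've been using to see which PGs are stuck and what they're waiting on; nothing exotic, just the usual suspects:

  # Name the PGs that are stuck and not yet active
  ceph health detail
  ceph pg dump_stuck inactive

  # Peering/activation detail for a single PG, including which
  # OSDs it is waiting on
  ceph pg 0.1cf query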
PG Dump:

  ceph pg dump | grep activatin
  dumped all
  2.389     0  0 0 0 0           0 1500 1500 activating+remapped 2017-12-13 11:08:50.990526    76271'34230  106239:160310 [68,60,58,59,29,23] 68 [62,60,58,59,29,23] 62    76271'34230 2017-12-13 09:00:08.359690    76271'34230 2017-12-10 10:05:10.931366
  0.1cf  3947  0 0 0 0 16472186880 1577 1577 activating+remapped 2017-12-13 11:08:50.641034 106236'7512915 106239:6176548 [34,68,8]           34 [34,8,53]           34 106138'7512682 2017-12-13 10:27:37.400613 106138'7512682 2017-12-13 10:27:37.400613
  2.210     0  0 0 0 0           0 1500 1500 activating+remapped 2017-12-13 11:08:50.686193    76271'33304   106239:96797 [68,67,34,36,16,15] 68 [62,67,34,36,16,15] 62    76271'33304 2017-12-12 00:49:21.038437    76271'33304 2017-12-10 16:05:12.751425
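To narrow it down to the PGs that involve OSD 68 specifically, something like the following should do it, though I'm quoting the syntax from memory and it may differ slightly between releases:

  # PGs mapped to osd.68 that are currently activating
  ceph pg ls-by-osd 68 activating

  # PGs where osd.68 is the primary
  ceph pg ls-by-primary 68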