On 7/18/19 12:21 PM, Eugen Block wrote:
> Hi list,
>
> we're facing an unexpected recovery behavior of an upgraded cluster
> (Luminous -> Nautilus).
>
> We added new servers with Nautilus to the existing Luminous cluster, so
> we could first replace the MONs step by step. Then we moved the old
> servers to a new root in the crush map and added the new OSDs to the
> default root, so we would only need to rebalance the data once. This
> almost worked as planned, except for many slow and stuck requests. We
> did this after business hours, so the impact was negligible and we
> didn't really investigate; the goal was to finish the rebalancing.
>
> But only two days later one of the new OSDs (osd.30) already reported
> errors, so we need to replace that disk.
> The replacement disk (osd.0) has been added with an initial crush weight
> of 0 (also reweight 0) to control the backfill in small steps.
> This turns out to be harder than it should be (and harder than we have
> experienced so far): no matter how small the steps are, the cluster
> immediately reports slow requests. We can't disrupt the production
> environment, so we cancelled the backfill/recovery for now. This
> procedure has been successful in the past with Luminous, which is why
> we're so surprised.
>
> The recovery and backfill parameters are pretty low:
>
> "osd_max_backfills": "1",
> "osd_recovery_max_active": "3",
>
> This usually allowed us a slow backfill while productive work could
> continue; now it doesn't.
>
> Our ceph version is (only the active MDS still runs Luminous, the
> designated server is currently being upgraded):
>
> 14.2.0-300-gacd2f2b9e1 (acd2f2b9e196222b0350b3b59af9981f91706c7f)
> nautilus (stable)
>
> Is there anything we missed that we should be aware of in Nautilus
> regarding recovery and replacement scenarios?
> We couldn't reduce the weight of that OSD below 0.16; everything
> else results in slow requests.
> During the weight reduction several PGs got stuck in
> activating+remapped state, sometimes only recoverable by restarting
> the affected OSD several times. Reducing the crush weight leads to
> the same effect.
>
> Please note: the old servers in root-ec are going to be EC-only OSDs,
> that's why they're still in the cluster.
>
> Any pointers to what goes wrong here would be highly appreciated! If you
> need any other information I'd be happy to provide it.

Have you tried to dump the historic slow ops on the OSDs involved to see
what is going on?

$ ceph daemon osd.X dump_historic_slow_ops

But to be clear, are all the OSDs on Nautilus, or is there a mix of L
and N OSDs?
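A quick sketch of what I would check (osd.30 is just an example here; the
daemon commands go through the admin socket, so run them on the host that
carries that OSD):

# slow ops the OSD has recorded, and what is currently in flight
$ ceph daemon osd.30 dump_historic_slow_ops
$ ceph daemon osd.30 dump_ops_in_flight

# which daemons run which release, to spot a Luminous/Nautilus mix
$ ceph versions

# the release flag the upgrade notes ask you to set once all OSDs run Nautilus
$ ceph osd dump | grep require_osd_release

'ceph versions' should answer the mixed-cluster question right away.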
Wido

> Best regards,
> Eugen
>
>
> This is our osd tree:
>
> ID  CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
> -19       11.09143 root root-ec
>  -2        5.54572     host ceph01
>   1   hdd  0.92429         osd.1     down        0 1.00000
>   4   hdd  0.92429         osd.4       up        0 1.00000
>   6   hdd  0.92429         osd.6       up        0 1.00000
>  13   hdd  0.92429         osd.13      up        0 1.00000
>  16   hdd  0.92429         osd.16      up        0 1.00000
>  18   hdd  0.92429         osd.18      up        0 1.00000
>  -3        5.54572     host ceph02
>   2   hdd  0.92429         osd.2       up        0 1.00000
>   5   hdd  0.92429         osd.5       up        0 1.00000
>   7   hdd  0.92429         osd.7       up        0 1.00000
>  12   hdd  0.92429         osd.12      up        0 1.00000
>  17   hdd  0.92429         osd.17      up        0 1.00000
>  19   hdd  0.92429         osd.19      up        0 1.00000
>  -5              0     host ceph03
>  -1       38.32857 root default
> -31       10.79997     host ceph04
>  25   hdd  3.59999         osd.25      up  1.00000 1.00000
>  26   hdd  3.59999         osd.26      up  1.00000 1.00000
>  27   hdd  3.59999         osd.27      up  1.00000 1.00000
> -34       14.39995     host ceph05
>   0   hdd  3.59998         osd.0       up        0 1.00000
>  28   hdd  3.59999         osd.28      up  1.00000 1.00000
>  29   hdd  3.59999         osd.29      up  1.00000 1.00000
>  30   hdd  3.59999         osd.30      up  0.15999       0
> -37       10.79997     host ceph06
>  31   hdd  3.59999         osd.31      up  1.00000 1.00000
>  32   hdd  3.59999         osd.32      up  1.00000 1.00000
>  33   hdd  3.59999         osd.33      up  1.00000 1.00000

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com