Hi list,
we're facing unexpected recovery behavior on an upgraded cluster
(Luminous -> Nautilus).
We added new servers running Nautilus to the existing Luminous cluster
so we could first replace the MONs step by step. Then we moved the old
servers to a new root in the crush map and added the new OSDs to the
default root, so the data would only have to be rebalanced once.
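Roughly, the crush changes looked like this (just a sketch using the
bucket names from the osd tree below, not a literal transcript):

  # move an old host bucket from the default root into the new root
  ceph osd crush move ceph01 root=root-ec
  # new hosts end up under the default root, e.g.
  ceph osd crush move ceph04 root=default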
This almost worked as planned, except for many slow and stuck
requests. We did this after business hours, so the impact was
negligible and we didn't really investigate; the goal was to finish
the rebalancing.
But only two days later one of the new OSDs (osd.30) started
reporting errors, so we need to replace that disk.
The replacement disk (osd.0) has been added with an initial crush
weight of 0 (and a reweight of 0) so we can control the backfill in
small steps.
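To illustrate the intended procedure (a sketch; the actual increments
varied):

  # raise osd.0 from crush weight 0 in small increments, waiting for
  # backfill to settle in between
  ceph osd crush reweight osd.0 0.2
  ceph osd crush reweight osd.0 0.4
  # drain the failing osd.30 via its override (reweight) value
  ceph osd reweight 30 0.5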
This is turning out to be harder than it should be (and harder than
we have experienced so far): no matter how small the steps are, the
cluster immediately reports slow requests. We can't disrupt the
production environment, so we have cancelled the backfill/recovery for
now. This procedure has been successful in the past with Luminous,
which is why we're so surprised.
The recovery and backfill parameters are pretty low:
"osd_max_backfills": "1",
"osd_recovery_max_active": "3",
These settings usually allowed a slow backfill while production work
continued; now they don't.
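The values above were read from the running daemons; for reference,
this is roughly how we check and adjust them (illustrative commands,
osd.28 is just an example):

  # check the effective value on a running OSD (run on its host)
  ceph daemon osd.28 config get osd_max_backfills
  # lower recovery load at runtime via the central config (Nautilus) ...
  ceph config set osd osd_recovery_max_active 1
  # ... or by injecting into the running daemons
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'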
Our ceph version is (only the active MDS still runs Luminous; the
designated server is currently being upgraded):
14.2.0-300-gacd2f2b9e1 (acd2f2b9e196222b0350b3b59af9981f91706c7f)
nautilus (stable)
Is there anything we missed that we should be aware of in Nautilus
regarding recovery and replacement scenarios?
We couldn't reduce the reweight of that OSD (osd.30) below 0.16;
anything lower results in slow requests.
During the weight reduction several PGs get stuck in the
activating+remapped state, sometimes only recoverable by restarting
the affected OSD several times. Reducing the crush weight leads to the
same effect.
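The workaround so far is roughly this (sketch, with osd.30 as the
example):

  # list PGs stuck in activating
  ceph pg dump pgs_brief | grep activating
  # restart the affected OSD on its host, sometimes more than once
  systemctl restart ceph-osd@30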
Please note: the old servers in root-ec are going to carry EC-only
OSDs; that's why they're still in the cluster.
Any pointers to what is going wrong here would be highly appreciated!
If you need any other information, I'd be happy to provide it.
Best regards,
Eugen
This is our osd tree:
 ID CLASS   WEIGHT TYPE NAME       STATUS REWEIGHT PRI-AFF
-19       11.09143 root root-ec
 -2        5.54572     host ceph01
  1   hdd  0.92429         osd.1     down        0 1.00000
  4   hdd  0.92429         osd.4       up        0 1.00000
  6   hdd  0.92429         osd.6       up        0 1.00000
 13   hdd  0.92429         osd.13      up        0 1.00000
 16   hdd  0.92429         osd.16      up        0 1.00000
 18   hdd  0.92429         osd.18      up        0 1.00000
 -3        5.54572     host ceph02
  2   hdd  0.92429         osd.2       up        0 1.00000
  5   hdd  0.92429         osd.5       up        0 1.00000
  7   hdd  0.92429         osd.7       up        0 1.00000
 12   hdd  0.92429         osd.12      up        0 1.00000
 17   hdd  0.92429         osd.17      up        0 1.00000
 19   hdd  0.92429         osd.19      up        0 1.00000
 -5              0     host ceph03
 -1       38.32857 root default
-31       10.79997     host ceph04
 25   hdd  3.59999         osd.25      up  1.00000 1.00000
 26   hdd  3.59999         osd.26      up  1.00000 1.00000
 27   hdd  3.59999         osd.27      up  1.00000 1.00000
-34       14.39995     host ceph05
  0   hdd  3.59998         osd.0       up        0 1.00000
 28   hdd  3.59999         osd.28      up  1.00000 1.00000
 29   hdd  3.59999         osd.29      up  1.00000 1.00000
 30   hdd  3.59999         osd.30      up  0.15999       0
-37       10.79997     host ceph06
 31   hdd  3.59999         osd.31      up  1.00000 1.00000
 32   hdd  3.59999         osd.32      up  1.00000 1.00000
 33   hdd  3.59999         osd.33      up  1.00000 1.00000