Hi list,
we're facing unexpected recovery behavior on an upgraded cluster
(Luminous -> Nautilus).
We added new servers running Nautilus to the existing Luminous cluster
so we could first replace the MONs step by step. Then we moved the old
servers to a new root in the crush map and added the new OSDs to the
default root, so the data would only have to be rebalanced once.
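Roughly, the crush changes looked like this (just a sketch using the
bucket names from the osd tree below, not a literal transcript):

  # move an old host bucket from the default root into the new root
  ceph osd crush move ceph01 root=root-ec
  # new hosts end up under the default root, e.g.
  ceph osd crush move ceph04 root=default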
This almost worked as planned, except for many slow and stuck
requests. We did this after business hours, so the impact was
negligible and we didn't really investigate; the goal was to finish
the rebalancing.
But only two days later one of the new OSDs (osd.30) started
reporting errors, so we need to replace that disk.
The replacement disk (osd.0) has been added with an initial crush
weight of 0 (and a reweight of 0) so we can control the backfill in
small steps.
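To illustrate the intended procedure (a sketch; the actual increments
varied):

  # raise osd.0 from crush weight 0 in small increments, waiting for
  # backfill to settle in between
  ceph osd crush reweight osd.0 0.2
  ceph osd crush reweight osd.0 0.4
  # drain the failing osd.30 via its override (reweight) value
  ceph osd reweight 30 0.5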
This is turning out to be harder than it should be (and harder than
we have experienced so far): no matter how small the steps are, the
cluster immediately reports slow requests. We can't disrupt the
production environment, so we have cancelled the backfill/recovery for
now. This procedure has been successful in the past with Luminous,
which is why we're so surprised.
The recovery and backfill parameters are pretty low:
"osd_max_backfills": "1",
"osd_recovery_max_active": "3",
These settings usually allowed a slow backfill while production work
continued; now they don't.
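The values above were read from the running daemons; for reference,
this is roughly how we check and adjust them (illustrative commands,
osd.28 is just an example):

  # check the effective value on a running OSD (run on its host)
  ceph daemon osd.28 config get osd_max_backfills
  # lower recovery load at runtime via the central config (Nautilus) ...
  ceph config set osd osd_recovery_max_active 1
  # ... or by injecting into the running daemons
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'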
Our ceph version is (only the active MDS still runs Luminous; the
designated server is currently being upgraded):
14.2.0-300-gacd2f2b9e1 (acd2f2b9e196222b0350b3b59af9981f91706c7f)
nautilus (stable)
Is there anything we missed that we should be aware of in Nautilus
regarding recovery and replacement scenarios?
We couldn't reduce the reweight of that OSD (osd.30) below 0.16;
anything lower results in slow requests.
During the weight reduction several PGs get stuck in the
activating+remapped state, sometimes only recoverable by restarting
the affected OSD several times. Reducing the crush weight leads to the
same effect.
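The workaround so far is roughly this (sketch, with osd.30 as the
example):

  # list PGs stuck in activating
  ceph pg dump pgs_brief | grep activating
  # restart the affected OSD on its host, sometimes more than once
  systemctl restart ceph-osd@30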
Please note: the old servers in root-ec are going to carry EC-only
OSDs; that's why they're still in the cluster.
Any pointers to what is going wrong here would be highly appreciated!
If you need any other information, I'd be happy to provide it.
Best regards,
Eugen
This is our osd tree:
 ID CLASS   WEIGHT TYPE NAME       STATUS REWEIGHT PRI-AFF
-19       11.09143 root root-ec
 -2        5.54572     host ceph01
  1   hdd  0.92429         osd.1     down        0 1.00000
  4   hdd  0.92429         osd.4       up        0 1.00000
  6   hdd  0.92429         osd.6       up        0 1.00000
 13   hdd  0.92429         osd.13      up        0 1.00000
 16   hdd  0.92429         osd.16      up        0 1.00000
 18   hdd  0.92429         osd.18      up        0 1.00000
 -3        5.54572     host ceph02
  2   hdd  0.92429         osd.2       up        0 1.00000
  5   hdd  0.92429         osd.5       up        0 1.00000
  7   hdd  0.92429         osd.7       up        0 1.00000
 12   hdd  0.92429         osd.12      up        0 1.00000
 17   hdd  0.92429         osd.17      up        0 1.00000
 19   hdd  0.92429         osd.19      up        0 1.00000
 -5              0     host ceph03
 -1       38.32857 root default
-31       10.79997     host ceph04
 25   hdd  3.59999         osd.25      up  1.00000 1.00000
 26   hdd  3.59999         osd.26      up  1.00000 1.00000
 27   hdd  3.59999         osd.27      up  1.00000 1.00000
-34       14.39995     host ceph05
  0   hdd  3.59998         osd.0       up        0 1.00000
 28   hdd  3.59999         osd.28      up  1.00000 1.00000
 29   hdd  3.59999         osd.29      up  1.00000 1.00000
 30   hdd  3.59999         osd.30      up  0.15999       0
-37       10.79997     host ceph06
 31   hdd  3.59999         osd.31      up  1.00000 1.00000
 32   hdd  3.59999         osd.32      up  1.00000 1.00000
 33   hdd  3.59999         osd.33      up  1.00000 1.00000