Hi Wido,
thanks for your response.
Have you tried to dump the historic slow ops on the OSDs involved to see
what is going on?
$ ceph daemon osd.X dump_historic_slow_ops
Good question; I don't recall doing that. Maybe my colleague did, but
he's on vacation right now. ;-)
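When he's back we'll collect them; a rough sketch of what we'd run on
each OSD host (assuming the default admin socket location, the paths
may differ):

for sock in /var/run/ceph/ceph-osd.*.asok; do
    echo "== ${sock} =="
    ceph daemon "${sock}" dump_historic_slow_ops
done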
But to be clear, are all the OSDs on Nautilus or is there a mix of L and
N OSDs?
I'll try to clarify: it was (and still is) a mixture of L and N OSDs,
but all L-OSDs were empty at the time. The cluster had already
rebalanced all PGs to the new OSDs, so the L-OSDs were not involved in
this recovery process. We're currently upgrading the remaining servers
to Nautilus; one server with L-OSDs is left, but those OSDs don't store
any objects at the moment (they are in a different root in the crush map).
The recovery eventually finished successfully, but my colleague had to
do it after business hours; maybe that's why he needs his vacation. ;-)
Regards,
Eugen
Quoting Wido den Hollander <wido@xxxxxxxx>:
On 7/18/19 12:21 PM, Eugen Block wrote:
Hi list,
we're facing an unexpected recovery behavior of an upgraded cluster
(Luminous -> Nautilus).
We added new servers with Nautilus to the existing Luminous cluster so
we could first replace the MONs step by step. Then we moved the old
servers to a new root in the crush map and added the new OSDs to the
default root, so the data would only need to be rebalanced once. This
almost worked as planned, except for many slow and stuck requests. We
did this after business hours, so the impact was negligible and we
didn't really investigate; the goal was to finish the rebalancing.
But after only two days one of the new OSDs (osd.30) already reported
errors, so we need to replace that disk.
The replacement disk (osd.0) has been added with an initial crush weight
of 0 (also reweight 0) so we can control the backfill in small steps.
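The idea was to raise the weight in small increments and let the
cluster settle in between, roughly like this (just a sketch; the target
weight and step size are made up for illustration):

for w in 0.5 1.0 1.5 2.0 2.5 3.0 3.59999; do
    ceph osd crush reweight osd.0 ${w}
    # wait for the cluster to settle before the next increment
    while ! ceph health | grep -q HEALTH_OK; do sleep 60; done
done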
This turns out to be harder than it should be (and harder than we have
experienced so far): no matter how small the steps are, the cluster
immediately reports slow requests. We can't disrupt the production
environment, so we have cancelled the backfill/recovery for now. This
procedure has been successful in the past with Luminous, which is why
we're so surprised.
The recovery and backfill parameters are pretty low:
"osd_max_backfills": "1",
"osd_recovery_max_active": "3",
These settings usually allowed a slow backfill while production work
continued; now they don't.
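Just to illustrate how we check what a running OSD actually uses and
how the throttles can be adjusted at runtime in Nautilus (osd.30 and
the values are only examples):

# what a running OSD actually uses
$ ceph daemon osd.30 config get osd_max_backfills
# adjust the throttles cluster-wide at runtime
$ ceph config set osd osd_max_backfills 1
$ ceph config set osd osd_recovery_max_active 3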
Our ceph version is (only the active MDS still runs Luminous; its
designated server is currently being upgraded):
14.2.0-300-gacd2f2b9e1 (acd2f2b9e196222b0350b3b59af9981f91706c7f)
nautilus (stable)
Is there anything we missed that we should be aware of in Nautilus
regarding recovery and replacement scenarios?
We couldn't reduce the weight of that OSD below 0.16; anything lower
results in slow requests.
During the weight reduction several PGs get stuck in the
activating+remapped state, sometimes only recoverable by restarting the
affected OSD several times. Reducing the crush weight leads to the same
effect.
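For reference, this is roughly how we look for and work around those
stuck PGs (a sketch; osd.30 is just the example ID):

# list PGs that are stuck inactive (e.g. activating+remapped)
$ ceph pg dump_stuck inactive
# as a last resort, restart the affected OSD on its host
$ systemctl restart ceph-osd@30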
Please note: the old servers in root-ec are going to host EC-only OSDs;
that's why they're still in the cluster.
Any pointers to what goes wrong here would be highly appreciated! If you
need any other information I'd be happy to provide it.
Have you tried to dump the historic slow ops on the OSDs involved to see
what is going on?
$ ceph daemon osd.X dump_historic_slow_ops
But to be clear, are all the OSDs on Nautilus or is there a mix of L and
N OSDs?
Wido
Best regards,
Eugen
This is our osd tree:
 ID CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
-19       11.09143 root root-ec
 -2        5.54572     host ceph01
  1   hdd  0.92429         osd.1     down        0 1.00000
  4   hdd  0.92429         osd.4       up        0 1.00000
  6   hdd  0.92429         osd.6       up        0 1.00000
 13   hdd  0.92429         osd.13      up        0 1.00000
 16   hdd  0.92429         osd.16      up        0 1.00000
 18   hdd  0.92429         osd.18      up        0 1.00000
 -3        5.54572     host ceph02
  2   hdd  0.92429         osd.2       up        0 1.00000
  5   hdd  0.92429         osd.5       up        0 1.00000
  7   hdd  0.92429         osd.7       up        0 1.00000
 12   hdd  0.92429         osd.12      up        0 1.00000
 17   hdd  0.92429         osd.17      up        0 1.00000
 19   hdd  0.92429         osd.19      up        0 1.00000
 -5              0     host ceph03
 -1       38.32857 root default
-31       10.79997     host ceph04
 25   hdd  3.59999         osd.25      up  1.00000 1.00000
 26   hdd  3.59999         osd.26      up  1.00000 1.00000
 27   hdd  3.59999         osd.27      up  1.00000 1.00000
-34       14.39995     host ceph05
  0   hdd  3.59998         osd.0       up        0 1.00000
 28   hdd  3.59999         osd.28      up  1.00000 1.00000
 29   hdd  3.59999         osd.29      up  1.00000 1.00000
 30   hdd  3.59999         osd.30      up  0.15999       0
-37       10.79997     host ceph06
 31   hdd  3.59999         osd.31      up  1.00000 1.00000
 32   hdd  3.59999         osd.32      up  1.00000 1.00000
 33   hdd  3.59999         osd.33      up  1.00000 1.00000
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com