One reason for such observations is swap usage. If you have swap configured, you should probably disable it. Swap can be useful with ceph, but you really need to know what you are doing and how swap actually works (it is not for providing more RAM as most people tend to believe).
In my case, I have a substantial amount of swap configured. Then one needs to be aware of its impact on certain ceph operations. Code and data that are rarely used, as well as leaked memory, will end up on swap. During normal operations, that is not a problem. However, during exceptional operations, you are likely to end up in a situation where all OSDs try to swap the same code/data in and out at the same time, which can temporarily lead to very large response latencies.
One of these exceptional operations is a large peering operation. The code/data for peering is rarely used, so it will be on swap. The increased latency can be bad enough for MONs to mark OSDs as down for a short while; I have seen that happen. Usually, the cluster recovers very quickly, and this is not a real issue if an actual OSD fails.
When you add or remove disks, however, it can be irritating. The workaround is to set nodown in addition to noout when doing admin work. This not only speeds up peering dramatically, it also makes the cluster ignore the increased heartbeat ping times during the admin operation. I still see the warnings, but no detrimental effects.
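A minimal sketch of that workaround, in case it helps. The helper names (`ceph`, `maintenance_flags`) are made up for illustration; only the `ceph osd set` / `ceph osd unset` commands are real. It runs in dry-run mode by default and just records the commands it would issue, and it unsets the flags again even if the admin work in between fails.

```python
import subprocess
from contextlib import contextmanager


def ceph(args, dry_run=True, log=None):
    """Record `ceph <args>` in `log`; execute it only if dry_run is False."""
    cmd = ["ceph", *args]
    if log is not None:
        log.append(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


@contextmanager
def maintenance_flags(flags=("noout", "nodown"), dry_run=True, log=None):
    """Set the given OSD flags and always unset them again on exit."""
    set_flags = []
    try:
        for flag in flags:
            ceph(["osd", "set", flag], dry_run, log)
            set_flags.append(flag)
        yield
    finally:
        # Unset in reverse order, and only the flags that were actually set.
        for flag in reversed(set_flags):
            ceph(["osd", "unset", flag], dry_run, log)


log = []
with maintenance_flags(log=log):
    log.append("# ... add/remove disks here ...")
print("\n".join(log))
```

Do not leave nodown set longer than needed: while it is set, OSDs that really die are not marked down either.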
In general, deploying swap in a ceph cluster is more the exception than the rule. The most common use is to allow a cluster to recover during a period of increased RAM requirements. There are cases on this list for both MDS and OSD recoveries where having more address space was the only way forward. If deployed during normal operation, swap really needs to be fast and able to handle simultaneous requests from many processes in parallel. Usually, only RAM is fast enough, so don't buy NVMe drives, just buy more RAM. Having some fast drives in stock for emergency swap deployment is a good idea though.
I deployed swap to cope with a memory leak that was present in mimic 13.2.8. Seems to be fixed in 13.2.10. If swap is fast enough, the impact is there but harmless. Swap on a crappy disk is dangerous.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
Sent: 08 January 2021 23:58:43
To: ceph-users@xxxxxxx
Subject: Re: osd gradual reweight question
Hi,
We are replacing HDDs with SSDs, and we first (gradually) drain (reweight) the HDDs in steps of 0.5 until they reach 0 (empty).
Works perfectly.
Then (just for kicks) I tried reducing HDD weight from 3.6 to 0 in one large step. That seemed to have had more impact on the cluster, and we even noticed some OSDs temporarily go down after a few minutes. It all worked out, but the impact seemed much larger.
Please clarify “impact”. Do you mean that client performance was decreased, or something else?
We never had OSDs go down when gradually reducing the weight step by step. This surprised us.
Please also clarify what you mean by going down — do you mean being marked “down” by the mons, or the daemons actually crashing? I’m not being critical — I want to fully understand your situation.
Is it expected that the impact of a sudden reweight from 3.6 to 0 is bigger than a gradual step-by-step decrease?
There are a lot of variables there, so It Depends.
For sure, going in one step means that more PGs will peer at once, which can be expensive. I’ll speculate, with incomplete information, that this is most of what you’re seeing.
I would assume the impact to be similar, only the time it takes to reach HEALTH_OK to be longer.
The end result, yes — the concern is how we get there.
The strategy of incremental downweighting has some advantages:
* If something goes wrong, you can stop without having a huge delta of data to move before health is restored
* Peering is spread out
* Impact on the network and drives *may* be less at a given time
A disadvantage is that you end up moving some data more than once. This was worse with older releases and CRUSH details than with recent deployments.
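To make the incremental approach concrete, here is a rough sketch that only prints the commands for a stepwise drain. It assumes the weights in question are CRUSH weights (3.6 looks like one for a 4 TB drive), so it uses `ceph osd crush reweight`; the OSD id 12 and the function names are hypothetical. In practice you would wait for backfill to finish (e.g. HEALTH_OK) between steps, which this sketch does not do.

```python
from decimal import Decimal


def reweight_steps(start, step):
    """Return the decreasing weights from `start` down to 0, `step` at a time."""
    # Decimal avoids float artefacts such as 3.0999999999999996.
    weight, step = Decimal(str(start)), Decimal(str(step))
    if step <= 0:
        raise ValueError("step must be positive")
    steps = []
    while weight > 0:
        weight = max(weight - step, Decimal("0"))  # never go below 0
        steps.append(weight)
    return steps


def reweight_commands(osd_id, start, step):
    """Build one `ceph osd crush reweight` command per step (not executed)."""
    return [
        f"ceph osd crush reweight osd.{osd_id} {w}"
        for w in reweight_steps(start, step)
    ]


steps = reweight_steps(3.6, 0.5)
commands = reweight_commands(12, 3.6, 0.5)  # osd.12 is a made-up example id
for command in commands:
    print(command)
```

With a start of 3.6 and a step of 0.5 this gives eight steps, the last one being a short hop from 0.1 to 0.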
The impact due to data movement can be limited by lowering the usual recovery/backfill settings to 1 from their defaults and, depending on the release, by adjusting osd_op_queue_cut_off.
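As a sketch of what that might look like: I am assuming the "usual" settings meant here are osd_max_backfills and osd_recovery_max_active, and that the release has the centralized `ceph config set` interface (Mimic and later; older releases would need injectargs or ceph.conf instead). The helper name is hypothetical and the snippet only builds the command strings, it does not run them.

```python
# Assumed throttle settings; the original message does not name them.
THROTTLE = {
    "osd_max_backfills": "1",
    "osd_recovery_max_active": "1",
}


def config_set_commands(settings, who="osd"):
    """Build one `ceph config set <who> <option> <value>` command per setting."""
    return [f"ceph config set {who} {name} {value}" for name, value in settings.items()]


throttle_commands = config_set_commands(THROTTLE)
for command in throttle_commands:
    print(command)
```

Remember to restore the previous values once the data movement is done, otherwise later recoveries will be slow as well.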
The impact due to peering can be limited by spreading out peering, either through an incremental process like yours, or by letting the balancer module do the work.
There are other strategies as well, e.g. disabling rebalancing, downweighting OSDs in sequence or a little at a time, then re-enabling rebalancing once they are at 0.
Thanks,
MJ
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx