OSD Recovery Delay Start

We are having a lot of trouble with the SSD OSDs in our cache tier when
they reboot. Booting an OSD causes massive blocked I/O, and the entire
cluster's I/O nearly stalls even when the OSD is only down for 60
seconds.

I have noticed that when the OSD starts it uses massive amounts of
RAM: for the one-minute test it used almost 8 GB, another one earlier
this morning used 14 GB, and some last night were in the 10 GB range.
During this time the process is not using much CPU, but the disks are
very busy, writing a good 120-250 MB/s at hundreds to low thousands of
IOPS. Once the memory usage gets down to about 1.5 GB, the blocked I/O
starts clearing slowly. At first I thought this was due to preloading
jemalloc, but it also happens without it.

Looking through [1], I thought that setting osd recovery delay start
to 60 seconds or longer would allow the OSD to come up, join the
cluster, and do any housekeeping before being marked in and trying to
service I/O requests. However, setting the value to 60 does nothing:
we see recovery operations start less than 30 seconds after the
monitor shows the boot message, and the OSD log does not show any kind
of delay either.

Is there a bug here or am I understanding this option incorrectly?

What I'm looking for is something to delay any I/O until peering is
complete, the PGs have been scanned, and all of the housekeeping is
done, so that the only load on the OSD/disk is client/recovery I/O. I
don't want it trying to do both at the same time.
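
The closest manual approximation we can think of is gating recovery
around the restart with the standard cluster flags, something like the
sketch below (the OSD id is a placeholder, and the restart line
depends on the init system; the sysvinit wrapper is shown only as an
example):

    ceph osd set noout           # don't re-map data for a short, planned restart
    ceph osd set norecover       # hold off recovery ops
    ceph osd set nobackfill      # hold off backfill ops
    service ceph restart osd.12  # restart the OSD (id 12 is just an example)
    # watch `ceph -s` until peering settles, then re-enable recovery
    ceph osd unset norecover
    ceph osd unset nobackfill
    ceph osd unset noout

but that is a lot of manual choreography for what osd recovery delay
start looks like it should already do.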

Once the OSD finally comes in and the blocked I/O clears, we can
manage backfilling and recovery without much impact to the cluster; it
is just the initial minutes of terror (dozens of requests blocked for
more than 500 seconds) that we can't figure out how to get rid of. I
understand that there will be some impact from recovery, but on a
cluster that averages about 10K IOPS, we dropped below 5K for 5
minutes for a single OSD that was down for 60 seconds. A host with two
SSDs brought the cluster below 2K IOPS for 15 minutes, and it took ten
minutes to get back to normal performance.
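
For completeness, the backfill/recovery management I mean above is
just the usual throttles; the values below are an example of the kind
of thing we tune, not a recommendation:

    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'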

[1] http://docs.ceph.com/docs/v0.94/rados/configuration/osd-config-ref/#recovery

Thanks,
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1