Hello XuYun,

In my experience, I would always disable swap; it won't do any good.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.verges@xxxxxxxx
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx

On Thu, 7 May 2020 at 12:07, XuYun <yunxu@xxxxxx> wrote:
> We had some ping back/front problems after upgrading from filestore to
> bluestore. It turned out to be related to insufficient memory/swap usage.
>
> > On 6 May 2020, at 22:08, Frank Schilder <frans@xxxxxx> wrote:
> >
> > To answer some of my own questions:
> >
> > 1) Setting
> >
> >   ceph osd set noout
> >   ceph osd set nodown
> >   ceph osd set norebalance
> >
> > before restart/re-deployment did not harm. I don't know if it helped,
> > because I didn't retry the procedure that led to OSDs going down. See also
> > point 3 below.
> >
> > 2) A peculiarity of this specific deployment of 2 OSDs was that it was
> > a mix of OSD deployment and restart after a reboot. I'm working on getting
> > this sorted, but that is a different story. For anyone who might find
> > him-/herself in a situation where some OSDs are temporarily down/out with
> > PGs remapped and objects degraded for whatever reason while new OSDs come
> > up, the way to have ceph rescan the down/out OSDs after they come up is to:
> >
> > - "ceph osd crush move" the new OSDs temporarily to a location outside
> >   the crush subtree covering any pools (I have such a parking space in the
> >   crush hierarchy for easy draining and parking of disks)
> > - bring up the down/out OSDs
> > - at this point, the cluster will fall back to the original crush map
> >   that was in place when the OSDs went down/out
> > - the cluster will now find all shards that went orphan, and health will
> >   be restored very quickly
> > - once the cluster is healthy, "ceph osd crush move" the new OSDs back
> >   to their desired location
> > - now you will see remapped PGs/misplaced objects, but no degraded
> >   objects (a command sketch of this sequence follows below)
> >
> > 3) I still don't have an answer as to why long heartbeat ping times were
> > observed. There seems to be a more serious issue, and this will continue in
> > its own thread "Cluster outage due to client IO" to be opened soon.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
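
For reference, a minimal command sketch of the rescan sequence described in
point 2 above. The bucket name "parking", the OSD ids (osd.100, osd.101, 57)
and the host name "ceph-node1" are placeholders, not names from this cluster;
adapt them to your own crush hierarchy and to however you normally start OSDs:

  # Park the newly deployed OSDs outside every crush subtree used by pools.
  # "parking" is a placeholder root bucket; create it once if it does not exist.
  ceph osd crush add-bucket parking root
  ceph osd crush move osd.100 root=parking
  ceph osd crush move osd.101 root=parking

  # Bring the down/out OSDs back up (repeat for every affected OSD id;
  # use whatever mechanism you normally use to start OSDs on that host).
  systemctl start ceph-osd@57

  # The cluster now falls back to the crush map that was in place when those
  # OSDs went down/out, finds the orphaned shards and recovers health quickly.
  ceph health detail          # wait until the degraded objects are gone

  # Once healthy, move the new OSDs to their intended location; this leaves
  # remapped PGs/misplaced objects, but no degraded objects.
  ceph osd crush move osd.100 host=ceph-node1
  ceph osd crush move osd.101 host=ceph-node1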
> > ________________________________________
> > From: Frank Schilder <frans@xxxxxx>
> > Sent: 25 April 2020 15:34:25
> > To: ceph-users
> > Subject: Data loss by adding 2 OSDs causing long heartbeat ping times
> >
> > Dear all,
> >
> > Two days ago I added a very small number of disks to a ceph cluster and ran
> > into a problem I had never seen before when doing that. The entire cluster
> > was deployed with mimic 13.2.2 and recently upgraded to 13.2.8. This is the
> > first time I added OSDs under 13.2.8.
> >
> > I had a few hosts that I needed to add 1 or 2 OSDs to, and I started with
> > one that needed 1. The procedure was as usual:
> >
> >   ceph osd set norebalance
> >   deploy additional OSD
> >
> > The OSD came up and PGs started peering, so far so good. To my surprise,
> > however, I started seeing health warnings about slow ping times:
> >
> >   Long heartbeat ping times on back interface seen, longest is 1171.910 msec
> >   Long heartbeat ping times on front interface seen, longest is 1180.764 msec
> >
> > After peering it looked like things got better, and I waited it out until
> > the messages were gone. This took a really long time, at least 5-10 minutes.
> >
> > I went on to the next host and deployed 2 new OSDs this time. Same as
> > above, but with much worse consequences. Apparently, the ping times
> > exceeded a timeout for a very short moment and an OSD was marked out for
> > ca. 2 seconds. Now all hell broke loose. I got health errors with the
> > dreaded "backfill_toofull", undersized PGs and a large number of degraded
> > objects. I don't know what was causing what, but I ended up with data loss
> > just by adding 2 disks.
> >
> > We have dedicated network hardware and each of the OSD hosts has 20 GBit
> > front and 40 GBit back network capacity (LACP trunking). There are
> > currently no more than 16 disks per server. The disks were added to an SSD
> > pool. There was no traffic nor any other exceptional load on the system. I
> > have Ganglia resource monitoring on all nodes and cannot see a single curve
> > going up. Network, CPU utilisation, load: everything is below measurement
> > accuracy. The hosts and network are quite overpowered and dimensioned to
> > host many more OSDs (in future expansions).
> >
> > I have three questions, ordered by how urgently I need an answer:
> >
> > 1) I need to add more disks next week and need a workaround. Will
> > something like this help avoid the heartbeat time-out:
> >
> >   ceph osd set noout
> >   ceph osd set nodown
> >   ceph osd set norebalance
> >
> > (a set/unset sketch follows at the end of this message)
> >
> > 2) The "lost" shards of the degraded objects were obviously still on the
> > cluster somewhere. Is there any way to force the cluster to rescan OSDs for
> > the shards that went orphan during the incident?
> >
> > 3) This smells a bit like a bug that requires attention. I was probably
> > just lucky that I only lost 1 shard per PG. Has something similar been
> > reported before? Is this fixed in 13.2.10? Is it something new? Any settings
> > that need to be looked at? If logs need to be collected, I can do so during
> > my next attempt. However, I cannot risk the data integrity of a production
> > cluster and will therefore probably not run the original procedure again.
> >
> > Many thanks for your help and best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
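
Regarding the flags in question 1 (reported as doing no harm in answer 1
above), a minimal set/unset sketch; the commands are standard Ceph CLI, and
the ordering around the maintenance step is only a suggestion:

  # Before adding or restarting OSDs: suppress automatic reactions.
  ceph osd set noout
  ceph osd set nodown
  ceph osd set norebalance

  # ... deploy/restart the OSDs and wait for peering to settle ...

  # Afterwards: clear the flags so normal marking and recovery resume.
  ceph osd unset norebalance
  ceph osd unset nodown
  ceph osd unset noout

Note that "nodown" in particular should not stay set longer than necessary,
as it also masks genuinely failed OSDs.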