Hello XuYun,

In my experience, I would always disable swap; it won't do any good.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.verges@xxxxxxxx
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx

On Thu, 7 May 2020 at 12:07, XuYun <yunxu@xxxxxx> wrote:
> We had some ping back/front problems after upgrading from filestore to
> bluestore. It turned out to be related to insufficient memory/swap usage.
>
> > On 6 May 2020, at 22:08, Frank Schilder <frans@xxxxxx> wrote:
> >
> > To answer some of my own questions:
> >
> > 1) Setting
> >
> >   ceph osd set noout
> >   ceph osd set nodown
> >   ceph osd set norebalance
> >
> > before restart/re-deployment did not harm. I don't know if it helped,
> > because I didn't retry the procedure that led to OSDs going down. See also
> > point 3 below.
> >
> > 2) A peculiarity of this specific deployment of 2 OSDs was that it was
> > a mix of OSD deployment and restart after a reboot. I'm working on getting
> > this sorted, but that is a different story. For anyone who might find
> > him-/herself in a situation where some OSDs are temporarily down/out with
> > PGs remapped and objects degraded for whatever reason while new OSDs come
> > up, the way to have ceph rescan the down/out OSDs after they come up is to:
> >
> > - "ceph osd crush move" the new OSDs temporarily to a location outside
> >   the crush subtree covering any pools (I have such a parking space in the
> >   crush hierarchy for easy draining and parking of disks)
> > - bring up the down/out OSDs
> > - at this point, the cluster will fall back to the original crush map
> >   that was in place when the OSDs went down/out
> > - the cluster will now find all shards that went orphan, and health will
> >   be restored very quickly
> > - once the cluster is healthy, "ceph osd crush move" the new OSDs back
> >   to their desired location
> > - now you will see remapped PGs/misplaced objects, but no degraded
> >   objects (a command sketch of this sequence follows below)
> >
> > 3) I still don't have an answer as to why long heartbeat ping times were
> > observed. There seems to be a more serious issue, and this will continue in
> > its own thread "Cluster outage due to client IO" to be opened soon.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
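
For reference, a minimal command sketch of the rescan sequence described in
point 2 above. The bucket name "parking", the OSD ids (osd.100, osd.101, 57)
and the host name "ceph-node1" are placeholders, not names from this cluster;
adapt them to your own crush hierarchy and to however you normally start OSDs:

  # Park the newly deployed OSDs outside every crush subtree used by pools.
  # "parking" is a placeholder root bucket; create it once if it does not exist.
  ceph osd crush add-bucket parking root
  ceph osd crush move osd.100 root=parking
  ceph osd crush move osd.101 root=parking

  # Bring the down/out OSDs back up (repeat for every affected OSD id;
  # use whatever mechanism you normally use to start OSDs on that host).
  systemctl start ceph-osd@57

  # The cluster now falls back to the crush map that was in place when those
  # OSDs went down/out, finds the orphaned shards and recovers health quickly.
  ceph health detail          # wait until the degraded objects are gone

  # Once healthy, move the new OSDs to their intended location; this leaves
  # remapped PGs/misplaced objects, but no degraded objects.
  ceph osd crush move osd.100 host=ceph-node1
  ceph osd crush move osd.101 host=ceph-node1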
> > ________________________________________
> > From: Frank Schilder <frans@xxxxxx>
> > Sent: 25 April 2020 15:34:25
> > To: ceph-users
> > Subject: Data loss by adding 2 OSDs causing long heartbeat ping times
> >
> > Dear all,
> >
> > Two days ago I added a very small number of disks to a ceph cluster and ran
> > into a problem I had never seen before when doing that. The entire cluster
> > was deployed with mimic 13.2.2 and recently upgraded to 13.2.8. This is the
> > first time I added OSDs under 13.2.8.
> >
> > I had a few hosts that I needed to add 1 or 2 OSDs to, and I started with
> > one that needed 1. The procedure was as usual:
> >
> >   ceph osd set norebalance
> >   deploy additional OSD
> >
> > The OSD came up and PGs started peering, so far so good. To my surprise,
> > however, I started seeing health warnings about slow ping times:
> >
> >   Long heartbeat ping times on back interface seen, longest is 1171.910 msec
> >   Long heartbeat ping times on front interface seen, longest is 1180.764 msec
> >
> > After peering it looked like things got better, and I waited it out until
> > the messages were gone. This took a really long time, at least 5-10 minutes.
> >
> > I went on to the next host and deployed 2 new OSDs this time. Same as
> > above, but with much worse consequences. Apparently, the ping times
> > exceeded a timeout for a very short moment and an OSD was marked out for
> > ca. 2 seconds. Now all hell broke loose. I got health errors with the
> > dreaded "backfill_toofull", undersized PGs and a large number of degraded
> > objects. I don't know what was causing what, but I ended up with data loss
> > just by adding 2 disks.
> >
> > We have dedicated network hardware and each of the OSD hosts has 20 GBit
> > front and 40 GBit back network capacity (LACP trunking). There are
> > currently no more than 16 disks per server. The disks were added to an SSD
> > pool. There was no traffic nor any other exceptional load on the system. I
> > have Ganglia resource monitoring on all nodes and cannot see a single curve
> > going up. Network, CPU utilisation, load: everything is below measurement
> > accuracy. The hosts and network are quite overpowered and dimensioned to
> > host many more OSDs (in future expansions).
> >
> > I have three questions, ordered by how urgently I need an answer:
> >
> > 1) I need to add more disks next week and need a workaround. Will
> > something like this help avoid the heartbeat time-out:
> >
> >   ceph osd set noout
> >   ceph osd set nodown
> >   ceph osd set norebalance
> >
> > (a set/unset sketch follows at the end of this message)
> >
> > 2) The "lost" shards of the degraded objects were obviously still on the
> > cluster somewhere. Is there any way to force the cluster to rescan OSDs for
> > the shards that went orphan during the incident?
> >
> > 3) This smells a bit like a bug that requires attention. I was probably
> > just lucky that I only lost 1 shard per PG. Has something similar been
> > reported before? Is this fixed in 13.2.10? Is it something new? Any settings
> > that need to be looked at? If logs need to be collected, I can do so during
> > my next attempt. However, I cannot risk the data integrity of a production
> > cluster and will therefore probably not run the original procedure again.
> >
> > Many thanks for your help and best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
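
Regarding the flags in question 1 (reported as doing no harm in answer 1
above), a minimal set/unset sketch; the commands are standard Ceph CLI, and
the ordering around the maintenance step is only a suggestion:

  # Before adding or restarting OSDs: suppress automatic reactions.
  ceph osd set noout
  ceph osd set nodown
  ceph osd set norebalance

  # ... deploy/restart the OSDs and wait for peering to settle ...

  # Afterwards: clear the flags so normal marking and recovery resume.
  ceph osd unset norebalance
  ceph osd unset nodown
  ceph osd unset noout

Note that "nodown" in particular should not stay set longer than necessary,
as it also masks genuinely failed OSDs.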