mon_compact_on_start was not changed from the default (false). From the logs, it looks like the monitor with the excessive resource usage (mon1) was up and winning the majority of elections throughout the period of unresponsiveness, with the other monitors occasionally winning an election without mon1 participating (I’m guessing because it failed to respond in time).
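For anyone else digging into this, the live value can be queried through the mon’s admin socket on the host it runs on; something like the below, with the mon ID as a placeholder:

    # Ask the running monitor for its current value of mon_compact_on_start
    ceph daemon mon.mon1 config get mon_compact_on_start
    # expected output with the default:
    # { "mon_compact_on_start": "false" }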
That’s interesting about the false map updates. We had a short networking blip (caused by me) on some monitors shortly before the trouble started, which caused some monitors to start calling frequent (every few seconds) elections. Could this rapid creation of new monmaps have had the same effect as updating pool settings, prompting the monitor to try to clean everything up in one go and causing the observed resource usage and unresponsiveness?
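For anyone else following along, my understanding of the false map update trick you describe is roughly the following; the pool name and value here are just placeholders:

    # Read a pool setting, then set it back to the same value
    # to force a new map epoch without changing anything
    ceph osd pool get rbd min_size
    # min_size: 2
    ceph osd pool set rbd min_size 2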
I’ve been bringing in the storage as you described. I’m in the process of adding 6PB of new storage to a ~10PB (raw) cluster (with ~8PB raw utilisation), so I’m feeling around for the largest backfills we can safely do. I had been weighting up storage in steps that take ~5 days to finish, but had been starting the next reweight as we got to the tail end of the previous one, so not giving the mons time to compact their stores. Although it’s far from ideal (in terms of the total time to get the new storage weighted up), I’ll be letting the mons compact between every backfill until I have a better idea of what went on last week.
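For completeness, the plan is to trigger that compaction manually between backfills rather than restarting mons; roughly the below, where the mon ID is a placeholder and the store path is the default location:

    # Compact the mon store on a live monitor, no restart needed
    ceph tell mon.mon1 compact
    # Check the store size before and after (default store location)
    du -sh /var/lib/ceph/mon/ceph-mon1/store.db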
From: David Turner <drakonstein@xxxxxxxxx>
Generally they clean up slowly by deleting 30 maps every time the maps update. You can speed that up by creating false map updates, with something like updating a pool setting to what it already is. What it sounds like happened to you is that your mon crashed and restarted. If it crashed and has the setting to compact the mon store on start, then it would forcibly go through and clean everything up in one go.

I generally plan my backfilling to not take longer than a week; any longer than that is pretty rough on the mons. You can achieve that by bringing in new storage with a weight of 0.0 and increasing it appropriately, as opposed to just adding it at its full weight and having everything move at once.
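To make that concrete, a rough sketch of the weight-0 approach; the OSD ID, hostname and weight steps here are made up for illustration:

    # Add the new OSD to the CRUSH map with zero weight, so no data moves yet
    ceph osd crush add osd.123 0.0 host=newhost01
    # Raise the weight in steps, letting backfill finish between each step
    ceph osd crush reweight osd.123 1.0
    # ... wait for backfill to complete, then continue ...
    ceph osd crush reweight osd.123 2.0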
On Thu, May 17, 2018 at 12:56 PM Thomas Byrne - UKRI STFC <tom.byrne@xxxxxxxxxx> wrote: