The weight of osd.53 wasn't 0.0, and the weights of your current osds aren't 1.0. Where are you getting those numbers? If you're getting them from ceph osd tree, then you're looking at
the wrong column: the weight is the second column, right between the id and the osd name.
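For example, in output roughly like this (names and values made up purely for illustration), the 0.90999 numbers in the second column are the weights I mean:
ID  WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 2.72998 root default
 -2 2.72998     host node1
 53 0.90999         osd.53     down        0          1.00000
 71 0.90999         osd.71       up  1.00000          1.00000
The REWEIGHT column further to the right is the separate in/out reweight (0 to 1), which may be where the 0.0 and 1.0 figures you're seeing come from.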
If you did ceph osd rm 53, and all of the other steps, then osd.53 will not show up in ceph osd ls or ceph osd tree. Also, that PG shouldn't be able to be blocked on an osd that is no longer in the cluster. I believe that starting the osd with ceph-osd -i 53 --mkfs
--mkkey started it in your terminal instead of as a daemon. That's why it looked like it was "starting": it was actually just running in your terminal. The data it was receiving slowly is most likely osd maps and other metadata that can't be cleaned up throughout
the cluster until all of the pgs are healthy. The reason it wasn't receiving actual data is that you weighted it to 0.0... aka no PGs whatsoever. The default weight for an osd is its size in TB. It looks like you're doing a basic setup for adding your osds, so
just follow the ceph docs for adding osds to your cluster.
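Off the top of my head, the manual procedure from the docs looks roughly like this (paths, hostname and weight are placeholders; double-check against the docs for your release):
ceph osd create {uuid}
mkdir /var/lib/ceph/osd/ceph-53
# mkfs/mount the data disk on that directory if it's a separate drive
ceph-osd -i 53 --mkfs --mkkey --osd-uuid {uuid}
ceph auth add osd.53 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-53/keyring
ceph osd crush add osd.53 {{ weight_from_tree_command }} host={hostname}
# then start the osd daemon with your init system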
The nobackfill and norecover flags just stop data from moving around while you do this; since you're adding the osd back in with the same weight, that means less useless data movement.
From: Salwasser, Zac [zsalwass@xxxxxxxxxx]
Sent: Thursday, July 21, 2016 2:11 PM
To: David Turner; ceph-users@xxxxxxxxxxxxxx
Cc: Heller, Chris
Subject: Re: Uncompactable Monitor Store at 69GB -- Re: Cluster in warn state, not sure what to do next.
Ok, I’ve gotten as far as “ceph osd rm 53”. The tree command showed a weight of 0, along with two other “down” osds, both of which I “rebuilt” two days ago (more on this later).
One of the (formerly) two down pgs is still down.
When I run a query on it, I get:
{
"state": "down+incomplete",
"snap_trimq": "[]",
"epoch": 818129,
"up": [
71,
213,
55
],
"acting": [
71,
213,
55
],
"info": {
"pgid": "1.716",
. . .
"probing_osds": [
"23",
. . .
"213"
],
"down_osds_we_would_probe": [
53
],
"peering_blocked_by": []
},
{
"name": "Started",
"enter_time": "2016-07-21 19:52:43.509998"
}
],
"agent_state": {}
}
The first time I tried to remove and add osd 53 (and the other two that the tree indicates are down), I *almost* did it exactly the way you specified, but:
1) I tried to get the weight from the osd map, which seemed to show 0.0, so that was what I set the weight to when I added it back in. I *did* re-use the uuid from the osd map
when I re-created the osds.
2) I did not set the nobackfill and norecover flags.
3) I “re-formatted” by running rm -rf on the osd data and osd journal directories and then running “ceph-osd -i 53 --mkfs --mkkey”. Should
I have done something different? Will this cover “Make sure to do whatever you need for dmcrypt, journals, etc that are specific
to your environment.”?
All of the other active osds in our tree have a weight of 1.0.
Apologies for all of the questions. Thank you for your help.
From:
David Turner <david.turner@xxxxxxxxxxxxxxxx>
Date: Thursday, July 21, 2016 at 3:24 PM
To: "Salwasser, Zac" <zsalwass@xxxxxxxxxx>, "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
Cc: "Heller, Chris" <cheller@xxxxxxxxxx>
Subject: RE: Uncompactable Monitor Store at 69GB -- Re: Cluster in warn state, not sure what to do next.
The mon store is important, and since your cluster isn't healthy the mons need to hold onto all of that history to make sure that when things come back up they can replay everything for them.
Once you fix the 2 down and peering PGs, the mon store will fix itself in no time at all. Ceph is rightly refusing to compact that database until your cluster is healthy.
It seems like there are a couple of things that might help your setup. First, I see something very easy to resolve, and that's the blocked requests. Try running the following command:
ceph osd down 71
That command will tell the cluster that osd.71 is down without restarting the actual osd daemon. osd.71 will come back and tell the mons it's actually up, but in the meantime the operations blocked on it will go to a secondary to get their responses and
clear up.
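You can watch them clear with something like:
ceph health detail | grep -i blocked
# or, on the host running osd.71, see what it's still chewing on:
ceph daemon osd.71 dump_ops_in_flight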
Second, osd.53 looks to be causing the never-ending peering. A quick thing to check here: what is your osd_max_backfills set to? That is directly related to how fast osd.53 will fill back up. Something you might do to speed that up is to
inject a higher setting for just osd.53 and not the rest of the cluster:
ceph tell osd.53 injectargs '--osd_max_backfills=20'
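If you want to see what it's currently set to first, run this on the host where osd.53 lives (it goes through the osd's admin socket):
ceph daemon osd.53 config get osd_max_backfills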
If this is the problem and the cluster is just waiting for osd.53 to finish backfilling, then this will get you there faster. I'm unfamiliar with the strategy you used to rebuild the data for osd.53. I would have removed the osd from the cluster and added
it back in with the same weight. That way the osd would start right away and you would see the pgs backfilling onto the osd as opposed to it sitting in a perpetual "booting" state.
To remove the osd with minimal impact to the cluster, the following commands should get you there.
ceph osd tree | grep 'osd.53 '
ceph osd set nobackfill
ceph osd set norecover
#on the host with osd.53, stop the daemon
ceph osd down 53
ceph osd out 53
ceph osd crush remove osd.53
ceph auth rm osd.53
ceph osd rm 53
At this point osd.53 is completely removed from the cluster and you have the original weight of the osd to set it to when you bring the osd back in. The down and peering PGs should now be resolved. Now, completely re-format and add the osd back into the cluster.
Make sure to do whatever you need for dmcrypt, journals, etc that are specific to your environment. Once the osd is back in the cluster, up and in, reweight the osd to what it was before you removed it and unset norecover and nobackfill.
ceph osd crush reweight osd.53 {{ weight_from_tree_command }}
ceph osd unset nobackfill
ceph osd unset norecover
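Once it's filling back up, something along these lines is a quick sanity check that the osd came back the way you expect and that backfill is actually moving (illustrative, not exhaustive):
ceph osd tree | grep 'osd.53 '   # should be up, with the weight you set
ceph -s                          # watch the down+peering count drop
ceph pg dump_stuck inactive      # should eventually come back empty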
At this point everything is back to the way it was and the osd should start receiving data. The only data movement should be refilling osd.53 with the data it used to have and everything else should stay the same. Increasing the backfills for this osd will
help it fill up faster, but it will be slower for client io if you do. The mon stores will remain "too big" until after backfilling onto osd.53 finishes, but once the data stops moving around and all of your osds are up and in, the mon stores will compact
in no time.
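Once you're back to HEALTH_OK, you can give the compaction a nudge and confirm the store actually shrank with something like this (assuming the default mon data path; substitute your own mon id):
ceph tell mon.a65-121-158-160 compact
du -sh /var/lib/ceph/mon/ceph-a65-121-158-160/store.db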
I hope this helps. Ask questions if you have any, and never run a command on your cluster that you don't understand.
David Turner
From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of
Salwasser, Zac [zsalwass@xxxxxxxxxx]
Sent: Thursday, July 21, 2016 12:54 PM
To: ceph-users@xxxxxxxxxxxxxx
Cc: Heller, Chris
Subject: Uncompactable Monitor Store at 69GB -- Re: Cluster in warn state, not sure what to do next.
Rephrasing for brevity – I have a monitor store that is 69GB and won’t compact any further on restart or with ‘tell compact’. Has anyone dealt with this before?
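Concretely, what I’ve been trying is roughly this (the compact-on-restart part being the ceph.conf option, set before restarting the mon):
ceph tell mon.a65-121-158-160 compact
[mon]
    mon compact on start = true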
From:
"Salwasser, Zac" <zsalwass@xxxxxxxxxx>
Date: Thursday, July 21, 2016 at 1:18 PM
To: "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
Cc: "Salwasser, Zac" <zsalwass@xxxxxxxxxx>, "Heller, Chris" <cheller@xxxxxxxxxx>
Subject: Cluster in warn state, not sure what to do next.
I have a cluster that has been in an unhealthy state for a month or so. The OSDs were flapping because the user they run as didn’t have access to enough file handles, but it took us a while to realize this, and we appear
to have done a lot of damage to the state of the monitor store in the meantime.
I’ve been trying to tackle one issue at a time, starting with the size of the monitor store. Compaction, whether compact-on-restart or compaction as a ‘tell’ operation, does not shrink the monitor store at all from its present size.
Having no luck getting the monitor store to shrink, I switched gears to troubleshooting the down placement groups. There are two remaining that I cannot fix, and they both claim to be blocked from peering by the same osd (osd.53).
Two days ago, I removed the osd data for osd.53 and restarted it after a ‘mkfs’ operation. It has been in the “booting” state ever since, although there is now 72GB of data in the osd data partition for osd.53,
indicating that some sort of partial “backfilling” has taken place. Watching the host file system indicates that any data coming into that partition at this point is only trickling in.
Here is the output of “ceph health detail”. I’m wondering if anyone would be willing to engage with me to at least get me unstuck. I am on #ceph as salwasser.
HEALTH_WARN 2 pgs down; 2 pgs peering; 2 pgs stuck inactive; 2 pgs stuck unclean; 15 requests are blocked > 32 sec; 1 osds have slow requests; mds0: Behind on trimming (367/30); mds-1: Behind on trimming (364/30); mon.a65-121-158-160 store is getting too big! 74468 MB >= 15360 MB; mon.a65-121-158-161 store is getting too big! 73881 MB >= 15360 MB; mon.a65-121-158-195 store is getting too big! 64963 MB >= 15360 MB; mon.a65-121-158-196 store is getting too big! 64023 MB >= 15360 MB; mon.a65-121-158-197 store is getting too big! 63632 MB >= 15360 MB
pg 4.285 is stuck inactive since forever, current state down+peering, last acting [28,122,114]
pg 1.716 is stuck inactive for 969017.268003, current state down+peering, last acting [71,213,55]
pg 4.285 is stuck unclean since forever, current state down+peering, last acting [28,122,114]
pg 1.716 is stuck unclean for 969351.417382, current state down+peering, last acting [71,213,55]
pg 1.716 is down+peering, acting [71,213,55]
pg 4.285 is down+peering, acting [28,122,114]
5 ops are blocked > 4194.3 sec
10 ops are blocked > 2097.15 sec
5 ops are blocked > 4194.3 sec on osd.71
10 ops are blocked > 2097.15 sec on osd.71
1 osds have slow requests
mds0: Behind on trimming (367/30)(max_segments: 30, num_segments: 367)
mds-1: Behind on trimming (364/30)(max_segments: 30, num_segments: 364)
mon.a65-121-158-160 store is getting too big! 74468 MB >= 15360 MB -- 53% avail
mon.a65-121-158-161 store is getting too big! 73881 MB >= 15360 MB -- 73% avail
mon.a65-121-158-195 store is getting too big! 64963 MB >= 15360 MB -- 81% avail
mon.a65-121-158-196 store is getting too big! 64023 MB >= 15360 MB -- 81% avail
mon.a65-121-158-197 store is getting too big! 63632 MB >= 15360 MB -- 81% avail