Morning all,
This has been a rough couple of days. We thought we had resolved all our performance issues by moving the CephFS metadata to some write-intensive Intel SSDs, but what we didn't notice was that Ceph labeled them as HDDs (thanks, Dell RAID controller).
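As far as we understand it, correcting the class by hand should look roughly like the following (Luminous-style commands; osd.35 is just one of the OSDs under the ssds root in the tree below that shows class hdd, repeat per OSD). This is a sketch of our understanding, not something we have run against these OSDs yet:

  ceph osd crush rm-device-class osd.35        # class is sticky, has to be cleared first
  ceph osd crush set-device-class ssd osd.35   # re-tag with the correct class
  ceph osd crush tree --show-shadow            # confirm the shadow hierarchy picked it up

If any CRUSH rule filters on device class, changing the class will of course trigger data movement.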
We believe this caused read lock errors and resulted in the journal growing from 700 MB to 1 TB in two hours (basically over lunch). We tried to migrate and then stop everything before the OSDs reached full status, but failed.
Over the last 12 hours the data has been migrated from the SSDs back to spinning disks, but the MDS servers are now reporting that two ranks are damaged.
We are running a backup of the metadata pool but wanted to know what the list thinks the next steps should be. I have attached the errors we see in the logs as well as our OSD tree, ceph.conf (comments removed), and ceph fs dump.
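(To be concrete about what we mean by the backup, a whole-pool export is what we have in mind, roughly the following; cephfs_metadata and the output path are placeholders for our actual pool name and backup location:

  rados -p cephfs_metadata export /backup/cephfs_metadata.bin

If a plain rados export is not a sufficient safety net before attempting repairs, please say so.)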
Combined logs (after marking the damaged ranks as repaired to see if that would rescue us):
Nov 1 10:07:02 ceph-p-mds2 ceph-mds: 2018-11-01 10:07:02.045499 7f68db7a3700 -1 mds.4.purge_queue operator(): Error -108 loading Journaler
Nov 1 10:26:40 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:40.968143 7fa3b57ce700 -1 log_channel(cluster) log [ERR] : Health check update: 1 mds daemon damaged (MDS_DAMAGE)
Nov 1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.914934 7f6dacd69700 -1 mds.1.journaler.mdlog(ro) try_read_entry: decode error from _is_readable
Nov 1 10:26:47 ceph-storage2 ceph-mds: mds.1 10.141.255.202:6898/1492854021 1 : Error loading MDS rank 1: (22) Invalid argument
Nov 1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.914949 7f6dacd69700 0 mds.1.log _replay journaler got error -22, aborting
Nov 1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.915745 7f6dacd69700 -1 log_channel(cluster) log [ERR] : Error loading MDS rank 1: (22) Invalid argument
Nov 1 10:26:47 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:47.999432 7fa3b57ce700 -1 log_channel(cluster) log [ERR] : Health check update: 2 mds daemons damaged (MDS_DAMAGE)
Nov 1 10:26:55 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:55.026231 7fa3b57ce700 -1 log_channel(cluster) log [ERR] : Health check update: 1 mds daemon damaged (MDS_DAMAGE)
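(If it helps diagnosis, we believe the journal state can be inspected read-only with something like the following -- assuming the filesystem is named cephfs as the status output below suggests, and that this release's cephfs-journal-tool accepts --rank; older builds may only operate on rank 0, so check the man page for your version:

  cephfs-journal-tool --rank=cephfs:1 journal inspect   # reports whether the journal is readable and where damage starts
  cephfs-journal-tool --rank=cephfs:1 header get        # dumps the journal header and expire/write positions

Happy to run this for each damaged rank and post the output if that is useful.)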
Ceph status:
  cluster:
    id:     6a2e8f21-bca2-492b-8869-eecc995216cc
    health: HEALTH_ERR
            1 filesystem is degraded
            2 mds daemons damaged

  services:
    mon: 3 daemons, quorum ceph-p-mon2,ceph-p-mon1,ceph-p-mon3
    mgr: ceph-p-mon1(active), standbys: ceph-p-mon2
    mds: cephfs-3/5/5 up {0=ceph-storage3=up:resolve,2=ceph-p-mon3=up:resolve,4=ceph-p-mds1=up:resolve}, 3 up:standby, 2 damaged
    osd: 170 osds: 167 up, 158 in

  data:
    pools:   7 pools, 7520 pgs
    objects: 188.46M objects, 161TiB
    usage:   275TiB used, 283TiB / 558TiB avail
    pgs:     7511 active+clean
             9    active+clean+scrubbing+deep

  io:
    client: 0B/s rd, 17.2KiB/s wr, 0op/s rd, 1op/s wr
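Our working assumption for the next steps, cribbed from the disaster-recovery docs and not yet executed, is roughly the following per damaged rank (rank 1 as the example, fs name cephfs as above, --rank syntax as assumed above, and the backup path just an example):

  cephfs-journal-tool --rank=cephfs:1 journal export /backup/cephfs.mds1.journal.bin   # keep a copy before touching anything
  cephfs-journal-tool --rank=cephfs:1 event recover_dentries summary                   # write recoverable events back into the metadata pool
  cephfs-journal-tool --rank=cephfs:1 journal reset                                    # discard the unreadable remainder of the journal
  cephfs-table-tool all reset session
  ceph mds repaired cephfs:1

Given the purge_queue error -108 in the logs above, we assume the purge queue journal may need the same treatment via --journal=purge_queue, but that is exactly the kind of thing we would like confirmed before running anything destructive.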
Ceph OSD Tree:
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-10 0 root deefault
-9 5.53958 root ssds
-11 1.89296 host ceph-cache1
35 hdd 1.09109 osd.35 up 0 1.00000
181 hdd 0.26729 osd.181 up 0 1.00000
182 hdd 0.26729 osd.182 down 0 1.00000
183 hdd 0.26729 osd.183 down 0 1.00000
-12 1.75366 host ceph-cache2
46 hdd 1.09109 osd.46 up 0 1.00000
185 hdd 0.26729 osd.185 down 0 1.00000
186 hdd 0.12799 osd.186 up 0 1.00000
187 hdd 0.26729 osd.187 up 0 1.00000
-13 1.89296 host ceph-cache3
60 hdd 1.09109 osd.60 up 0 1.00000
189 hdd 0.26729 osd.189 up 0 1.00000
190 hdd 0.26729 osd.190 up 0 1.00000
191 hdd 0.26729 osd.191 up 0 1.00000
-5 4.33493 root ssds-ro
-6 1.44498 host ceph-storage1-ssd
85 ssd 0.72249 osd.85 up 1.00000 1.00000
89 ssd 0.72249 osd.89 up 1.00000 1.00000
-7 1.44498 host ceph-storage2-ssd
5 ssd 0.72249 osd.5 up 1.00000 1.00000
68 ssd 0.72249 osd.68 up 1.00000 1.00000
-8 1.44498 host ceph-storage3-ssd
160 ssd 0.72249 osd.160 up 1.00000 1.00000
163 ssd 0.72249 osd.163 up 1.00000 1.00000
-1 552.07568 root default
-2 177.96744 host ceph-storage1
0 hdd 3.63199 osd.0 up 1.00000 1.00000
1 hdd 3.63199 osd.1 up 1.00000 1.00000
3 hdd 3.63199 osd.3 up 1.00000 1.00000
4 hdd 3.63199 osd.4 up 1.00000 1.00000
6 hdd 3.63199 osd.6 up 1.00000 1.00000
8 hdd 3.63199 osd.8 up 1.00000 1.00000
11 hdd 3.63199 osd.11 up 1.00000 1.00000
13 hdd 3.63199 osd.13 up 1.00000 1.00000
15 hdd 3.63199 osd.15 up 1.00000 1.00000
18 hdd 3.63199 osd.18 up 1.00000 1.00000
20 hdd 3.63199 osd.20 up 1.00000 1.00000
22 hdd 3.63199 osd.22 up 1.00000 1.00000
25 hdd 3.63199 osd.25 up 1.00000 1.00000
27 hdd 3.63199 osd.27 up 1.00000 1.00000
29 hdd 3.63199 osd.29 up 1.00000 1.00000
32 hdd 3.63199 osd.32 up 1.00000 1.00000
34 hdd 3.63199 osd.34 up 1.00000 1.00000
36 hdd 3.63199 osd.36 up 1.00000 1.00000
39 hdd 3.63199 osd.39 up 1.00000 1.00000
41 hdd 3.63199 osd.41 up 1.00000 1.00000
43 hdd 3.63199 osd.43 up 1.00000 1.00000
48 hdd 3.63199 osd.48 up 1.00000 1.00000
50 hdd 3.63199 osd.50 up 1.00000 1.00000
52 hdd 3.63199 osd.52 up 1.00000 1.00000
55 hdd 3.63199 osd.55 up 1.00000 1.00000
62 hdd 3.63199 osd.62 up 1.00000 1.00000
65 hdd 3.63199 osd.65 up 1.00000 1.00000
66 hdd 3.63199 osd.66 up 1.00000 1.00000
67 hdd 3.63199 osd.67 up 1.00000 1.00000
70 hdd 3.63199 osd.70 up 1.00000 1.00000
72 hdd 3.63199 osd.72 up 1.00000 1.00000
74 hdd 3.63199 osd.74 up 1.00000 1.00000
76 hdd 3.63199 osd.76 up 1.00000 1.00000
79 hdd 3.63199 osd.79 up 1.00000 1.00000
92 hdd 3.63199 osd.92 up 1.00000 1.00000
94 hdd 3.63199 osd.94 up 1.00000 1.00000
97 hdd 3.63199 osd.97 up 1.00000 1.00000
99 hdd 3.63199 osd.99 up 1.00000 1.00000
101 hdd 3.63199 osd.101 up 1.00000 1.00000
104 hdd 3.63199 osd.104 up 1.00000 1.00000
107 hdd 3.63199 osd.107 up 1.00000 1.00000
111 hdd 3.63199 osd.111 up 1.00000 1.00000
112 hdd 3.63199 osd.112 up 1.00000 1.00000
114 hdd 3.63199 osd.114 up 1.00000 1.00000
117 hdd 3.63199 osd.117 up 1.00000 1.00000
119 hdd 3.63199 osd.119 up 1.00000 1.00000
131 hdd 3.63199 osd.131 up 1.00000 1.00000
137 hdd 3.63199 osd.137 up 1.00000 1.00000
139 hdd 3.63199 osd.139 up 1.00000 1.00000
-4 177.96744 host ceph-storage2
7 hdd 3.63199 osd.7 up 1.00000 1.00000
10 hdd 3.63199 osd.10 up 1.00000 1.00000
12 hdd 3.63199 osd.12 up 1.00000 1.00000
14 hdd 3.63199 osd.14 up 1.00000 1.00000
16 hdd 3.63199 osd.16 up 1.00000 1.00000
19 hdd 3.63199 osd.19 up 1.00000 1.00000
21 hdd 3.63199 osd.21 up 1.00000 1.00000
23 hdd 3.63199 osd.23 up 1.00000 1.00000
26 hdd 3.63199 osd.26 up 1.00000 1.00000
28 hdd 3.63199 osd.28 up 1.00000 1.00000
30 hdd 3.63199 osd.30 up 1.00000 1.00000
33 hdd 3.63199 osd.33 up 1.00000 1.00000
37 hdd 3.63199 osd.37 up 1.00000 1.00000
40 hdd 3.63199 osd.40 up 1.00000 1.00000
42 hdd 3.63199 osd.42 up 1.00000 1.00000
44 hdd 3.63199 osd.44 up 1.00000 1.00000
47 hdd 3.63199 osd.47 up 1.00000 1.00000
49 hdd 3.63199 osd.49 up 1.00000 1.00000
51 hdd 3.63199 osd.51 up 1.00000 1.00000
54 hdd 3.63199 osd.54 up 1.00000 1.00000
56 hdd 3.63199 osd.56 up 1.00000 1.00000
57 hdd 3.63199 osd.57 up 1.00000 1.00000
59 hdd 3.63199 osd.59 up 1.00000 1.00000
61 hdd 3.63199 osd.61 up 1.00000 1.00000
63 hdd 3.63199 osd.63 up 1.00000 1.00000
71 hdd 3.63199 osd.71 up 1.00000 1.00000
73 hdd 3.63199 osd.73 up 1.00000 1.00000
75 hdd 3.63199 osd.75 up 1.00000 1.00000
78 hdd 3.63199 osd.78 up 1.00000 1.00000
80 hdd 3.63199 osd.80 up 1.00000 1.00000
81 hdd 3.63199 osd.81 up 1.00000 1.00000
83 hdd 3.63199 osd.83 up 1.00000 1.00000
84 hdd 3.63199 osd.84 up 1.00000 1.00000
90 hdd 3.63199 osd.90 up 1.00000 1.00000
91 hdd 3.63199 osd.91 up 1.00000 1.00000
93 hdd 3.63199 osd.93 up 1.00000 1.00000
96 hdd 3.63199 osd.96 up 1.00000 1.00000
98 hdd 3.63199 osd.98 up 1.00000 1.00000
100 hdd 3.63199 osd.100 up 1.00000 1.00000
102 hdd 3.63199 osd.102 up 1.00000 1.00000
105 hdd 3.63199 osd.105 up 1.00000 1.00000
106 hdd 3.63199 osd.106 up 1.00000 1.00000
108 hdd 3.63199 osd.108 up 1.00000 1.00000
110 hdd 3.63199 osd.110 up 1.00000 1.00000
115 hdd 3.63199 osd.115 up 1.00000 1.00000
116 hdd 3.63199 osd.116 up 1.00000 1.00000
121 hdd 3.63199 osd.121 up 1.00000 1.00000
123 hdd 3.63199 osd.123 up 1.00000 1.00000
132 hdd 3.63199 osd.132 up 1.00000 1.00000
-3 196.14078 host ceph-storage3
2 hdd 3.63199 osd.2 up 1.00000 1.00000
9 hdd 3.63199 osd.9 up 1.00000 1.00000
17 hdd 3.63199 osd.17 up 1.00000 1.00000
24 hdd 3.63199 osd.24 up 1.00000 1.00000
31 hdd 3.63199 osd.31 up 1.00000 1.00000
38 hdd 3.63199 osd.38 up 1.00000 1.00000
45 hdd 3.63199 osd.45 up 1.00000 1.00000
53 hdd 3.63199 osd.53 up 1.00000 1.00000
58 hdd 3.63199 osd.58 up 1.00000 1.00000
64 hdd 3.63199 osd.64 up 1.00000 1.00000
69 hdd 3.63199 osd.69 up 1.00000 1.00000
77 hdd 3.63199 osd.77 up 1.00000 1.00000
82 hdd 3.63199 osd.82 up 1.00000 1.00000
86 hdd 3.63199 osd.86 up 1.00000 1.00000
88 hdd 3.63199 osd.88 up 1.00000 1.00000
95 hdd 3.63199 osd.95 up 1.00000 1.00000
103 hdd 3.63199 osd.103 up 1.00000 1.00000
109 hdd 3.63199 osd.109 up 1.00000 1.00000
113 hdd 3.63199 osd.113 up 1.00000 1.00000
120 hdd 3.63199 osd.120 up 1.00000 1.00000
127 hdd 3.63199 osd.127 up 1.00000 1.00000
134 hdd 3.63199 osd.134 up 1.00000 1.00000
140 hdd 3.63869 osd.140 up 1.00000 1.00000
141 hdd 3.63199 osd.141 up 1.00000 1.00000
143 hdd 3.63199 osd.143 up 1.00000 1.00000
144 hdd 3.63199 osd.144 up 1.00000 1.00000
145 hdd 3.63199 osd.145 up 1.00000 1.00000
146 hdd 3.63199 osd.146 up 1.00000 1.00000
147 hdd 3.63199 osd.147 up 1.00000 1.00000
148 hdd 3.63199 osd.148 up 1.00000 1.00000
149 hdd 3.63199 osd.149 up 1.00000 1.00000
150 hdd 3.63199 osd.150 up 1.00000 1.00000
151 hdd 3.63199 osd.151 up 1.00000 1.00000
152 hdd 3.63199 osd.152 up 1.00000 1.00000
153 hdd 3.63199 osd.153 up 1.00000 1.00000
154 hdd 3.63199 osd.154 up 1.00000 1.00000
155 hdd 3.63199 osd.155 up 1.00000 1.00000
156 hdd 3.63199 osd.156 up 1.00000 1.00000
157 hdd 3.63199 osd.157 up 1.00000 1.00000
158 hdd 3.63199 osd.158 up 1.00000 1.00000
159 hdd 3.63199 osd.159 up 1.00000 1.00000
161 hdd 3.63199 osd.161 up 1.00000 1.00000
162 hdd 3.63199 osd.162 up 1.00000 1.00000
164 hdd 3.63199 osd.164 up 1.00000 1.00000
165 hdd 3.63199 osd.165 up 1.00000 1.00000
167 hdd 3.63199 osd.167 up 1.00000 1.00000
168 hdd 3.63199 osd.168 up 1.00000 1.00000
169 hdd 3.63199 osd.169 up 1.00000 1.00000
170 hdd 3.63199 osd.170 up 1.00000 1.00000
171 hdd 3.63199 osd.171 up 1.00000 1.00000
172 hdd 3.63199 osd.172 up 1.00000 1.00000
173 hdd 3.63199 osd.173 up 1.00000 1.00000
174 hdd 3.63869 osd.174 up 1.00000 1.00000
177 hdd 3.63199 osd.177 up 1.00000 1.00000
# Ceph configuration shared by all nodes
[global]
fsid = 6a2e8f21-bca2-492b-8869-eecc995216cc
public_network = 10.141.0.0/16
cluster_network = 10.85.8.0/22
mon_initial_members = ceph-p-mon1, ceph-p-mon2, ceph-p-mon3
mon_host = 10.141.161.248,10.141.160.250,10.141.167.237
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
# Cephfs needs these to be set to support larger directories
mds_bal_frag = true
allow_dirfrags = true
rbd_default_format = 2
mds_beacon_grace = 60
mds session timeout = 120
log to syslog = true
err to syslog = true
clog to syslog = true
[mds]
[osd]
osd op threads = 32
osd max backfills = 32
# Old method of moving ssds to a pool
[osd.85]
host = ceph-storage1
crush_location = root=ssds host=ceph-storage1-ssd
[osd.89]
host = ceph-storage1
crush_location = root=ssds host=ceph-storage1-ssd
[osd.160]
host = ceph-storage3
crush_location = root=ssds host=ceph-storage3-ssd
[osd.163]
host = ceph-storage3
crush_location = root=ssds host=ceph-storage3-ssd
[osd.166]
host = ceph-storage3
crush_location = root=ssds host=ceph-storage3-ssd
[osd.5]
host = ceph-storage2
crush_location = root=ssds host=ceph-storage2-ssd
[osd.68]
host = ceph-storage2
crush_location = root=ssds host=ceph-storage2-ssd
[osd.87]
host = ceph-storage2
crush_location = root=ssds host=ceph-storage2-ssd
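(Side note on the "# Old method of moving ssds to a pool" entries above: our understanding is that the device-class based replacement would be a CRUSH rule plus a pool assignment along these lines, where the rule name and pool name are placeholders and the root should be whichever one actually holds the SSD hosts -- the tree shows them under ssds-ro even though the conf says root=ssds:

  ceph osd crush rule create-replicated metadata-ssd ssds-ro host ssd
  ceph osd pool set cephfs_metadata crush_rule metadata-ssd

The per-OSD crush_location entries above are what is actually in place today.)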
From: Patrick Donnelly <pdonnell@xxxxxxxxxx>
Sent: Tuesday, October 30, 2018 8:40 PM
To: Rhian Resnick
Cc: Ceph Users
Subject: Re: Removing MDS

On Tue, Oct 30, 2018 at 4:05 PM Rhian Resnick <rresnick@xxxxxxx> wrote:
> We are running into issues deactivating mds ranks. Is there a way to safely forcibly remove a rank?

No, there's no "safe" way to force the issue. The rank needs to come back, flush its journal, and then complete its deactivation.

To get more help, you need to describe your environment, version of Ceph in use, relevant log snippets, etc.

--
Patrick Donnelly