Re: Help recovering failed cluster


 



Had a little bit of help in IRC and was asked to attach the OSD tree, health detail and crush map. The PG dump is included at the link below, as it is too big to attach directly.



DISCLAIMER: This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received it by mistake, please let us know by email reply and delete it from your system; you should not disseminate, distribute or copy this email.

OSD tree:
ID  WEIGHT   TYPE NAME        UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 49.08995 root default
 -2  3.64000     host zulu
  0  3.64000         osd.0         up  1.00000          1.00000
 -3        0     host yankee
 -4  3.63998     host xray
  2  1.81999         osd.2         up  1.00000          1.00000
  3  1.81999         osd.3         up  1.00000          1.00000
 -5  7.28000     host whiskey
  4  3.64000         osd.4         up  1.00000          1.00000
  5  3.64000         osd.5         up  1.00000          1.00000
 -6  7.28000     host victor
  6  3.64000         osd.6         up  1.00000          1.00000
  7  3.64000         osd.7         up  1.00000          1.00000
 -7  7.28000     host sierra
  8  3.64000         osd.8         up  1.00000          1.00000
  9  3.64000         osd.9         up  1.00000          1.00000
 -8  7.28000     host uniform
 10  3.64000         osd.10        up  1.00000          1.00000
 11  3.64000         osd.11        up  1.00000          1.00000
 -9  5.43999     host alpha
 12  1.81000         osd.12        up  1.00000          1.00000
 13  3.62999         osd.13        up  1.00000          1.00000
-10  3.62000     host bravo
 14  1.81000         osd.14        up  1.00000          1.00000
 15  1.81000         osd.15        up  1.00000          1.00000
-11  3.62999     host charlie
 16  3.62999         osd.16        up  1.00000          1.00000

Health detail:
HEALTH_WARN 5 pgs degraded; 13 pgs down; 48 pgs incomplete; 4 pgs recovering; 1 pgs recovery_wait; 76 pgs stale; 5 pgs stuck degraded; 48 pgs stuck inactive; 76 pgs stuck stale; 53 pgs stuck unclean; 5 pgs stuck undersized; 5 pgs undersized; 402 requests are blocked > 32 sec; 6 osds have slow requests; recovery 14656/6951979 objects degraded (0.211%); recovery 20585/6951979 objects misplaced (0.296%); recovery 5/3348270 unfound (0.000%)
pg 4.2c3 is stuck inactive for 117897.973036, current state incomplete, last acting [11,13,9]
pg 2.201 is stuck inactive since forever, current state incomplete, last acting [16,8]
pg 2.384 is stuck inactive since forever, current state incomplete, last acting [14,13]
pg 2.dd is stuck inactive since forever, current state down+incomplete, last acting [9,16]
pg 2.262 is stuck inactive for 115462.365392, current state incomplete, last acting [10,7]
pg 4.138 is stuck inactive for 117897.966027, current state incomplete, last acting [10,12,0]
pg 2.383 is stuck inactive for 115322.222111, current state incomplete, last acting [10,16]
pg 2.2bc is stuck inactive since forever, current state down+incomplete, last acting [4,11]
pg 4.37a is stuck inactive since forever, current state incomplete, last acting [11,13,7]
pg 2.77 is stuck inactive since forever, current state down+incomplete, last acting [0,5]
pg 2.319 is stuck inactive for 116624.074955, current state incomplete, last acting [10,9]
pg 2.3df is stuck inactive for 115842.671104, current state incomplete, last acting [13,2]
pg 4.c7 is stuck inactive since forever, current state down+incomplete, last acting [7,10,12]
pg 2.126 is stuck inactive since forever, current state down+incomplete, last acting [4,15]
pg 4.120 is stuck inactive for 117897.972022, current state down+incomplete, last acting [11,14,9]
pg 2.62 is stuck inactive since forever, current state down+incomplete, last acting [0,12]
pg 4.c5 is stuck inactive for 115009.385001, current state incomplete, last acting [13,10,3]
pg 2.1e1 is stuck inactive since forever, current state down+incomplete, last acting [14,10]
pg 4.b8 is stuck inactive since forever, current state incomplete, last acting [9,13,10]
pg 2.bb is stuck inactive since forever, current state incomplete, last acting [8,13]
pg 2.1de is stuck inactive for 116357.554959, current state incomplete, last acting [13,5]
pg 4.3c5 is stuck inactive for 115009.385113, current state incomplete, last acting [13,11,7]
pg 4.50 is stuck inactive for 117897.970869, current state incomplete, last acting [10,6,9]
pg 4.170 is stuck inactive since forever, current state incomplete, last acting [15,12,9]
pg 4.292 is stuck inactive for 117897.973271, current state incomplete, last acting [11,6,9]
pg 2.35a is stuck inactive since forever, current state down+incomplete, last acting [6,11]
pg 2.52 is stuck inactive since forever, current state incomplete, last acting [4,13]
pg 4.163 is stuck inactive for 117897.972450, current state incomplete, last acting [11,6,0]
pg 2.227 is stuck inactive for 116357.556589, current state incomplete, last acting [13,7]
pg 2.226 is stuck inactive for 117897.983453, current state incomplete, last acting [13,7]
pg 4.1c6 is stuck inactive for 117897.965525, current state incomplete, last acting [10,7,13]
pg 2.344 is stuck inactive since forever, current state down+incomplete, last acting [8,4]
pg 2.2e0 is stuck inactive since forever, current state incomplete, last acting [2,13]
pg 2.fe is stuck inactive for 115848.766175, current state down+incomplete, last acting [10,4]
pg 4.159 is stuck inactive for 115009.384715, current state incomplete, last acting [13,5,11]
pg 2.3a3 is stuck inactive since forever, current state incomplete, last acting [3,16]
pg 2.35 is stuck inactive since forever, current state incomplete, last acting [15,11]
pg 2.9a is stuck inactive since forever, current state incomplete, last acting [0,12]
pg 4.9c is stuck inactive for 117897.969774, current state incomplete, last acting [10,16,9]
pg 4.21c is stuck inactive for 115009.384541, current state incomplete, last acting [13,9,5]
pg 4.1b7 is stuck inactive for 117897.965128, current state incomplete, last acting [10,12,6]
pg 4.2d0 is stuck inactive since forever, current state incomplete, last acting [3,12,7]
pg 2.3f5 is stuck inactive since forever, current state incomplete, last acting [16,11]
pg 2.2e is stuck inactive for 115502.904557, current state incomplete, last acting [11,16]
pg 4.391 is stuck inactive since forever, current state incomplete, last acting [2,10,13]
pg 2.212 is stuck inactive for 115464.292511, current state down+incomplete, last acting [11,15]
pg 4.8c is stuck inactive for 115431.950594, current state incomplete, last acting [11,16,13]
pg 2.84 is stuck inactive since forever, current state down+incomplete, last acting [0,11]
pg 2.23 is stuck unclean for 115434.708856, current state active+recovering+undersized+degraded+remapped, last acting [5]
pg 4.2c3 is stuck unclean for 117897.973281, current state incomplete, last acting [11,13,9]
pg 2.201 is stuck unclean since forever, current state incomplete, last acting [16,8]
pg 2.384 is stuck unclean since forever, current state incomplete, last acting [14,13]
pg 2.dd is stuck unclean since forever, current state down+incomplete, last acting [9,16]
pg 2.262 is stuck unclean for 115462.365633, current state incomplete, last acting [10,7]
pg 4.138 is stuck unclean for 117897.966268, current state incomplete, last acting [10,12,0]
pg 2.383 is stuck unclean for 115322.222352, current state incomplete, last acting [10,16]
pg 2.2bc is stuck unclean since forever, current state down+incomplete, last acting [4,11]
pg 2.2be is stuck unclean for 115321.894026, current state active+recovering+undersized+degraded+remapped, last acting [8]
pg 4.37a is stuck unclean since forever, current state incomplete, last acting [11,13,7]
pg 2.77 is stuck unclean since forever, current state down+incomplete, last acting [0,5]
pg 2.319 is stuck unclean for 116624.075199, current state incomplete, last acting [10,9]
pg 2.3df is stuck unclean for 115842.671348, current state incomplete, last acting [13,2]
pg 2.12d is stuck unclean for 115434.715198, current state active+recovery_wait+undersized+degraded+remapped, last acting [5]
pg 4.c7 is stuck unclean since forever, current state down+incomplete, last acting [7,10,12]
pg 4.120 is stuck unclean for 117897.972267, current state down+incomplete, last acting [11,14,9]
pg 2.126 is stuck unclean since forever, current state down+incomplete, last acting [4,15]
pg 2.62 is stuck unclean since forever, current state down+incomplete, last acting [0,12]
pg 4.c5 is stuck unclean for 115475.090411, current state incomplete, last acting [13,10,3]
pg 2.1e1 is stuck unclean since forever, current state down+incomplete, last acting [14,10]
pg 4.b8 is stuck unclean since forever, current state incomplete, last acting [9,13,10]
pg 2.1de is stuck unclean for 116357.555208, current state incomplete, last acting [13,5]
pg 4.3c5 is stuck unclean for 115475.090272, current state incomplete, last acting [13,11,7]
pg 2.bb is stuck unclean since forever, current state incomplete, last acting [8,13]
pg 4.50 is stuck unclean for 117897.971122, current state incomplete, last acting [10,6,9]
pg 4.170 is stuck unclean since forever, current state incomplete, last acting [15,12,9]
pg 4.292 is stuck unclean for 117897.973522, current state incomplete, last acting [11,6,9]
pg 2.35a is stuck unclean since forever, current state down+incomplete, last acting [6,11]
pg 2.52 is stuck unclean since forever, current state incomplete, last acting [4,13]
pg 2.291 is stuck unclean for 115687.844694, current state active+recovering+undersized+degraded+remapped, last acting [6]
pg 2.168 is stuck unclean for 115497.769029, current state active+recovering+undersized+degraded+remapped, last acting [7]
pg 4.163 is stuck unclean for 117897.972707, current state incomplete, last acting [11,6,0]
pg 2.227 is stuck unclean for 116357.556846, current state incomplete, last acting [13,7]
pg 2.226 is stuck unclean for 117897.983709, current state incomplete, last acting [13,7]
pg 4.1c6 is stuck unclean for 117897.965780, current state incomplete, last acting [10,7,13]
pg 2.344 is stuck unclean since forever, current state down+incomplete, last acting [8,4]
pg 2.2e0 is stuck unclean since forever, current state incomplete, last acting [2,13]
pg 2.fe is stuck unclean for 115848.766430, current state down+incomplete, last acting [10,4]
pg 4.159 is stuck unclean for 115475.089122, current state incomplete, last acting [13,5,11]
pg 2.3a3 is stuck unclean since forever, current state incomplete, last acting [3,16]
pg 2.35 is stuck unclean since forever, current state incomplete, last acting [15,11]
pg 2.9a is stuck unclean since forever, current state incomplete, last acting [0,12]
pg 4.9c is stuck unclean for 117897.970027, current state incomplete, last acting [10,16,9]
pg 4.21c is stuck unclean for 115475.089821, current state incomplete, last acting [13,9,5]
pg 4.1b7 is stuck unclean for 117897.965380, current state incomplete, last acting [10,12,6]
pg 4.2d0 is stuck unclean since forever, current state incomplete, last acting [3,12,7]
pg 2.3f5 is stuck unclean since forever, current state incomplete, last acting [16,11]
pg 4.391 is stuck unclean since forever, current state incomplete, last acting [2,10,13]
pg 2.2e is stuck unclean for 115502.904813, current state incomplete, last acting [11,16]
pg 2.212 is stuck unclean for 115464.292764, current state down+incomplete, last acting [11,15]
pg 4.8c is stuck unclean for 115431.950849, current state incomplete, last acting [11,16,13]
pg 2.84 is stuck unclean since forever, current state down+incomplete, last acting [0,11]
pg 2.23 is stuck undersized for 69192.080083, current state active+recovering+undersized+degraded+remapped, last acting [5]
pg 2.2be is stuck undersized for 68995.608701, current state active+recovering+undersized+degraded+remapped, last acting [8]
pg 2.12d is stuck undersized for 69191.288208, current state active+recovery_wait+undersized+degraded+remapped, last acting [5]
pg 2.291 is stuck undersized for 71577.250338, current state active+recovering+undersized+degraded+remapped, last acting [6]
pg 2.168 is stuck undersized for 68997.206947, current state active+recovering+undersized+degraded+remapped, last acting [7]
pg 2.23 is stuck degraded for 69192.080139, current state active+recovering+undersized+degraded+remapped, last acting [5]
pg 2.2be is stuck degraded for 68995.608757, current state active+recovering+undersized+degraded+remapped, last acting [8]
pg 2.12d is stuck degraded for 69191.288264, current state active+recovery_wait+undersized+degraded+remapped, last acting [5]
pg 2.291 is stuck degraded for 71577.250394, current state active+recovering+undersized+degraded+remapped, last acting [6]
pg 2.168 is stuck degraded for 68997.207003, current state active+recovering+undersized+degraded+remapped, last acting [7]
pg 5.140 is stuck stale for 112497.929357, current state stale+active+clean, last acting [1]
pg 5.2c1 is stuck stale for 112497.928262, current state stale+active+clean, last acting [1]
pg 5.3e2 is stuck stale for 112497.928778, current state stale+active+clean, last acting [1]
pg 5.da is stuck stale for 112497.929205, current state stale+active+clean, last acting [1]
pg 5.326 is stuck stale for 112497.928441, current state stale+active+clean, last acting [1]
pg 5.198 is stuck stale for 112497.929523, current state stale+active+clean, last acting [1]
pg 5.25b is stuck stale for 112497.928113, current state stale+active+clean, last acting [1]
pg 5.2bc is stuck stale for 112497.928270, current state stale+active+clean, last acting [1]
pg 5.a is stuck stale for 112497.928867, current state stale+active+clean, last acting [1]
pg 5.251 is stuck stale for 112497.928103, current state stale+active+clean, last acting [1]
pg 5.18a is stuck stale for 112497.929515, current state stale+active+clean, last acting [1]
pg 5.6e is stuck stale for 112497.929038, current state stale+active+clean, last acting [1]
pg 5.24a is stuck stale for 112497.928103, current state stale+active+clean, last acting [1]
pg 5.188 is stuck stale for 112497.929514, current state stale+active+clean, last acting [1]
pg 5.1e9 is stuck stale for 112497.927920, current state stale+active+clean, last acting [1]
pg 5.18e is stuck stale for 112497.929518, current state stale+active+clean, last acting [1]
pg 5.6 is stuck stale for 112497.928876, current state stale+active+clean, last acting [1]
pg 5.24d is stuck stale for 112497.928105, current state stale+active+clean, last acting [1]
pg 5.2ac is stuck stale for 112497.928281, current state stale+active+clean, last acting [1]
pg 5.30c is stuck stale for 112497.928448, current state stale+active+clean, last acting [1]
pg 5.67 is stuck stale for 112497.929051, current state stale+active+clean, last acting [1]
pg 5.2a2 is stuck stale for 112497.928286, current state stale+active+clean, last acting [1]
pg 5.36d is stuck stale for 112497.928619, current state stale+active+clean, last acting [1]
pg 5.185 is stuck stale for 112497.929525, current state stale+active+clean, last acting [1]
pg 5.361 is stuck stale for 112497.928616, current state stale+active+clean, last acting [1]
pg 5.59 is stuck stale for 112497.929053, current state stale+active+clean, last acting [1]
pg 5.be is stuck stale for 112497.929220, current state stale+active+clean, last acting [1]
pg 5.11e is stuck stale for 112497.929371, current state stale+active+clean, last acting [1]
pg 5.23e is stuck stale for 112497.928109, current state stale+active+clean, last acting [1]
pg 5.3bb is stuck stale for 112497.928790, current state stale+active+clean, last acting [1]
pg 5.b3 is stuck stale for 112497.929219, current state stale+active+clean, last acting [1]
pg 5.29f is stuck stale for 112497.928298, current state stale+active+clean, last acting [1]
pg 5.1d2 is stuck stale for 112497.927940, current state stale+active+clean, last acting [1]
pg 5.29c is stuck stale for 112497.928301, current state stale+active+clean, last acting [1]
pg 5.56 is stuck stale for 112497.929062, current state stale+active+clean, last acting [1]
pg 5.b6 is stuck stale for 112497.929227, current state stale+active+clean, last acting [1]
pg 5.35c is stuck stale for 112497.928636, current state stale+active+clean, last acting [1]
pg 5.3bf is stuck stale for 112497.928803, current state stale+active+clean, last acting [1]
pg 5.54 is stuck stale for 112497.929069, current state stale+active+clean, last acting [1]
pg 5.236 is stuck stale for 112497.928126, current state stale+active+clean, last acting [1]
pg 5.22a is stuck stale for 112497.928123, current state stale+active+clean, last acting [1]
pg 5.2f4 is stuck stale for 112497.928467, current state stale+active+clean, last acting [1]
pg 5.3b7 is stuck stale for 112497.928806, current state stale+active+clean, last acting [1]
pg 5.169 is stuck stale for 112497.929566, current state stale+active+clean, last acting [1]
pg 5.34a is stuck stale for 112497.928639, current state stale+active+clean, last acting [1]
pg 5.3aa is stuck stale for 112497.928810, current state stale+active+clean, last acting [1]
pg 5.16c is stuck stale for 112497.929569, current state stale+active+clean, last acting [1]
pg 5.40 is stuck stale for 112497.929073, current state stale+active+clean, last acting [1]
pg 5.34e is stuck stale for 112497.928646, current state stale+active+clean, last acting [1]
pg 5.1c1 is stuck stale for 112497.927958, current state stale+active+clean, last acting [1]
pg 5.3a is stuck stale for 112497.929075, current state stale+active+clean, last acting [1]
pg 5.167 is stuck stale for 112497.929569, current state stale+active+clean, last acting [1]
pg 5.1c4 is stuck stale for 112497.927958, current state stale+active+clean, last acting [1]
pg 5.15e is stuck stale for 112497.929565, current state stale+active+clean, last acting [1]
pg 5.2de is stuck stale for 112497.928468, current state stale+active+clean, last acting [1]
pg 5.398 is stuck stale for 112497.928803, current state stale+active+clean, last acting [1]
pg 5.33e is stuck stale for 112497.928640, current state stale+active+clean, last acting [1]
pg 5.212 is stuck stale for 112497.928122, current state stale+active+clean, last acting [1]
pg 5.f1 is stuck stale for 112497.929399, current state stale+active+clean, last acting [1]
pg 5.39f is stuck stale for 112497.928803, current state stale+active+clean, last acting [1]
pg 5.273 is stuck stale for 112497.928320, current state stale+active+clean, last acting [1]
pg 5.332 is stuck stale for 112497.928646, current state stale+active+clean, last acting [1]
pg 5.8a is stuck stale for 112497.929249, current state stale+active+clean, last acting [1]
pg 5.1b4 is stuck stale for 112497.927948, current state stale+active+clean, last acting [1]
pg 5.217 is stuck stale for 112497.928136, current state stale+active+clean, last acting [1]
pg 5.2d1 is stuck stale for 112497.928480, current state stale+active+clean, last acting [1]
pg 5.29 is stuck stale for 112497.929084, current state stale+active+clean, last acting [1]
pg 5.336 is stuck stale for 112497.928653, current state stale+active+clean, last acting [1]
pg 5.89 is stuck stale for 112497.929263, current state stale+active+clean, last acting [1]
pg 5.394 is stuck stale for 112497.928824, current state stale+active+clean, last acting [1]
pg 5.14c is stuck stale for 112497.929577, current state stale+active+clean, last acting [1]
pg 5.26e is stuck stale for 112497.928335, current state stale+active+clean, last acting [1]
pg 5.20c is stuck stale for 112497.928138, current state stale+active+clean, last acting [1]
pg 5.2ce is stuck stale for 112497.928490, current state stale+active+clean, last acting [1]
pg 5.3eb is stuck stale for 112497.929006, current state stale+active+clean, last acting [1]
pg 5.142 is stuck stale for 112497.929588, current state stale+active+clean, last acting [1]
pg 4.170 is incomplete, acting [15,12,9]
pg 2.168 is active+recovering+undersized+degraded+remapped, acting [7], 1 unfound
pg 4.163 is incomplete, acting [11,6,0]
pg 4.159 is incomplete, acting [13,5,11]
pg 4.138 is incomplete, acting [10,12,0]
pg 2.12d is active+recovery_wait+undersized+degraded+remapped, acting [5], 1 unfound
pg 2.126 is down+incomplete, acting [4,15]
pg 4.120 is down+incomplete, acting [11,14,9]
pg 2.fe is down+incomplete, acting [10,4]
pg 2.dd is down+incomplete, acting [9,16]
pg 4.c7 is down+incomplete, acting [7,10,12]
pg 4.c5 is incomplete, acting [13,10,3]
pg 4.b8 is incomplete, acting [9,13,10]
pg 2.bb is incomplete, acting [8,13]
pg 4.9c is incomplete, acting [10,16,9]
pg 2.9a is incomplete, acting [0,12]
pg 4.8c is incomplete, acting [11,16,13]
pg 2.84 is down+incomplete, acting [0,11]
pg 2.77 is down+incomplete, acting [0,5]
pg 2.62 is down+incomplete, acting [0,12]
pg 4.50 is incomplete, acting [10,6,9]
pg 2.52 is incomplete, acting [4,13]
pg 2.35 is incomplete, acting [15,11]
pg 2.2e is incomplete, acting [11,16]
pg 2.23 is active+recovering+undersized+degraded+remapped, acting [5], 1 unfound
pg 2.3f5 is incomplete, acting [16,11]
pg 2.3df is incomplete, acting [13,2]
pg 4.3c5 is incomplete, acting [13,11,7]
pg 2.3a3 is incomplete, acting [3,16]
pg 4.391 is incomplete, acting [2,10,13]
pg 2.384 is incomplete, acting [14,13]
pg 2.383 is incomplete, acting [10,16]
pg 4.37a is incomplete, acting [11,13,7]
pg 2.35a is down+incomplete, acting [6,11]
pg 2.344 is down+incomplete, acting [8,4]
pg 2.319 is incomplete, acting [10,9]
pg 2.2e0 is incomplete, acting [2,13]
pg 4.2d0 is incomplete, acting [3,12,7]
pg 4.2c3 is incomplete, acting [11,13,9]
pg 2.2bc is down+incomplete, acting [4,11]
pg 2.2be is active+recovering+undersized+degraded+remapped, acting [8], 1 unfound
pg 4.292 is incomplete, acting [11,6,9]
pg 2.291 is active+recovering+undersized+degraded+remapped, acting [6], 1 unfound
pg 2.262 is incomplete, acting [10,7]
pg 2.227 is incomplete, acting [13,7]
pg 2.226 is incomplete, acting [13,7]
pg 4.21c is incomplete, acting [13,9,5]
pg 2.212 is down+incomplete, acting [11,15]
pg 2.201 is incomplete, acting [16,8]
pg 2.1e1 is down+incomplete, acting [14,10]
pg 2.1de is incomplete, acting [13,5]
pg 4.1c6 is incomplete, acting [10,7,13]
pg 4.1b7 is incomplete, acting [10,12,6]
2 ops are blocked > 134218 sec
2 ops are blocked > 16777.2 sec
303 ops are blocked > 8388.61 sec
95 ops are blocked > 4194.3 sec
2 ops are blocked > 16777.2 sec on osd.2
4 ops are blocked > 8388.61 sec on osd.2
1 ops are blocked > 134218 sec on osd.7
4 ops are blocked > 8388.61 sec on osd.7
5 ops are blocked > 8388.61 sec on osd.9
95 ops are blocked > 4194.3 sec on osd.9
100 ops are blocked > 8388.61 sec on osd.10
1 ops are blocked > 134218 sec on osd.11
99 ops are blocked > 8388.61 sec on osd.11
91 ops are blocked > 8388.61 sec on osd.13
6 osds have slow requests
recovery 14656/6951979 objects degraded (0.211%)
recovery 20585/6951979 objects misplaced (0.296%)
recovery 5/3348270 unfound (0.000%)
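
The health detail above is long, so it can help to tally the stuck PGs per OSD. What follows is a minimal Python sketch and not part of the original output; the regular expression and the function name tally_stuck are illustrative assumptions based only on the line format shown above. Run over the full listing, it should make it easy to see that every stuck-stale PG reports [1] as its last acting set, while osd.1 does not appear in the OSD tree.

```python
import re
from collections import Counter

# Matches lines such as:
#   pg 5.140 is stuck stale for 112497.929357, current state stale+active+clean, last acting [1]
#   pg 2.201 is stuck inactive since forever, current state incomplete, last acting [16,8]
STUCK_RE = re.compile(
    r"^pg (?P<pgid>\S+) is stuck (?P<kind>\w+) .*current state (?P<state>\S+), "
    r"last acting \[(?P<acting>[\d,]*)\]"
)


def tally_stuck(lines):
    """Count stuck PGs per (stuck kind, OSD id) from `ceph health detail` text."""
    counts = Counter()
    for line in lines:
        m = STUCK_RE.match(line.strip())
        if not m:
            continue  # skip summary lines, blocked-op lines, etc.
        for osd in m.group("acting").split(","):
            if osd:
                counts[(m.group("kind"), int(osd))] += 1
    return counts


# Three lines copied from the listing above; in practice, feed in the whole output.
sample = [
    "pg 5.140 is stuck stale for 112497.929357, current state stale+active+clean, last acting [1]",
    "pg 5.2c1 is stuck stale for 112497.928262, current state stale+active+clean, last acting [1]",
    "pg 2.201 is stuck inactive since forever, current state incomplete, last acting [16,8]",
]

counts = tally_stuck(sample)
for (kind, osd), n in sorted(counts.items()):
    print(f"stuck {kind:<10} osd.{osd:<3} {n}")
```

To use it on a live cluster, the saved output of `ceph health detail` could be read from a file and passed in line by line.
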
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 device1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host zulu {
	id -2		# do not change unnecessarily
	# weight 3.640
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 3.640
}
host yankee {
	id -3		# do not change unnecessarily
	# weight 0.000
	alg straw
	hash 0	# rjenkins1
}
host xray {
	id -4		# do not change unnecessarily
	# weight 3.640
	alg straw
	hash 0	# rjenkins1
	item osd.2 weight 1.820
	item osd.3 weight 1.820
}
host whiskey {
	id -5		# do not change unnecessarily
	# weight 7.280
	alg straw
	hash 0	# rjenkins1
	item osd.4 weight 3.640
	item osd.5 weight 3.640
}
host victor {
	id -6		# do not change unnecessarily
	# weight 7.280
	alg straw
	hash 0	# rjenkins1
	item osd.6 weight 3.640
	item osd.7 weight 3.640
}
host sierra {
	id -7		# do not change unnecessarily
	# weight 7.280
	alg straw
	hash 0	# rjenkins1
	item osd.8 weight 3.640
	item osd.9 weight 3.640
}
host uniform {
	id -8		# do not change unnecessarily
	# weight 7.280
	alg straw
	hash 0	# rjenkins1
	item osd.10 weight 3.640
	item osd.11 weight 3.640
}
host alpha {
	id -9		# do not change unnecessarily
	# weight 5.440
	alg straw
	hash 0	# rjenkins1
	item osd.12 weight 1.810
	item osd.13 weight 3.630
}
host bravo {
	id -10		# do not change unnecessarily
	# weight 3.620
	alg straw
	hash 0	# rjenkins1
	item osd.14 weight 1.810
	item osd.15 weight 1.810
}
host charlie {
	id -11		# do not change unnecessarily
	# weight 3.630
	alg straw
	hash 0	# rjenkins1
	item osd.16 weight 3.630
}
root default {
	id -1		# do not change unnecessarily
	# weight 49.090
	alg straw
	hash 0	# rjenkins1
	item zulu weight 3.640
	item yankee weight 0.000
	item xray weight 3.640
	item whiskey weight 7.280
	item victor weight 7.280
	item sierra weight 7.280
	item uniform weight 7.280
	item alpha weight 5.440
	item bravo weight 3.620
	item charlie weight 3.630
}

# rules
rule replicated_ruleset {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}

# end crush map


-- 

JOHN BLACKWOOD
Chief Technical Officer
P: 905 444 9166
F: 905 668 8778
jb@xxxxxxxxxxxxxxxxxx
www.kaiinnovations.com
Ontario • Manitoba






On Jun 10, 2016, at 4:25 PM, John Blackwood <jb@xxxxxxxxxxxxxxxxxx> wrote:


We're looking for some assistance recovering data from a failed Ceph cluster, or some help determining whether it is even possible to recover any data.

Background:
- We were using Ceph with Proxmox following the instructions Proxmox provides (https://pve.proxmox.com/wiki/Ceph_Server), which seem fairly close to the Ceph recommendations except that the storage is on the same physical systems that the virtual machines run on.
- Some of our Proxmox nodes use ZFS, and there is a rare bug where ZFS + Proxmox clustering can result in Proxmox hanging indefinitely.
- We were using HA on our Proxmox nodes, which means that when they hang, they are rebooted (hard) automatically.
- Hard reboots are bad for file systems.
- Hard reboots mean that Ceph tries to recover, which means more systems hitting the bug, followed by more system restarts and general mayhem.

We first ran into issues overnight, and at some point during the process one of the file systems on an OSD was corrupted. We managed to stabilize the systems; however, we've not been able to recover the critical data from the pool (about 5-10%).

Current cluster health:
    cluster 537a3e12-95d8-48c3-9e82-91abbfdf62e0
     health HEALTH_WARN
            5 pgs degraded
            8 pgs down
            48 pgs incomplete
            3 pgs recovering
            1 pgs recovery_wait
            76 pgs stale
            5 pgs stuck degraded
            48 pgs stuck inactive
            76 pgs stuck stale
            53 pgs stuck unclean
            5 pgs stuck undersized
            5 pgs undersized
            74 requests are blocked > 32 sec
            recovery 14656/6951979 objects degraded (0.211%)
            recovery 20585/6951979 objects misplaced (0.296%)
            recovery 5/3348270 unfound (0.000%)
     monmap e7: 7 mons at {0=10.11.0.126:6789/0,1=10.11.0.125:6789/0,2=10.11.0.124:6789/0,3=10.11.0.123:6789/0,4=10.11.0.122:6789/0,5=10.11.0.119:6789/0,6=10.11.0.121:6789/0}
            election epoch 482, quorum 0,1,2,3,4,5,6 5,6,4,3,2,1,0
     osdmap e15746: 16 osds: 16 up, 16 in; 5 remapped pgs
      pgmap v10200890: 3072 pgs, 3 pools, 12914 GB data, 3269 kobjects
            26923 GB used, 23327 GB / 50250 GB avail
            14656/6951979 objects degraded (0.211%)
            20585/6951979 objects misplaced (0.296%)
            5/3348270 unfound (0.000%)
                2943 active+clean
                  76 stale+active+clean
                  40 incomplete
                   8 down+incomplete
                   3 active+recovering+undersized+degraded+remapped
                   1 active+recovery_wait+undersized+degraded+remapped
                   1 active+undersized+degraded+remapped
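
As a quick cross-check of the status above, the per-state PG counts can be summed and grouped by flag. This is only a small Python sketch using the numbers copied from the pgmap summary; the helper name count_with is an illustrative assumption.

```python
# PG state counts copied from the pgmap summary above.
pg_states = {
    "active+clean": 2943,
    "stale+active+clean": 76,
    "incomplete": 40,
    "down+incomplete": 8,
    "active+recovering+undersized+degraded+remapped": 3,
    "active+recovery_wait+undersized+degraded+remapped": 1,
    "active+undersized+degraded+remapped": 1,
}
total_pgs = 3072  # "3072 pgs, 3 pools" in the pgmap line


def count_with(flag):
    """Number of PGs whose state string contains the given flag."""
    return sum(n for state, n in pg_states.items() if flag in state.split("+"))


# The individual states should account for every PG in the cluster.
assert sum(pg_states.values()) == total_pgs

print("incomplete:", count_with("incomplete"))  # 40 + 8 = 48
print("down:      ", count_with("down"))        # 8
print("stale:     ", count_with("stale"))       # 76
print("degraded:  ", count_with("degraded"))    # 3 + 1 + 1 = 5

# PGs that cannot currently serve I/O: incomplete or stale.
blocked = count_with("incomplete") + count_with("stale")
print(f"incomplete or stale: {blocked} of {total_pgs} ({blocked / total_pgs:.1%})")
```

The totals agree with the health summary lines (48 incomplete, 8 down, 76 stale, 5 degraded).
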

There are two RBDs which we are looking to recover (out of about 130), totalling about 200 GB of data. Those RBDs do not appear to be using any of the PGs which are incomplete or down, but do seem to use ones which are stale+active+clean, so if we read from the mapped RBD it blocks indefinitely.
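
For reference, one way to check which PGs an RBD touches is to enumerate its data object names and look each one up with `ceph osd map <pool> <object>`. The Python sketch below only generates the names and the lookup commands; it assumes the usual naming scheme (block_name_prefix from `rbd info`, followed by the object number in hex, 16 digits for format 2 images and 12 for format 1) and the default 4 MiB object size (order 22). The pool name "rbd" and the prefix "rbd_data.abc123" are made-up placeholders, not values from this cluster.

```python
import math


def rbd_object_names(block_name_prefix, size_bytes, order=22):
    """Yield the data object names an RBD image of this size could use.

    order is the object size exponent reported by `rbd info`
    (22 means 4 MiB objects). Images are sparse, so some of these
    objects may not actually exist in the pool.
    """
    object_size = 1 << order
    count = math.ceil(size_bytes / object_size)
    # Format 2 prefixes look like rbd_data.<id> (16 hex digits per object);
    # format 1 prefixes look like rb.0.<id> (12 hex digits per object).
    width = 16 if block_name_prefix.startswith("rbd_data.") else 12
    for i in range(count):
        yield f"{block_name_prefix}.{i:0{width}x}"


# Hypothetical 10 MiB image, just to keep the output short.
names = list(rbd_object_names("rbd_data.abc123", 10 * 2**20))
for name in names:
    # Placeholder pool name; the command prints the PG and acting set for the object.
    print(f"ceph osd map rbd {name}")
```

Collecting the PG ids from those lookups and comparing them against the stuck PG list would confirm whether an image depends on any stale or incomplete PG.
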

We were looking at http://ceph.com/community/incomplete-pgs-oh-my/ as a means of recovering the incomplete PGs, as it does seem that the complete copies are on the corrupted OSD, and most or all could be exported without issue. However, I'm not sure if this is the correct way to go or if I should be looking at something else.
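
To make the export/import approach concrete, here is a rough Python sketch that only builds the ceph-objectstore-tool command lines; it does not run anything. The OSD ids, the PG id and the file path are illustrative placeholders, the data and journal paths assume the default /var/lib/ceph/osd/<cluster>-<id> layout, and on older releases the binary is named ceph_objectstore_tool. The OSD in question has to be stopped while the tool runs.

```python
import shlex


def _osd_paths(osd_id, cluster="ceph"):
    """Default data and journal paths for an OSD (assumed layout)."""
    data = f"/var/lib/ceph/osd/{cluster}-{osd_id}"
    return data, f"{data}/journal"


def export_pg_cmd(osd_id, pgid, out_file, cluster="ceph"):
    """Command to export one PG from a stopped OSD into a file."""
    data, journal = _osd_paths(osd_id, cluster)
    return [
        "ceph-objectstore-tool",
        "--op", "export",
        "--pgid", pgid,
        "--data-path", data,
        "--journal-path", journal,
        "--file", out_file,
    ]


def import_pg_cmd(osd_id, in_file, cluster="ceph"):
    """Command to import a previously exported PG into a stopped OSD."""
    data, journal = _osd_paths(osd_id, cluster)
    return [
        "ceph-objectstore-tool",
        "--op", "import",
        "--data-path", data,
        "--journal-path", journal,
        "--file", in_file,
    ]


# Placeholder values: source OSD 1, target OSD 16, PG 2.201.
export_cmd = export_pg_cmd(1, "2.201", "/root/pg-export/2.201.export")
import_cmd = import_pg_cmd(16, "/root/pg-export/2.201.export")
print(shlex.join(export_cmd))
print(shlex.join(import_cmd))
```

Printing the commands first, rather than executing them, leaves room to review each one before touching an OSD.
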









_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
