Re: Help Basically..

The problem is that `ceph-osd --flush-journal` never ran successfully against the old SSD journal drive. All of the OSDs that used the dead journal need to be removed from the cluster, wiped, and added back in. The data on them is not 100% consistent because the old journal died: any write that made it to the journal but not to the disk is lost.

On top of that, because you chose to run with replica size = 2 and min_size = 1, anything that goes wrong in your cluster is very dangerous for your data. Seeing as you had 2 nodes fail near each other, there is a very real possibility that you will have some data loss from this.

Regardless, your first step is to remove the OSDs that were on the failed journal. They are poison in your cluster.
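
For reference, here is a minimal dry-run sketch of the usual manual removal sequence for one OSD on a pre-systemd release like this one. It only prints the commands so they can be reviewed before anything is run. The function name `print_osd_removal` and the ids 20 and 21 are placeholders, not taken from this cluster, and the `service ceph stop` line assumes the same init style as the `service ceph start osd.20` used elsewhere in this thread.

```shell
#!/bin/sh
# Dry run: print the manual removal sequence for one OSD id.
# Nothing here touches the cluster; review the output, then run
# the printed lines by hand.
print_osd_removal() {
    id="$1"
    echo "ceph osd out ${id}"                 # stop mapping data to the OSD
    echo "service ceph stop osd.${id}"        # run on the node hosting the OSD
    echo "ceph osd crush remove osd.${id}"    # drop it from the CRUSH map
    echo "ceph auth del osd.${id}"            # delete its cephx key
    echo "ceph osd rm ${id}"                  # remove it from the OSD map
}

# Placeholder ids; substitute the OSDs that used the failed journal.
for id in 20 21; do
    print_osd_removal "$id"
done
```

Wiping the disks and re-adding the OSDs is a separate step and is not covered by this sketch.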

On Sun, Sep 2, 2018, 10:51 AM Lee <lquince@xxxxxxxxx> wrote:
I followed:

$ journal_uuid=$(sudo cat /var/lib/ceph/osd/ceph-0/journal_uuid)
$ sudo sgdisk --new=1:0:+20480M --change-name=1:'ceph journal' --partition-guid=1:$journal_uuid --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sdk
Then
$ sudo ceph-osd --mkjournal -i 20
$ sudo service ceph start osd.20
From https://www.sebastien-han.fr/blog/2014/11/27/ceph-recover-osds-after-ssd-journal-failure/
After that, they all started without a problem.

On Sun, 2 Sep 2018 at 15:43, David Turner <drakonstein@xxxxxxxxx> wrote:
It looks like osds on the first failed node are having problems. What commands did you run to bring it back online?

On Sun, Sep 2, 2018, 10:27 AM Lee <lquince@xxxxxxxxx> wrote:
Ok I have a lot in the health detail...

root@node31-a4:~# ceph health detail
HEALTH_ERR 64 pgs backfill; 27 pgs backfill_toofull; 39 pgs backfilling; 26 pgs degraded; 4 pgs down; 31 pgs incomplete; 1 pgs inconsistent; 12 pgs recovery_wait; 1 pgs stale; 26 pgs stuck degraded; 31 pgs stuck inactive; 1 pgs stuck stale; 161 pgs stuck unclean; 9 pgs stuck undersized; 9 pgs undersized; 726 requests are blocked > 32 sec; 9 osds have slow requests; recovery 59636/5032695 objects degraded (1.185%); recovery 1280976/5032695 objects misplaced (25.453%); 1 scrub errors; noscrub,nodeep-scrub flag(s) set
pg 2.2a is stuck inactive for 97629.478505, current state incomplete, last acting [24,5]
pg 2.b0 is stuck inactive for 98000.688979, current state incomplete, last acting [24,7]
pg 9.42 is stuck inactive for 108836.103738, current state incomplete, last acting [31,12]
pg 9.de is stuck inactive since forever, current state incomplete, last acting [6,5]
pg 2.75 is stuck inactive since forever, current state down+incomplete, last acting [7,15]
pg 9.dc is stuck inactive for 113491.800208, current state incomplete, last acting [6,7]
pg 2.74 is stuck inactive for 97658.382960, current state incomplete, last acting [13,5]
pg 9.1e is stuck inactive since forever, current state incomplete, last acting [7,15]
pg 2.15 is stuck inactive since forever, current state incomplete, last acting [7,31]
pg 11.1c is stuck inactive since forever, current state down+incomplete, last acting [6,7]
pg 2.a1 is stuck inactive for 98785.888826, current state incomplete, last acting [14,12]
pg 9.d8 is stuck inactive for 115082.575098, current state down+incomplete, last acting [21,5]
pg 9.a8 is stuck inactive for 118575.035210, current state incomplete, last acting [14,7]
pg 9.78 is stuck inactive since forever, current state incomplete, last acting [5,24]
pg 2.a2 is stuck inactive since forever, current state incomplete, last acting [5,13]
pg 7.16 is stuck inactive since forever, current state incomplete, last acting [6,7]
pg 2.13 is stuck inactive since forever, current state incomplete, last acting [7,10]
pg 9.f5 is stuck inactive for 103009.439003, current state incomplete, last acting [18,5]
pg 2.d is stuck inactive since forever, current state incomplete, last acting [5,10]
pg 9.5 is stuck inactive since forever, current state incomplete, last acting [5,18]
pg 9.3 is stuck inactive since forever, current state incomplete, last acting [7,15]
pg 9.fc is stuck inactive for 201476.092908, current state incomplete, last acting [13,5]
pg 11.33 is stuck inactive since forever, current state down+incomplete, last acting [7,6]
pg 9.3f is stuck inactive since forever, current state incomplete, last acting [5,14]
pg 9.a is stuck inactive for 113328.467457, current state incomplete, last acting [18,7]
pg 2.63 is stuck inactive for 97665.176520, current state incomplete, last acting [31,7]
pg 2.3 is stuck inactive for 97655.279670, current state incomplete, last acting [14,5]
pg 2.32 is stuck inactive since forever, current state incomplete, last acting [5,13]
pg 2.bf is stuck inactive for 99913.875808, current state incomplete, last acting [15,7]
pg 9.26 is stuck inactive since forever, current state incomplete, last acting [5,24]
pg 9.22 is stuck inactive since forever, current state incomplete, last acting [7,24]
pg 9.25 is stuck unclean for 20091.777921, current state active+degraded+remapped+wait_backfill, last acting [15,2]
pg 7.2b is stuck unclean for 98830.660179, current state stale+active+undersized+degraded, last acting [5]
pg 11.27 is stuck unclean for 1777813.502308, current state active+remapped+wait_backfill+backfill_toofull, last acting [4,36]
pg 2.f1 is stuck unclean for 26585.481715, current state active+recovery_wait+degraded, last acting [13,8]
pg 9.22 is stuck unclean since forever, current state incomplete, last acting [7,24]
pg 2.29 is stuck unclean for 5629.190514, current state active+remapped+wait_backfill, last acting [24,40]
pg 9.fb is stuck unclean for 3640.777545, current state active+remapped+wait_backfill, last acting [8,39]
pg 9.23 is stuck unclean for 3595.306511, current state active+remapped+wait_backfill, last acting [35,9]
pg 2.f3 is stuck unclean for 4993.558900, current state active+remapped+wait_backfill, last acting [6,9]
pg 2.f2 is stuck unclean for 8871.835444, current state active+recovery_wait+degraded, last acting [6,4]
pg 2.2a is stuck unclean for 97629.478922, current state incomplete, last acting [24,5]
pg 2.ed is stuck unclean for 3595.395657, current state active+remapped+backfilling, last acting [9,40]
pg 2.24 is stuck unclean for 6391.873856, current state active+remapped+wait_backfill, last acting [13,40]
pg 2.27 is stuck unclean for 6814.809178, current state active+recovery_wait+degraded, last acting [13,3]
pg 2.e8 is stuck unclean for 11759.373756, current state active+remapped+wait_backfill, last acting [15,36]
pg 11.29 is stuck unclean for 6907.684021, current state active+remapped+wait_backfill, last acting [14,40]
pg 2.eb is stuck unclean for 14474.951608, current state active+remapped+backfilling, last acting [0,31]
pg 2.ea is stuck unclean for 3595.396597, current state active+remapped+backfilling, last acting [9,34]
pg 12.13 is stuck unclean for 5629.177184, current state active+remapped, last acting [8,31]
pg 2.1d is stuck unclean for 12245.891518, current state active+remapped+backfilling, last acting [3,6]
pg 11.15 is stuck unclean for 14683.173113, current state active+remapped+wait_backfill+backfill_toofull, last acting [34,9]
pg 2.1c is stuck unclean for 14683.755228, current state active+degraded+remapped+backfilling, last acting [14,11]
pg 11.16 is stuck unclean for 5629.180301, current state active+remapped+wait_backfill, last acting [15,40]
pg 2.1f is stuck unclean for 11858.149360, current state active+remapped+wait_backfill, last acting [15,3]
pg 0.1c is stuck unclean for 6907.683196, current state active+remapped+wait_backfill, last acting [12,3]
pg 2.1e is stuck unclean for 102531.318993, current state active+undersized+degraded+remapped+backfilling, last acting [13]
pg 2.e0 is stuck unclean for 3571.898995, current state active+remapped+inconsistent+wait_backfill, last acting [6,9]
pg 2.18 is stuck unclean for 3502.358091, current state active+remapped+backfilling, last acting [18,9]
pg 2.e3 is stuck unclean for 12047.716242, current state active+remapped+backfilling, last acting [4,41]
pg 11.13 is stuck unclean for 6907.682681, current state active+remapped+wait_backfill, last acting [14,8]
pg 9.d6 is stuck unclean for 7416.596559, current state active+remapped+wait_backfill, last acting [1,9]
pg 9.1e is stuck unclean since forever, current state incomplete, last acting [7,15]
pg 11.1c is stuck unclean since forever, current state down+incomplete, last acting [6,7]
pg 2.15 is stuck unclean since forever, current state incomplete, last acting [7,31]
pg 2.dc is stuck unclean for 11709.774640, current state active+remapped+backfilling, last acting [40,4]
pg 2.14 is stuck unclean for 3504.589025, current state active+remapped+backfilling, last acting [18,9]
pg 2.df is stuck unclean for 5047.489499, current state active+remapped+wait_backfill, last acting [0,13]
pg 11.1e is stuck unclean for 1968924.322629, current state active+remapped+wait_backfill, last acting [3,38]
pg 2.de is stuck unclean for 97621.617826, current state active+undersized+degraded+remapped+backfilling, last acting [3]
pg 9.1d is stuck unclean for 48349.818420, current state active+remapped+backfill_toofull, last acting [12,36]
pg 3.17 is stuck unclean for 5629.187939, current state active+remapped, last acting [5,13]
pg 2.d8 is stuck unclean for 7418.583365, current state active+remapped+backfilling, last acting [21,41]
pg 7.15 is stuck unclean for 98830.449502, current state active+remapped+wait_backfill, last acting [13,2]
pg 11.19 is stuck unclean for 3925.828027, current state active+remapped+wait_backfill, last acting [15,38]
pg 2.db is stuck unclean for 3595.396853, current state active+remapped+backfilling, last acting [9,40]
pg 9.18 is stuck unclean for 27500.110917, current state active+remapped+backfill_toofull, last acting [18,13]
pg 7.16 is stuck unclean since forever, current state incomplete, last acting [6,7]
pg 2.13 is stuck unclean since forever, current state incomplete, last acting [7,10]
pg 9.de is stuck unclean since forever, current state incomplete, last acting [6,5]
pg 9.6 is stuck unclean for 219342.087677, current state active+remapped+backfill_toofull, last acting [2,41]
pg 2.d is stuck unclean since forever, current state incomplete, last acting [5,10]
pg 9.df is stuck unclean for 48360.843924, current state active+remapped+wait_backfill+backfill_toofull, last acting [35,2]
pg 8.6 is stuck unclean for 5629.183555, current state active+remapped, last acting [12,13]
pg 2.d7 is stuck unclean for 83782.680541, current state active+undersized+degraded+remapped+backfilling, last acting [36]
pg 9.dc is stuck unclean for 113491.800754, current state incomplete, last acting [6,7]
pg 7.a is stuck unclean for 3844.286529, current state active+remapped+wait_backfill, last acting [38,2]
pg 9.5 is stuck unclean since forever, current state incomplete, last acting [5,18]
pg 4.8 is stuck unclean for 3893.186289, current state active+recovery_wait+degraded, last acting [15,2]
pg 3.d0 is stuck unclean for 7418.584435, current state active+remapped+wait_backfill, last acting [12,2]
pg 2.d1 is stuck unclean for 83769.259615, current state active+undersized+degraded+remapped+backfill_toofull, last acting [36]
pg 9.3 is stuck unclean since forever, current state incomplete, last acting [7,15]
pg 9.d8 is stuck unclean for 115082.575647, current state down+incomplete, last acting [21,5]
pg 2.b is stuck unclean for 7418.564413, current state active+remapped+backfilling, last acting [40,24]
pg 9.d9 is stuck unclean for 14681.601684, current state active+remapped+wait_backfill+backfill_toofull, last acting [39,4]
pg 9.1 is stuck unclean for 3930.973909, current state active+remapped+wait_backfill+backfill_toofull, last acting [39,3]
pg 2.cc is stuck unclean for 5078.643356, current state active+remapped, last acting [40,24]
pg 11.d is stuck unclean for 14592.297817, current state active+remapped+wait_backfill+backfill_toofull, last acting [36,4]
pg 9.c5 is stuck unclean for 3844.281162, current state active+remapped+wait_backfill, last acting [5,38]
pg 9.a is stuck unclean for 113328.467988, current state incomplete, last acting [18,7]
pg 11.9 is stuck unclean for 7418.578072, current state active+remapped+wait_backfill, last acting [21,39]
pg 2.0 is stuck unclean for 97873.488751, current state active+undersized+degraded+remapped+wait_backfill+backfill_toofull, last acting [1]
pg 2.cb is stuck unclean for 25031.035830, current state active+degraded+remapped+wait_backfill+backfill_toofull, last acting [1,4]
pg 9.8 is stuck unclean for 24341.317696, current state active+remapped+wait_backfill+backfill_toofull, last acting [5,24]
pg 2.3 is stuck unclean for 97655.280232, current state incomplete, last acting [14,5]
pg 2.2 is stuck unclean for 97734.492834, current state active+recovery_wait+degraded+remapped, last acting [13,9]
pg 2.c4 is stuck unclean for 3595.525931, current state active+remapped+backfilling, last acting [34,9]
pg 2.c7 is stuck unclean for 8871.729496, current state active+recovery_wait+degraded, last acting [13,2]
pg 9.cb is stuck unclean for 5629.175300, current state active+remapped, last acting [11,31]
pg 9.c9 is stuck unclean for 14683.752701, current state active+remapped+wait_backfill+backfill_toofull, last acting [5,34]
pg 2.c2 is stuck unclean for 3504.738005, current state active+remapped+wait_backfill, last acting [9,15]
pg 2.bd is stuck unclean for 3571.325492, current state active+remapped+backfilling, last acting [39,9]
pg 2.bf is stuck unclean for 99913.876400, current state incomplete, last acting [15,7]
pg 9.b3 is stuck unclean for 3925.828356, current state active+remapped+wait_backfill, last acting [15,35]
pg 2.b5 is stuck unclean for 28026.340079, current state active+remapped, last acting [2,40]
pg 2.b6 is stuck unclean for 11859.834286, current state active+remapped+backfilling, last acting [1,31]
pg 2.b0 is stuck unclean for 98000.689674, current state incomplete, last acting [24,7]
pg 2.b3 is stuck unclean for 5629.182841, current state active+remapped+backfilling, last acting [3,0]
pg 2.ad is stuck unclean for 6907.677050, current state active+remapped+backfilling, last acting [2,39]
pg 2.ae is stuck unclean for 11862.967346, current state active+remapped+backfilling, last acting [34,13]
pg 9.a0 is stuck unclean for 14683.746136, current state active+remapped+wait_backfill+backfill_toofull, last acting [1,3]
pg 2.aa is stuck unclean for 3571.307756, current state active+remapped+backfilling, last acting [40,9]
pg 2.a7 is stuck unclean for 25030.658836, current state active+remapped+wait_backfill, last acting [2,1]
pg 2.a6 is stuck unclean for 3930.913873, current state active+remapped+wait_backfill+backfill_toofull, last acting [2,35]
pg 9.ad is stuck unclean for 8871.819919, current state active+recovery_wait+degraded, last acting [6,8]
pg 2.a1 is stuck unclean for 98785.889529, current state incomplete, last acting [14,12]
pg 1.a0 is stuck unclean for 5629.186426, current state active+remapped, last acting [5,40]
pg 9.a8 is stuck unclean for 118575.035913, current state incomplete, last acting [14,7]
pg 2.a2 is stuck unclean since forever, current state incomplete, last acting [5,13]
pg 2.9d is stuck unclean for 11861.496234, current state active+remapped+backfilling, last acting [6,38]
pg 2.9c is stuck unclean for 3506.888979, current state active+remapped+wait_backfill, last acting [35,11]
pg 2.9b is stuck unclean for 5629.183979, current state active+remapped+wait_backfill, last acting [6,0]
pg 9.91 is stuck unclean for 85752.028652, current state active+remapped+wait_backfill, last acting [31,9]
pg 2.97 is stuck unclean for 9736.783735, current state active+remapped+backfilling, last acting [35,24]
pg 2.91 is stuck unclean for 28553.979772, current state active+remapped+backfilling, last acting [0,24]
pg 2.90 is stuck unclean for 30364.623932, current state active+degraded+remapped+backfill_toofull, last acting [41,24]
pg 2.92 is stuck unclean for 25031.211566, current state active+undersized+degraded+remapped+backfilling, last acting [8]
pg 9.99 is stuck unclean for 11862.827419, current state active+remapped+wait_backfill, last acting [13,4]
pg 2.8f is stuck unclean for 17426.148382, current state active+remapped+wait_backfill, last acting [15,9]
pg 2.88 is stuck unclean for 3591.054564, current state active+remapped+wait_backfill, last acting [14,9]
pg 9.8f is stuck unclean for 3595.395794, current state active+remapped+wait_backfill, last acting [9,15]
pg 2.87 is stuck unclean for 3844.271547, current state active+remapped+wait_backfill+backfill_toofull, last acting [1,2]
pg 2.81 is stuck unclean for 83759.347793, current state active+undersized+degraded+remapped+wait_backfill, last acting [39]
pg 9.8a is stuck unclean for 27697.026446, current state active+remapped+wait_backfill+backfill_toofull, last acting [12,1]
pg 2.79 is stuck unclean for 12137.676488, current state active+remapped+backfilling, last acting [7,40]
pg 2.78 is stuck unclean for 29127.120125, current state active+remapped+backfilling, last acting [0,6]
pg 2.75 is stuck unclean since forever, current state down+incomplete, last acting [7,15]
pg 2.74 is stuck unclean for 97658.383751, current state incomplete, last acting [13,5]
pg 9.7c is stuck unclean for 114170.469704, current state active+undersized+degraded+remapped+wait_backfill, last acting [39]
pg 9.7d is stuck unclean for 14077.123326, current state active+remapped+backfilling, last acting [5,24]
pg 2.71 is stuck unclean for 11859.344208, current state active+remapped+wait_backfill+backfill_toofull, last acting [21,3]
pg 2.73 is stuck unclean for 11859.417605, current state active+remapped+backfilling, last acting [39,15]
pg 9.78 is stuck unclean since forever, current state incomplete, last acting [5,24]
pg 9.79 is stuck unclean for 14595.569162, current state active+remapped+wait_backfill+backfill_toofull, last acting [39,3]
pg 2.6d is stuck unclean for 27802.265038, current state active+remapped+backfilling, last acting [4,13]
pg 9.62 is stuck unclean for 25030.488507, current state active+remapped+backfill_toofull, last acting [36,2]
pg 2.6a is stuck unclean for 20323.517565, current state active+remapped+wait_backfill, last acting [6,40]
pg 9.6c is stuck unclean for 14234.077824, current state active+remapped+wait_backfill+backfill_toofull, last acting [41,2]
pg 9.6a is stuck unclean for 27035.043476, current state active+remapped+backfill_toofull, last acting [36,4]
pg 2.63 is stuck unclean for 97665.177288, current state incomplete, last acting [31,7]
pg 2.5d is stuck unclean for 3549.763078, current state active+remapped+wait_backfill, last acting [9,34]
pg 2.5e is stuck unclean for 97736.064280, current state active+remapped+wait_backfill+backfill_toofull, last acting [35,36]
pg 2.52 is stuck unclean for 8871.832670, current state active+recovery_wait+degraded, last acting [6,4]
pg 9.59 is stuck unclean for 26868.986032, current state active+remapped+wait_backfill, last acting [31,34]
pg 2.4f is stuck unclean for 12108.325792, current state active+remapped+backfilling, last acting [11,40]
pg 2.49 is stuck unclean for 30446.302835, current state active+remapped+wait_backfill, last acting [9,24]
pg 9.42 is stuck unclean for 108836.104626, current state incomplete, last acting [31,12]
pg 2.45 is stuck unclean for 11284.580305, current state active+degraded+remapped+backfilling, last acting [24,2]
pg 9.4f is stuck unclean for 3893.672356, current state active+remapped+wait_backfill, last acting [0,21]
pg 2.44 is stuck unclean for 27623.439527, current state active+recovery_wait+degraded+remapped, last acting [6,11]
pg 9.4c is stuck unclean for 6907.681859, current state active+remapped+wait_backfill, last acting [15,36]
pg 2.46 is stuck unclean for 6907.682263, current state active+remapped+backfilling, last acting [11,24]
pg 9.49 is stuck unclean for 14683.624639, current state active+remapped+wait_backfill+backfill_toofull, last acting [2,31]
pg 11.35 is stuck unclean for 5872394.444913, current state active+remapped+wait_backfill, last acting [40,36]
pg 2.3e is stuck unclean for 6907.683506, current state active+remapped+backfilling, last acting [4,41]
pg 2.38 is stuck unclean for 5140.320861, current state active+remapped+wait_backfill, last acting [0,5]
pg 2.3b is stuck unclean for 14456.624593, current state active+remapped+wait_backfill+backfill_toofull, last acting [18,2]
pg 11.33 is stuck unclean since forever, current state down+incomplete, last acting [7,6]
pg 10.3d is stuck unclean for 3595.395921, current state active+remapped+wait_backfill, last acting [9,36]
pg 2.35 is stuck unclean for 8872.226171, current state active+recovery_wait+degraded, last acting [6,11]
pg 2.fc is stuck unclean for 5820.330202, current state active+remapped+backfilling, last acting [31,0]
pg 9.3f is stuck unclean since forever, current state incomplete, last acting [5,14]
pg 2.ff is stuck unclean for 3595.396088, current state active+remapped+backfilling, last acting [9,39]
pg 2.fe is stuck unclean for 6904.439076, current state active+remapped+backfilling, last acting [21,0]
pg 9.f5 is stuck unclean for 103009.439909, current state incomplete, last acting [18,5]
pg 7.34 is stuck unclean for 3886.510000, current state active+remapped+wait_backfill, last acting [13,39]
pg 2.fb is stuck unclean for 57173.985429, current state active+recovery_wait+degraded+remapped, last acting [6,8]
pg 2.32 is stuck unclean since forever, current state incomplete, last acting [5,13]
pg 9.fe is stuck unclean for 7418.564930, current state active+recovery_wait+degraded+remapped, last acting [6,3]
pg 9.26 is stuck unclean since forever, current state incomplete, last acting [5,24]
pg 2.f7 is stuck unclean for 6915.532617, current state active+remapped+backfilling, last acting [4,15]
pg 9.fc is stuck unclean for 201476.093824, current state incomplete, last acting [13,5]
pg 7.2b is stuck undersized for 64282.169836, current state stale+active+undersized+degraded, last acting [5]
pg 2.1e is stuck undersized for 3895.207475, current state active+undersized+degraded+remapped+backfilling, last acting [13]
pg 2.de is stuck undersized for 3886.529396, current state active+undersized+degraded+remapped+backfilling, last acting [3]
pg 2.d7 is stuck undersized for 7417.316099, current state active+undersized+degraded+remapped+backfilling, last acting [36]
pg 2.d1 is stuck undersized for 6903.297196, current state active+undersized+degraded+remapped+backfill_toofull, last acting [36]
pg 2.0 is stuck undersized for 4999.401505, current state active+undersized+degraded+remapped+wait_backfill+backfill_toofull, last acting [1]
pg 2.92 is stuck undersized for 4999.406547, current state active+undersized+degraded+remapped+backfilling, last acting [8]
pg 2.81 is stuck undersized for 7417.378668, current state active+undersized+degraded+remapped+wait_backfill, last acting [39]
pg 9.7c is stuck undersized for 3894.953894, current state active+undersized+degraded+remapped+wait_backfill, last acting [39]
pg 9.25 is stuck degraded for 7413.083043, current state active+degraded+remapped+wait_backfill, last acting [15,2]
pg 7.2b is stuck degraded for 64282.169913, current state stale+active+undersized+degraded, last acting [5]
pg 2.f1 is stuck degraded for 3848.032008, current state active+recovery_wait+degraded, last acting [13,8]
pg 2.f2 is stuck degraded for 7411.108195, current state active+recovery_wait+degraded, last acting [6,4]
pg 2.27 is stuck degraded for 3893.230317, current state active+recovery_wait+degraded, last acting [13,3]
pg 2.1c is stuck degraded for 7414.316299, current state active+degraded+remapped+backfilling, last acting [14,11]
pg 2.1e is stuck degraded for 3895.207564, current state active+undersized+degraded+remapped+backfilling, last acting [13]
pg 2.de is stuck degraded for 3886.529484, current state active+undersized+degraded+remapped+backfilling, last acting [3]
pg 2.d7 is stuck degraded for 7417.316187, current state active+undersized+degraded+remapped+backfilling, last acting [36]
pg 4.8 is stuck degraded for 3490.406821, current state active+recovery_wait+degraded, last acting [15,2]
pg 2.d1 is stuck degraded for 6903.297288, current state active+undersized+degraded+remapped+backfill_toofull, last acting [36]
pg 2.0 is stuck degraded for 4999.401597, current state active+undersized+degraded+remapped+wait_backfill+backfill_toofull, last acting [1]
pg 2.cb is stuck degraded for 7413.316930, current state active+degraded+remapped+wait_backfill+backfill_toofull, last acting [1,4]
pg 2.2 is stuck degraded for 3894.930841, current state active+recovery_wait+degraded+remapped, last acting [13,9]
pg 2.c7 is stuck degraded for 3886.500328, current state active+recovery_wait+degraded, last acting [13,2]
pg 9.ad is stuck degraded for 7411.181412, current state active+recovery_wait+degraded, last acting [6,8]
pg 2.90 is stuck degraded for 3893.715235, current state active+degraded+remapped+backfill_toofull, last acting [41,24]
pg 2.92 is stuck degraded for 4999.406655, current state active+undersized+degraded+remapped+backfilling, last acting [8]
pg 2.81 is stuck degraded for 7417.378776, current state active+undersized+degraded+remapped+wait_backfill, last acting [39]
pg 9.7c is stuck degraded for 3894.954001, current state active+undersized+degraded+remapped+wait_backfill, last acting [39]
pg 2.52 is stuck degraded for 7411.108431, current state active+recovery_wait+degraded, last acting [6,4]
pg 2.45 is stuck degraded for 3892.755878, current state active+degraded+remapped+backfilling, last acting [24,2]
pg 2.44 is stuck degraded for 7411.213966, current state active+recovery_wait+degraded+remapped, last acting [6,11]
pg 2.35 is stuck degraded for 7411.295348, current state active+recovery_wait+degraded, last acting [6,11]
pg 2.fb is stuck degraded for 6903.301076, current state active+recovery_wait+degraded+remapped, last acting [6,8]
pg 9.fe is stuck degraded for 7413.453955, current state active+recovery_wait+degraded+remapped, last acting [6,3]
pg 7.2b is stuck stale for 64232.262041, current state stale+active+undersized+degraded, last acting [5]
pg 2.fc is active+remapped+backfilling, acting [31,0]
pg 2.ff is active+remapped+backfilling, acting [9,39]
pg 9.f5 is incomplete, acting [18,5]
pg 2.fe is active+remapped+backfilling, acting [21,0]
pg 2.fb is active+recovery_wait+degraded+remapped, acting [6,8]
pg 9.fe is active+recovery_wait+degraded+remapped, acting [6,3]
pg 9.fc is incomplete, acting [13,5]
pg 2.f7 is active+remapped+backfilling, acting [4,15]
pg 2.f1 is active+recovery_wait+degraded, acting [13,8]
pg 9.fb is active+remapped+wait_backfill, acting [8,39]
pg 2.f3 is active+remapped+wait_backfill, acting [6,9]
pg 2.f2 is active+recovery_wait+degraded, acting [6,4]
pg 2.ed is active+remapped+backfilling, acting [9,40]
pg 2.e8 is active+remapped+wait_backfill, acting [15,36]
pg 2.eb is active+remapped+backfilling, acting [0,31]
pg 2.ea is active+remapped+backfilling, acting [9,34]
pg 2.e0 is active+remapped+inconsistent+wait_backfill, acting [6,9]
pg 2.e3 is active+remapped+backfilling, acting [4,41]
pg 9.d6 is active+remapped+wait_backfill, acting [1,9]
pg 2.dc is active+remapped+backfilling, acting [40,4]
pg 2.df is active+remapped+wait_backfill, acting [0,13]
pg 2.de is active+undersized+degraded+remapped+backfilling, acting [3]
pg 2.d8 is active+remapped+backfilling, acting [21,41]
pg 2.db is active+remapped+backfilling, acting [9,40]
pg 9.de is incomplete, acting [6,5]
pg 9.df is active+remapped+wait_backfill+backfill_toofull, acting [35,2]
pg 9.dc is incomplete, acting [6,7]
pg 2.d7 is active+undersized+degraded+remapped+backfilling, acting [36]
pg 2.d1 is active+undersized+degraded+remapped+backfill_toofull, acting [36]
pg 3.d0 is active+remapped+wait_backfill, acting [12,2]
pg 9.d8 is down+incomplete, acting [21,5]
pg 9.d9 is active+remapped+wait_backfill+backfill_toofull, acting [39,4]
pg 9.c5 is active+remapped+wait_backfill, acting [5,38]
pg 2.cb is active+degraded+remapped+wait_backfill+backfill_toofull, acting [1,4]
pg 2.c4 is active+remapped+backfilling, acting [34,9]
pg 2.c7 is active+recovery_wait+degraded, acting [13,2]
pg 2.c2 is active+remapped+wait_backfill, acting [9,15]
pg 9.c9 is active+remapped+wait_backfill+backfill_toofull, acting [5,34]
pg 2.bd is active+remapped+backfilling, acting [39,9]
pg 2.bf is incomplete, acting [15,7]
pg 9.b3 is active+remapped+wait_backfill, acting [15,35]
pg 2.b6 is active+remapped+backfilling, acting [1,31]
pg 2.b0 is incomplete, acting [24,7]
pg 2.b3 is active+remapped+backfilling, acting [3,0]
pg 2.ad is active+remapped+backfilling, acting [2,39]
pg 2.ae is active+remapped+backfilling, acting [34,13]
pg 9.a0 is active+remapped+wait_backfill+backfill_toofull, acting [1,3]
pg 2.aa is active+remapped+backfilling, acting [40,9]
pg 2.a7 is active+remapped+wait_backfill, acting [2,1]
pg 2.a6 is active+remapped+wait_backfill+backfill_toofull, acting [2,35]
pg 9.ad is active+recovery_wait+degraded, acting [6,8]
pg 2.a1 is incomplete, acting [14,12]
pg 9.a8 is incomplete, acting [14,7]
pg 2.a2 is incomplete, acting [5,13]
pg 2.9d is active+remapped+backfilling, acting [6,38]
pg 2.9c is active+remapped+wait_backfill, acting [35,11]
pg 2.9b is active+remapped+wait_backfill, acting [6,0]
pg 9.91 is active+remapped+wait_backfill, acting [31,9]
pg 2.97 is active+remapped+backfilling, acting [35,24]
pg 2.91 is active+remapped+backfilling, acting [0,24]
pg 2.90 is active+degraded+remapped+backfill_toofull, acting [41,24]
pg 2.92 is active+undersized+degraded+remapped+backfilling, acting [8]
pg 9.99 is active+remapped+wait_backfill, acting [13,4]
pg 2.8f is active+remapped+wait_backfill, acting [15,9]
pg 2.88 is active+remapped+wait_backfill, acting [14,9]
pg 9.8f is active+remapped+wait_backfill, acting [9,15]
pg 2.87 is active+remapped+wait_backfill+backfill_toofull, acting [1,2]
pg 2.81 is active+undersized+degraded+remapped+wait_backfill, acting [39]
pg 9.8a is active+remapped+wait_backfill+backfill_toofull, acting [12,1]
pg 2.79 is active+remapped+backfilling, acting [7,40]
pg 2.78 is active+remapped+backfilling, acting [0,6]
pg 2.75 is down+incomplete, acting [7,15]
pg 2.74 is incomplete, acting [13,5]
pg 9.7c is active+undersized+degraded+remapped+wait_backfill, acting [39]
pg 9.7d is active+remapped+backfilling, acting [5,24]
pg 2.71 is active+remapped+wait_backfill+backfill_toofull, acting [21,3]
pg 2.73 is active+remapped+backfilling, acting [39,15]
pg 9.78 is incomplete, acting [5,24]
pg 9.79 is active+remapped+wait_backfill+backfill_toofull, acting [39,3]
pg 2.6d is active+remapped+backfilling, acting [4,13]
pg 9.62 is active+remapped+backfill_toofull, acting [36,2]
pg 2.6a is active+remapped+wait_backfill, acting [6,40]
pg 9.6c is active+remapped+wait_backfill+backfill_toofull, acting [41,2]
pg 9.6a is active+remapped+backfill_toofull, acting [36,4]
pg 2.63 is incomplete, acting [31,7]
pg 2.5d is active+remapped+wait_backfill, acting [9,34]
pg 2.5e is active+remapped+wait_backfill+backfill_toofull, acting [35,36]
pg 2.52 is active+recovery_wait+degraded, acting [6,4]
pg 9.59 is active+remapped+wait_backfill, acting [31,34]
pg 2.4f is active+remapped+backfilling, acting [11,40]
pg 2.49 is active+remapped+wait_backfill, acting [9,24]
pg 9.42 is incomplete, acting [31,12]
pg 2.45 is active+degraded+remapped+backfilling, acting [24,2]
pg 2.44 is active+recovery_wait+degraded+remapped, acting [6,11]
pg 9.4f is active+remapped+wait_backfill, acting [0,21]
pg 9.4c is active+remapped+wait_backfill, acting [15,36]
pg 2.46 is active+remapped+backfilling, acting [11,24]
pg 9.49 is active+remapped+wait_backfill+backfill_toofull, acting [2,31]
pg 11.35 is active+remapped+wait_backfill, acting [40,36]
pg 2.3e is active+remapped+backfilling, acting [4,41]
pg 2.38 is active+remapped+wait_backfill, acting [0,5]
pg 2.3b is active+remapped+wait_backfill+backfill_toofull, acting [18,2]
pg 11.33 is down+incomplete, acting [7,6]
pg 2.35 is active+recovery_wait+degraded, acting [6,11]
pg 10.3d is active+remapped+wait_backfill, acting [9,36]
pg 9.3f is incomplete, acting [5,14]
pg 7.34 is active+remapped+wait_backfill, acting [13,39]
pg 2.32 is incomplete, acting [5,13]
pg 9.26 is incomplete, acting [5,24]
pg 11.27 is active+remapped+wait_backfill+backfill_toofull, acting [4,36]
pg 9.25 is active+degraded+remapped+wait_backfill, acting [15,2]
pg 2.29 is active+remapped+wait_backfill, acting [24,40]
pg 9.22 is incomplete, acting [7,24]
pg 9.23 is active+remapped+wait_backfill, acting [35,9]
pg 2.2a is incomplete, acting [24,5]
pg 2.24 is active+remapped+wait_backfill, acting [13,40]
pg 2.27 is active+recovery_wait+degraded, acting [13,3]
pg 11.29 is active+remapped+wait_backfill, acting [14,40]
pg 2.1d is active+remapped+backfilling, acting [3,6]
pg 2.1c is active+degraded+remapped+backfilling, acting [14,11]
pg 11.15 is active+remapped+wait_backfill+backfill_toofull, acting [34,9]
pg 2.1f is active+remapped+wait_backfill, acting [15,3]
pg 11.16 is active+remapped+wait_backfill, acting [15,40]
pg 2.1e is active+undersized+degraded+remapped+backfilling, acting [13]
pg 0.1c is active+remapped+wait_backfill, acting [12,3]
pg 2.18 is active+remapped+backfilling, acting [18,9]
pg 11.13 is active+remapped+wait_backfill, acting [14,8]
pg 2.15 is incomplete, acting [7,31]
pg 11.1c is down+incomplete, acting [6,7]
pg 9.1e is incomplete, acting [7,15]
pg 2.14 is active+remapped+backfilling, acting [18,9]
pg 11.1e is active+remapped+wait_backfill, acting [3,38]
pg 9.1d is active+remapped+backfill_toofull, acting [12,36]
pg 11.19 is active+remapped+wait_backfill, acting [15,38]
pg 7.15 is active+remapped+wait_backfill, acting [13,2]
pg 2.13 is incomplete, acting [7,10]
pg 7.16 is incomplete, acting [6,7]
pg 9.18 is active+remapped+backfill_toofull, acting [18,13]
pg 2.d is incomplete, acting [5,10]
pg 9.6 is active+remapped+backfill_toofull, acting [2,41]
pg 7.a is active+remapped+wait_backfill, acting [38,2]
pg 4.8 is active+recovery_wait+degraded, acting [15,2]
pg 9.5 is incomplete, acting [5,18]
pg 9.3 is incomplete, acting [7,15]
pg 2.b is active+remapped+backfilling, acting [40,24]
pg 9.1 is active+remapped+wait_backfill+backfill_toofull, acting [39,3]
pg 11.d is active+remapped+wait_backfill+backfill_toofull, acting [36,4]
pg 9.a is incomplete, acting [18,7]
pg 2.0 is active+undersized+degraded+remapped+wait_backfill+backfill_toofull, acting [1]
pg 11.9 is active+remapped+wait_backfill, acting [21,39]
pg 2.3 is incomplete, acting [14,5]
pg 9.8 is active+remapped+wait_backfill+backfill_toofull, acting [5,24]
pg 2.2 is active+recovery_wait+degraded+remapped, acting [13,9]
33 ops are blocked > 16777.2 sec
368 ops are blocked > 8388.61 sec
238 ops are blocked > 4194.3 sec
87 ops are blocked > 1048.58 sec
2 ops are blocked > 8388.61 sec on osd.5
98 ops are blocked > 4194.3 sec on osd.5
98 ops are blocked > 8388.61 sec on osd.6
1 ops are blocked > 8388.61 sec on osd.7
27 ops are blocked > 4194.3 sec on osd.7
12 ops are blocked > 4194.3 sec on osd.13
87 ops are blocked > 1048.58 sec on osd.13
2 ops are blocked > 16777.2 sec on osd.14
98 ops are blocked > 8388.61 sec on osd.14
3 ops are blocked > 16777.2 sec on osd.15
97 ops are blocked > 8388.61 sec on osd.15
1 ops are blocked > 4194.3 sec on osd.18
100 ops are blocked > 4194.3 sec on osd.24
28 ops are blocked > 16777.2 sec on osd.31
72 ops are blocked > 8388.61 sec on osd.31
9 osds have slow requests
recovery 59636/5032695 objects degraded (1.185%)
recovery 1280976/5032695 objects misplaced (25.453%)
1 scrub errors
noscrub,nodeep-scrub flag(s) set


The OSDs on the first failed host are 6, 13, 14, 15, 18, 24, 31.

The OSDs on the second host that went down are 5 and 7.
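
To cross-check the OSD lists above against the health output, something like the following might help. It is only a sketch: the function name `incomplete_pgs_on` is made up for illustration, and it assumes the `pg ... is ..., acting [...]` line format shown in this thread. It prints each incomplete PG whose acting set contains one of the given OSD ids.

```shell
#!/bin/sh
# Print incomplete PGs whose acting set contains any of the given OSD ids.
# Reads `ceph health detail` lines on stdin; OSD ids are the arguments.
# (Hypothetical helper, not part of Ceph.)
incomplete_pgs_on() {
    awk -v ids="$*" '
        BEGIN { n = split(ids, a, " "); for (i = 1; i <= n; i++) want[a[i]] = 1 }
        $1 == "pg" && $4 ~ /incomplete/ {
            set = $NF
            gsub(/\[|\]/, "", set)       # strip the [ ] around the acting set
            m = split(set, osd, ",")
            for (i = 1; i <= m; i++)
                if (osd[i] in want) { print $2, $NF; break }
        }'
}

# Sample lines copied from the health output in this thread.
incomplete_pgs_on 6 13 14 15 18 24 31 5 7 <<'EOF'
pg 2.79 is active+remapped+backfilling, acting [7,40]
pg 2.75 is down+incomplete, acting [7,15]
pg 2.74 is incomplete, acting [13,5]
pg 9.78 is incomplete, acting [5,24]
EOF
```

On a live cluster the same function could be fed directly, e.g. `ceph health detail | incomplete_pgs_on 5 7`.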



On Sun, 2 Sep 2018 at 15:15, David Turner <drakonstein@xxxxxxxxx> wrote:
When the first node went offline with a dead SSD journal, all of the data on its OSDs became useless. Unless you can flush the journals, you can't guarantee that a write the cluster thinks happened actually made it to the disk.  The proper procedure here is to remove those OSDs and add them again as new OSDs.
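
A rough sketch of what removing those OSDs could look like is below. It is a dry run: it only prints the commands for review and executes nothing. The sequence follows the usual manual OSD removal steps for releases of this era and assumes the sysvinit-style `service ceph` wrapper used elsewhere in this thread; the function name `print_osd_removal` is hypothetical. With only two replicas it is probably safer to take the OSDs out and let the cluster settle before going further.

```shell
#!/bin/sh
# Dry run: print (do not execute) the manual removal steps for each OSD id.
# Hypothetical helper; review the output before running any of it by hand.
print_osd_removal() {
    for id in "$@"; do
        echo "ceph osd out $id"                 # stop mapping data to the OSD
        echo "sudo service ceph stop osd.$id"   # stop the daemon on its host
        echo "ceph osd crush remove osd.$id"    # drop it from the CRUSH map
        echo "ceph auth del osd.$id"            # delete its auth key
        echo "ceph osd rm $id"                  # remove it from the OSD map
    done
}

# OSD ids on the first failed host, as listed in this thread.
print_osd_removal 6 13 14 15 18 24 31
```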

`ceph health detail` will give you some more information on the blocked requests. Depending on what that shows, you can often find the OSD that is causing the problems.  But your biggest problem is that you have disks with potentially inconsistent data in your cluster.
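
One way to pull the likely culprits out of that output is to total the blocked ops per OSD. The snippet below is a minimal sketch, assuming the `N ops are blocked > T sec on osd.X` line format seen in this thread; `rank_blocked_osds` is a made-up name.

```shell
#!/bin/sh
# Sum blocked ops per OSD from `ceph health detail` output on stdin,
# then list the OSDs with the most blocked ops first.
# (Hypothetical helper, not part of Ceph.)
rank_blocked_osds() {
    awk '/ops are blocked/ && / on osd\./ { ops[$NF] += $1 }
         END { for (o in ops) print ops[o], o }' |
        sort -k1,1nr -k2,2
}

# Sample lines copied from the health output in this thread.
rank_blocked_osds <<'EOF'
33 ops are blocked > 16777.2 sec
2 ops are blocked > 8388.61 sec on osd.5
98 ops are blocked > 4194.3 sec on osd.5
1 ops are blocked > 4194.3 sec on osd.18
28 ops are blocked > 16777.2 sec on osd.31
72 ops are blocked > 8388.61 sec on osd.31
EOF
```

Against a live cluster this would be run as `ceph health detail | rank_blocked_osds`. Lines without an `on osd.X` suffix are cluster-wide totals and are skipped.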

On Sun, Sep 2, 2018, 4:42 AM Lee <lquince@xxxxxxxxx> wrote:
We are running 0.94.5 as part of an OpenStack environment. Our Ceph setup is 3x OSD nodes and 3x MON nodes. Yesterday we had an aircon outage in our hosting environment. One OSD node failed (offline with the journal SSD dead), leaving 2 nodes running correctly. Two hours later a second OSD node failed, complaining of read/write errors to the physical drives. I assume this was a heat issue, as when rebooted it came back online OK and Ceph started to repair itself. We have since brought the first failed node back online by replacing the SSD and recreating the journals, hoping it would all repair. Our pools are min 2 repl.

The problem we have is that client IO (read) is totally blocked, and when I query the stuck PGs it just hangs.

For example, the check version command just errors with:

Error EINTR: problem getting command descriptions from

on various OSDs, so I cannot even query the inactive PGs.

root@node31-a4:~# ceph -s
    cluster 7c24e1b9-24b3-4a1b-8889-9b2d7fd88cd2
     health HEALTH_WARN
            83 pgs backfill
            2 pgs backfill_toofull
            3 pgs backfilling
            48 pgs degraded
            1 pgs down
            31 pgs incomplete
            1 pgs recovering
            29 pgs recovery_wait
            1 pgs stale
            48 pgs stuck degraded
            31 pgs stuck inactive
            1 pgs stuck stale
            148 pgs stuck unclean
            17 pgs stuck undersized
            17 pgs undersized
            599 requests are blocked > 32 sec
            recovery 111489/4697618 objects degraded (2.373%)
            recovery 772268/4697618 objects misplaced (16.440%)
            recovery 1/2171314 unfound (0.000%)
            election epoch 198, quorum 0,1,2 bc07s12-a7,bc07s14-a7,bc07s13-a7
     osdmap e18727: 25 osds: 25 up, 25 in; 90 remapped pgs
      pgmap v70996322: 1792 pgs, 13 pools, 8210 GB data, 2120 kobjects
            16783 GB used, 6487 GB / 23270 GB avail
            111489/4697618 objects degraded (2.373%)
            772268/4697618 objects misplaced (16.440%)
            1/2171314 unfound (0.000%)
                1639 active+clean
                  66 active+remapped+wait_backfill
                  30 incomplete
                  25 active+recovery_wait+degraded
                  15 active+undersized+degraded+remapped+wait_backfill
                   4 active+recovery_wait+degraded+remapped
                   4 active+clean+scrubbing
                   2 active+remapped+wait_backfill+backfill_toofull
                   1 down+incomplete
                   1 active+remapped+backfilling
                   1 active+clean+scrubbing+deep
                   1 stale+active+undersized+degraded
                   1 active+undersized+degraded+remapped+backfilling
                   1 active+degraded+remapped+backfilling
                   1 active+recovering+degraded
recovery io 29385 kB/s, 7 objects/s
  client io 5877 B/s wr, 1 op/s

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
