Re: PGs down

Sorry for the delay, Igor; answers inline.

On Mon, Dec 14, 2020 at 2:09 AM Igor Fedotov <ifedotov@xxxxxxx> wrote:

> Hi Jeremy,
>
> I think you lost the data for OSD.11 & .12. I'm not aware of any reliable
> enough way to recover RocksDB from this sort of error.
>
> Theoretically you might want to disable auto compaction of RocksDB for
> these daemons, try to bring them up, and then attempt to drain the data out
> of them to other OSDs. Since the log you shared shows the error occurring
> during compaction, there is some chance that during regular operation the
> OSD won't need the broken data (at least for some time). In fact I've never
> heard of anyone trying this approach, so this would be a pretty cutting-edge
> investigation...
>
Will attempt to disable compaction and report.
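Roughly the plan, for the record; the option string, systemd unit names, and
paths below are assumptions based on defaults rather than values pulled from
this cluster:

   # 1) Per-daemon override in /etc/ceph/ceph.conf on the node hosting osd.11/12.
   #    bluestore_rocksdb_options replaces the whole option string, so append
   #    disable_auto_compactions=true to the existing defaults rather than
   #    setting it on its own:
   #
   #    [osd.11]
   #    bluestore_rocksdb_options = <current defaults>,disable_auto_compactions=true
   #    [osd.12]
   #    bluestore_rocksdb_options = <current defaults>,disable_auto_compactions=true
   #
   # 2) Try to start the daemons:
   systemctl start ceph-osd@11
   systemctl start ceph-osd@12
   #
   # 3) If they come up and peer, leave them out (they already are, per the
   #    osd tree) so backfill drains whatever it can onto healthy OSDs:
   ceph -w
   ceph pg dump pgs_brief | grep -Ei 'down|recover|backfill'
   #
   # 4) Only if the drain clearly fails would I then mark them lost:
   # ceph osd lost 11 --yes-i-really-mean-it
   # ceph osd lost 12 --yes-i-really-mean-it

If they still won't start at all, exporting the needed PG shards offline with
ceph-objectstore-tool (--op export, then --op import on a healthy OSD) would
be the next, equally unproven, thing to try.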

> Honestly the chance of 100% success is pretty low but some additional data
> might be saved.
>
>
> Back to the root cause of the DB corruption itself.
>
> As it looks like we have some data consistency issues with RocksDB in the
> latest Octopus and Nautilus releases, I'm currently trying to collect
> stats for the known cases. Hence I'd highly appreciate it if you could
> answer the following questions:
>
> 1) Have I got it right that the hardware issue happened on the same node
> where OSD.11 & .12 are located? Or are they on a different node, but crashed
> after the hardware failure happened to that node and have been unable to
> start since then?
>
The hardware failure happened on the node housing osd.11 and osd.12. There
have been intermittent OSD stability issues on other nodes, but no hardware
failure, and nothing unrecoverable.

> 2) If they're on the same node, do they have standalone DB/WAL volumes?
> If so, have you checked those for hardware failures as well?
>

They had separate DB/WAL volumes on a shared SSD; my first suspicion was that
SSD as a shared failure point, but I have found no evidence of I/O issues on it.
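A check along these lines should surface any media or interface errors on
that SSD (the device name is a placeholder):

   smartctl -a /dev/sdX         # overall SMART health plus the device error log
   smartctl -l error /dev/sdX   # just the ATA error log entries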

> 3) Not sure if it makes sense but just in case - have you checked dmesg
> output for any disk errors as well?
>
I had, which is why I became aware of the SATA failure.
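For anyone following along, checks of roughly this shape are what turn this
sort of thing up (exact match patterns will vary):

   dmesg -T | grep -iE 'ata[0-9]+|i/o error|reset'
   journalctl -k --since "-7 days" | grep -iE 'ata|i/o error'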

> 4) Have you performed a Ceph upgrade recently? Or more generally, was the
> cluster deployed with the current Ceph version, or an earlier one?
>
The cluster was deployed with 14.2.9, IIRC; it was likely at 14.2.11 when the
initial failure occurred.
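If it helps with the stats you're collecting, the versions currently running
are easy to confirm:

   ceph versions            # summary of versions by daemon type
   ceph tell osd.* version  # version reported by each running OSD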

>
> Thanks,
>
> Igor
>
>
> On 12/14/2020 5:05 AM, Jeremy Austin wrote:
>
> OSD 12 looks much the same. I don't have logs back to the original date,
> but this looks very similar — db/sst corruption. The standard fsck
> approaches couldn't fix it. I believe it was a form of ATA failure — OSD 11
> and 12, if I recall correctly, did not actually experience
> smartd-reportable errors. (Essentially, the fans died on an internal SATA
> enclosure. As the enclosure had no sensor mechanism, I didn't realize it
> until drive temps started to climb. I believe most of the drives survived
> OK, but I ultimately had to bypass the enclosure itself completely, even
> after replacing the fans.)
>
> My assumption, once ceph fsck approaches failed, was that I'd need to mark
> 11 and 12 (and maybe 4) as lost, but I was reluctant to do so until I
> confirmed that I had absolutely lost data beyond recall.
>
> On Sat, Dec 12, 2020 at 10:24 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
>
>> Hi Jeremy,
>>
>> Wondering what the OSDs' logs showed when they crashed for the first time?
>>
>> And does OSD.12 report a similar problem now:
>>
>> 3> 2020-12-12 20:23:45.756 7f2d21404700 -1 rocksdb: submit_common error:
>> Corruption: block checksum mismatch: expected 3113305400, got 1242690251 in
>> db/000348.sst offset 47935290 size 4704 code = 2 Rocksdb transaction:
>>
>> ?
>>
>> Thanks,
>> Igor
>> On 12/13/2020 8:48 AM, Jeremy Austin wrote:
>>
>> I could use some input from more experienced folks…
>>
>> First time seeing this behavior. I've been running ceph in production
>> (replicated) since 2016 or earlier.
>>
>> This, however, is a small 3-node cluster for testing EC. Crush map rules
>> should sustain the loss of an entire node.
>> Here's the EC rule:
>>
>> rule cephfs425 {
>>     id 6
>>     type erasure
>>     min_size 3
>>     max_size 6
>>     step set_chooseleaf_tries 40
>>     step set_choose_tries 400
>>     step take default
>>     step choose indep 3 type host
>>     step choose indep 2 type osd
>>     step emit
>> }
>>
>>
>> I had actual hardware failure on one node. Interestingly, this appears to
>> have resulted in data loss. OSDs began to crash in a cascade on other nodes
>> (i.e., nodes with no known hardware failure). Not a low RAM problem.
>>
>> I could use some pointers on how to get the down PGs back up — I *think*
>> there are enough EC shards, even disregarding the OSDs that crash on start.
>>
>> nautilus 14.2.15
>>
>>  ceph osd tree
>> ID  CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
>>  -1       54.75960 root default
>> -10       16.81067     host sumia
>>   1   hdd  5.57719         osd.1       up  1.00000 1.00000
>>   5   hdd  5.58469         osd.5       up  1.00000 1.00000
>>   6   hdd  5.64879         osd.6       up  1.00000 1.00000
>>  -7       16.73048     host sumib
>>   0   hdd  5.57899         osd.0       up  1.00000 1.00000
>>   2   hdd  5.56549         osd.2       up  1.00000 1.00000
>>   3   hdd  5.58600         osd.3       up  1.00000 1.00000
>>  -3       21.21844     host tower1
>>   4   hdd  3.71680         osd.4       up        0 1.00000
>>   7   hdd  1.84799         osd.7       up  1.00000 1.00000
>>   8   hdd  3.71680         osd.8       up  1.00000 1.00000
>>   9   hdd  1.84929         osd.9       up  1.00000 1.00000
>>  10   hdd  2.72899         osd.10      up  1.00000 1.00000
>>  11   hdd  3.71989         osd.11    down        0 1.00000
>>  12   hdd  3.63869         osd.12    down        0 1.00000
>>
>>   cluster:
>>     id:     d0b4c175-02ba-4a64-8040-eb163002cba6
>>     health: HEALTH_ERR
>>             1 MDSs report slow requests
>>             4/4239345 objects unfound (0.000%)
>>             Too many repaired reads on 3 OSDs
>>             Reduced data availability: 7 pgs inactive, 7 pgs down
>>             Possible data damage: 4 pgs recovery_unfound
>>             Degraded data redundancy: 95807/24738783 objects degraded
>> (0.387%), 4 pgs degraded, 3 pgs undersized
>>             7 pgs not deep-scrubbed in time
>>             7 pgs not scrubbed in time
>>
>>   services:
>>     mon: 3 daemons, quorum sumib,tower1,sumia (age 4d)
>>     mgr: sumib(active, since 7d), standbys: sumia, tower1
>>     mds: cephfs:1 {0=sumib=up:active} 2 up:standby
>>     osd: 13 osds: 11 up (since 3d), 10 in (since 4d); 3 remapped pgs
>>
>>   data:
>>     pools:   5 pools, 256 pgs
>>     objects: 4.24M objects, 15 TiB
>>     usage:   24 TiB used, 24 TiB / 47 TiB avail
>>     pgs:     2.734% pgs not active
>>              95807/24738783 objects degraded (0.387%)
>>              47910/24738783 objects misplaced (0.194%)
>>              4/4239345 objects unfound (0.000%)
>>              245 active+clean
>>              7   down
>>              3   active+recovery_unfound+undersized+degraded+remapped
>>              1   active+recovery_unfound+degraded+repair
>>
>>   progress:
>>     Rebalancing after osd.12 marked out
>>       [============================..]
>>     Rebalancing after osd.4 marked out
>>       [=============================.]
>>
>> A snippet from an example down PG:
>>     "up": [
>>         3,
>>         2,
>>         5,
>>         1,
>>         8,
>>         9
>>     ],
>>     "acting": [
>>         3,
>>         2,
>>         5,
>>         1,
>>         8,
>>         9
>>     ],
>> <snip>
>>          ],
>>             "blocked": "peering is blocked due to down osds",
>>             "down_osds_we_would_probe": [
>>                 11,
>>                 12
>>             ],
>>             "peering_blocked_by": [
>>                 {
>>                     "osd": 11,
>>                     "current_lost_at": 0,
>>                     "comment": "starting or marking this osd lost may let
>> us proceed"
>>                 },
>>                 {
>>                     "osd": 12,
>>                     "current_lost_at": 0,
>>                     "comment": "starting or marking this osd lost may let
>> us proceed"
>>                 }
>>             ]
>>         },
>>         {
>>
>> Oddly, these OSDs possibly did NOT experience hardware failure. However,
>> they won't start -- see pastebin for ceph-osd.11.log
>> https://pastebin.com/6U6sQJuJ
>>
>>
>> HEALTH_ERR 1 MDSs report slow requests; 4/4239345 objects unfound (0.000%);
>> Too many repaired reads on 3 OSDs; Reduced data availability
>> : 7 pgs inactive, 7 pgs down; Possible data damage: 4 pgs recovery_unfound;
>> Degraded data redundancy: 95807/24738783 objects degraded (0
>> .387%), 4 pgs degraded, 3 pgs undersized; 7 pgs not deep-scrubbed in time;
>> 7 pgs not scrubbed in time
>> MDS_SLOW_REQUEST 1 MDSs report slow requests
>>     mdssumib(mds.0): 42 slow requests are blocked > 30 secs
>> OBJECT_UNFOUND 4/4239345 objects unfound (0.000%)
>>     pg 19.5 has 1 unfound objects
>>     pg 15.2f has 1 unfound objects
>>     pg 15.41 has 1 unfound objects
>>     pg 15.58 has 1 unfound objects
>> OSD_TOO_MANY_REPAIRS Too many repaired reads on 3 OSDs
>>     osd.9 had 9664 reads repaired
>>     osd.7 had 9665 reads repaired
>>     osd.4 had 12 reads repaired
>> PG_AVAILABILITY Reduced data availability: 7 pgs inactive, 7 pgs down
>>     pg 15.10 is down, acting [3,2,5,1,8,9]
>>     pg 15.1e is down, acting [5,1,9,8,2,3]
>>     pg 15.40 is down, acting [7,10,1,5,3,2]
>>     pg 15.4a is down, acting [0,3,5,6,9,10]
>>     pg 15.6a is down, acting [3,2,6,1,10,8]
>>     pg 15.71 is down, acting [3,2,1,6,8,10]
>>     pg 15.76 is down, acting [2,0,6,5,10,9]
>> PG_DAMAGED Possible data damage: 4 pgs recovery_unfound
>>     pg 15.2f is active+recovery_unfound+undersized+degraded+remapped,
>> acting [5,1,0,3,2147483647,7], 1 unfound
>>     pg 15.41 is active+recovery_unfound+undersized+degraded+remapped,
>> acting [5,1,0,3,2147483647,2147483647], 1 unfound
>>     pg 15.58 is active+recovery_unfound+undersized+degraded+remapped,
>> acting [10,2147483647,2,3,1,5], 1 unfound
>>     pg 19.5 is active+recovery_unfound+degraded+repair, acting
>> [3,2,5,1,8,10], 1 unfound
>> PG_DEGRADED Degraded data redundancy: 95807/24738783 objects degraded
>> (0.387%), 4 pgs degraded, 3 pgs undersized
>>     pg 15.2f is stuck undersized for 635305.932075, current state
>> active+recovery_unfound+undersized+degraded+remapped, last acting
>> [5,1,0,3,2147483647,7]
>>     pg 15.41 is stuck undersized for 364298.836902, current state
>> active+recovery_unfound+undersized+degraded+remapped, last acting
>> [5,1,0,3,2147483647,2147483647]
>>     pg 15.58 is stuck undersized for 384461.110229, current state
>> active+recovery_unfound+undersized+degraded+remapped, last acting
>> [10,2147483647,2,3,1,5]
>>     pg 19.5 is active+recovery_unfound+degraded+repair, acting
>> [3,2,5,1,8,10], 1 unfound
>> PG_NOT_DEEP_SCRUBBED 7 pgs not deep-scrubbed in time
>>     pg 15.76 not deep-scrubbed since 2020-10-21 14:30:03.935228
>>     pg 15.71 not deep-scrubbed since 2020-10-21 12:20:46.235792
>>     pg 15.6a not deep-scrubbed since 2020-10-21 07:52:33.914083
>>     pg 15.10 not deep-scrubbed since 2020-10-22 03:24:40.465367
>>     pg 15.1e not deep-scrubbed since 2020-10-22 10:37:36.169959
>>     pg 15.40 not deep-scrubbed since 2020-10-23 05:33:35.208748
>>     pg 15.4a not deep-scrubbed since 2020-10-22 05:14:06.981035
>> PG_NOT_SCRUBBED 7 pgs not scrubbed in time
>>     pg 15.76 not scrubbed since 2020-10-24 08:12:40.090831
>>     pg 15.71 not scrubbed since 2020-10-25 05:22:40.573572
>>     pg 15.6a not scrubbed since 2020-10-24 15:03:09.189964
>>     pg 15.10 not scrubbed since 2020-10-24 16:25:08.826981
>>     pg 15.1e not scrubbed since 2020-10-24 16:05:03.080127
>>     pg 15.40 not scrubbed since 2020-10-24 11:58:04.290488
>>     pg 15.4a not scrubbed since 2020-10-24 11:32:44.573551
>>
>>
>
> --
> Jeremy Austin
> jhaustin@xxxxxxxxx
>
>

-- 
Jeremy Austin
jhaustin@xxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



