Sorry for the delay, Igor; answers inline.

On Mon, Dec 14, 2020 at 2:09 AM Igor Fedotov <ifedotov@xxxxxxx> wrote:

> Hi Jeremy,
>
> I think you lost the data for OSD.11 & .12. I'm not aware of any reliable
> enough way to recover RocksDB from this sort of error.
>
> Theoretically you might want to disable auto compaction for RocksDB for
> these daemons, try to bring them up, and then attempt to drain the data
> out of them to different OSDs. As the log you shared currently shows an
> error during compaction, there is some chance that during regular
> operation the OSD wouldn't need the broken data (at least for some time).
> In fact I've never heard of anyone trying this approach, so this would be
> a pretty cutting-edge investigation...
>
Will attempt to disable compaction and report.
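In case it's useful to anyone following the thread, the knob I plan to poke at is bluestore_rocksdb_options. My understanding (please correct me if wrong) is that setting it replaces the default option string rather than appending to it, so the idea is to take the default that "ceph config help bluestore_rocksdb_options" reports and tack disable_auto_compactions=true onto the end, for just these two OSDs, in ceph.conf on tower1. Roughly:

    [osd.11]
            bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,disable_auto_compactions=true

    [osd.12]
            bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,disable_auto_compactions=true

and then try to start them:

    systemctl start ceph-osd@11 ceph-osd@12

(The long string is the 14.2.x default as I read it, with disable_auto_compactions appended; worth double-checking against your own "config help" output before trusting it.)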
> Honestly the chance of 100% success is pretty low, but some additional
> data might be saved.
>
> Back to the root causes of the DB corruption itself.
>
> As it looks like we have some data consistency issues with RocksDB in the
> latest Octopus and Nautilus releases, I'm currently trying to collect
> stats for the known cases. Hence I'd highly appreciate it if you could
> answer the following questions:
>
> 1) Have I got this right that the hardware issue happened on the same
> node where OSD.11 & .12 are located? Or are they on a different one, but
> crashed after the hardware failure happened to that node and have been
> unable to start since then?
>
Hardware failure happened on the node housing osd.11 and osd.12. There have
been intermittent OSD stability issues on other nodes, but no hardware
failure, and nothing unrecoverable.

> 2) If they're on the same node - do they have standalone DB/WAL volumes?
> If so, have you checked them for hardware failures as well?
>
They had a separate DB/WAL on a shared SSD; my first suspicion was that SSD
as a shared failure point, but I have found no evidence of I/O issues to
the SSD.

> 3) Not sure if it makes sense, but just in case - have you checked dmesg
> output for any disk errors as well?
>
I had, which is why I became aware of the SATA failure.

> 4) Have you performed a Ceph upgrade recently? Or more generally - was
> the cluster deployed with the current Ceph version, or was it an earlier
> one?
>
Cluster deployed with 14.2.9, IIRC; was likely at 14.2.11 when the initial
failure occurred.

> Thanks,
>
> Igor
>
> On 12/14/2020 5:05 AM, Jeremy Austin wrote:
>
> OSD 12 looks much the same. I don't have logs back to the original date,
> but this looks very similar — db/sst corruption. The standard fsck
> approaches couldn't fix it. I believe it was a form of ATA failure — OSD
> 11 and 12, if I recall correctly, did not actually experience
> SMARTD-reportable errors. (Essentially, fans died on an internal SATA
> enclosure. As the enclosure had no sensor mechanism, I didn't realize it
> until drive temps started to climb. I believe most of the drives survived
> OK, but the enclosure itself I ultimately had to completely bypass, even
> after replacing fans.)
>
> My assumption, once ceph fsck approaches failed, was that I'd need to
> mark 11 and 12 (and maybe 4) as lost, but I was reluctant to do so until
> I confirmed that I had absolutely lost data beyond recall.
>
> On Sat, Dec 12, 2020 at 10:24 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
>
>> Hi Jeremy,
>>
>> wondering what were the OSDs' logs when they crashed for the first time?
>>
>> And does OSD.12 report a similar problem for now:
>>
>>     3> 2020-12-12 20:23:45.756 7f2d21404700 -1 rocksdb: submit_common
>>     error: Corruption: block checksum mismatch: expected 3113305400,
>>     got 1242690251 in db/000348.sst offset 47935290 size 4704 code = 2
>>     Rocksdb transaction:
>>
>> ?
>>
>> Thanks,
>> Igor
>>
>> On 12/13/2020 8:48 AM, Jeremy Austin wrote:
>>
>> I could use some input from more experienced folks…
>>
>> First time seeing this behavior. I've been running ceph in production
>> (replicated) since 2016 or earlier.
>>
>> This, however, is a small 3-node cluster for testing EC. Crush map rules
>> should sustain the loss of an entire node. Here's the EC rule:
>>
>> rule cephfs425 {
>>     id 6
>>     type erasure
>>     min_size 3
>>     max_size 6
>>     step set_chooseleaf_tries 40
>>     step set_choose_tries 400
>>     step take default
>>     step choose indep 3 type host
>>     step choose indep 2 type osd
>>     step emit
>> }
>>
>> I had actual hardware failure on one node. Interestingly, this appears
>> to have resulted in data loss. OSDs began to crash in a cascade on other
>> nodes (i.e., nodes with no known hardware failure). Not a low-RAM
>> problem.
>>
>> I could use some pointers about how to get the down PGs back up — I
>> *think* there are enough EC shards, even disregarding the OSDs that
>> crash on start.
>>
>> nautilus 14.2.15
>>
>> ceph osd tree
>> ID  CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
>>  -1       54.75960 root default
>> -10       16.81067     host sumia
>>   1   hdd  5.57719         osd.1       up  1.00000 1.00000
>>   5   hdd  5.58469         osd.5       up  1.00000 1.00000
>>   6   hdd  5.64879         osd.6       up  1.00000 1.00000
>>  -7       16.73048     host sumib
>>   0   hdd  5.57899         osd.0       up  1.00000 1.00000
>>   2   hdd  5.56549         osd.2       up  1.00000 1.00000
>>   3   hdd  5.58600         osd.3       up  1.00000 1.00000
>>  -3       21.21844     host tower1
>>   4   hdd  3.71680         osd.4       up        0 1.00000
>>   7   hdd  1.84799         osd.7       up  1.00000 1.00000
>>   8   hdd  3.71680         osd.8       up  1.00000 1.00000
>>   9   hdd  1.84929         osd.9       up  1.00000 1.00000
>>  10   hdd  2.72899         osd.10      up  1.00000 1.00000
>>  11   hdd  3.71989         osd.11    down        0 1.00000
>>  12   hdd  3.63869         osd.12    down        0 1.00000
>>
>>   cluster:
>>     id:     d0b4c175-02ba-4a64-8040-eb163002cba6
>>     health: HEALTH_ERR
>>             1 MDSs report slow requests
>>             4/4239345 objects unfound (0.000%)
>>             Too many repaired reads on 3 OSDs
>>             Reduced data availability: 7 pgs inactive, 7 pgs down
>>             Possible data damage: 4 pgs recovery_unfound
>>             Degraded data redundancy: 95807/24738783 objects degraded
>>             (0.387%), 4 pgs degraded, 3 pgs undersized
>>             7 pgs not deep-scrubbed in time
>>             7 pgs not scrubbed in time
>>
>>   services:
>>     mon: 3 daemons, quorum sumib,tower1,sumia (age 4d)
>>     mgr: sumib(active, since 7d), standbys: sumia, tower1
>>     mds: cephfs:1 {0=sumib=up:active} 2 up:standby
>>     osd: 13 osds: 11 up (since 3d), 10 in (since 4d); 3 remapped pgs
>>
>>   data:
>>     pools:   5 pools, 256 pgs
>>     objects: 4.24M objects, 15 TiB
>>     usage:   24 TiB used, 24 TiB / 47 TiB avail
>>     pgs:     2.734% pgs not active
>>              95807/24738783 objects degraded (0.387%)
>>              47910/24738783 objects misplaced (0.194%)
>>              4/4239345 objects unfound (0.000%)
>>              245 active+clean
>>              7   down
>>              3   active+recovery_unfound+undersized+degraded+remapped
>>              1   active+recovery_unfound+degraded+repair
>>
>>   progress:
>>     Rebalancing after osd.12 marked out
>>       [============================..]
>>     Rebalancing after osd.4 marked out
>>       [=============================.]
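(A note for anyone reproducing this: the per-PG detail below is what I get from querying one of the down PGs, along the lines of

    ceph pg dump_stuck inactive    # list the stuck/inactive PGs
    ceph pg 15.10 query            # peering state, including what is blocking it

with 15.10 standing in for whichever down PG you're looking at.)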
>>
>> A snippet from an example down pg:
>>
>>     "up": [
>>         3,
>>         2,
>>         5,
>>         1,
>>         8,
>>         9
>>     ],
>>     "acting": [
>>         3,
>>         2,
>>         5,
>>         1,
>>         8,
>>         9
>>     ],
>>     <snip>
>>     ],
>>     "blocked": "peering is blocked due to down osds",
>>     "down_osds_we_would_probe": [
>>         11,
>>         12
>>     ],
>>     "peering_blocked_by": [
>>         {
>>             "osd": 11,
>>             "current_lost_at": 0,
>>             "comment": "starting or marking this osd lost may let us proceed"
>>         },
>>         {
>>             "osd": 12,
>>             "current_lost_at": 0,
>>             "comment": "starting or marking this osd lost may let us proceed"
>>         }
>>     ]
>> },
>> {
>>
>> Oddly, these OSDs possibly did NOT experience hardware failure. However,
>> they won't start -- see pastebin for ceph-osd.11.log:
>> https://pastebin.com/6U6sQJuJ
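If disabling auto compaction doesn't keep osd.11/12 up long enough to drain, my fallback idea is to try pulling the shards the down PGs need out of the stopped OSDs with ceph-objectstore-tool and importing them into a healthy OSD. This is only a sketch at this point; I have not confirmed the tool can even open a store whose RocksDB is corrupt, and the shard id and target OSD below are placeholders:

    # on tower1, with the failed OSD stopped: see which PG shards it holds
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 --op list-pgs

    # export one of the shards a down PG needs (e.g. 15.10s2)
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 \
        --op export --pgid 15.10s2 --file /root/15.10s2.export

    # import it into a healthy OSD (stopped for the import), then start it
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 \
        --op import --file /root/15.10s2.export

Only if all of that fails would I look at "ceph osd lost 11" / "ceph osd lost 12" and marking the unfound objects lost.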
>>
>> HEALTH_ERR 1 MDSs report slow requests; 4/4239345 objects unfound
>> (0.000%); Too many repaired reads on 3 OSDs; Reduced data availability:
>> 7 pgs inactive, 7 pgs down; Possible data damage: 4 pgs recovery_unfound;
>> Degraded data redundancy: 95807/24738783 objects degraded (0.387%),
>> 4 pgs degraded, 3 pgs undersized; 7 pgs not deep-scrubbed in time;
>> 7 pgs not scrubbed in time
>> MDS_SLOW_REQUEST 1 MDSs report slow requests
>>     mdssumib(mds.0): 42 slow requests are blocked > 30 secs
>> OBJECT_UNFOUND 4/4239345 objects unfound (0.000%)
>>     pg 19.5 has 1 unfound objects
>>     pg 15.2f has 1 unfound objects
>>     pg 15.41 has 1 unfound objects
>>     pg 15.58 has 1 unfound objects
>> OSD_TOO_MANY_REPAIRS Too many repaired reads on 3 OSDs
>>     osd.9 had 9664 reads repaired
>>     osd.7 had 9665 reads repaired
>>     osd.4 had 12 reads repaired
>> PG_AVAILABILITY Reduced data availability: 7 pgs inactive, 7 pgs down
>>     pg 15.10 is down, acting [3,2,5,1,8,9]
>>     pg 15.1e is down, acting [5,1,9,8,2,3]
>>     pg 15.40 is down, acting [7,10,1,5,3,2]
>>     pg 15.4a is down, acting [0,3,5,6,9,10]
>>     pg 15.6a is down, acting [3,2,6,1,10,8]
>>     pg 15.71 is down, acting [3,2,1,6,8,10]
>>     pg 15.76 is down, acting [2,0,6,5,10,9]
>> PG_DAMAGED Possible data damage: 4 pgs recovery_unfound
>>     pg 15.2f is active+recovery_unfound+undersized+degraded+remapped,
>>         acting [5,1,0,3,2147483647,7], 1 unfound
>>     pg 15.41 is active+recovery_unfound+undersized+degraded+remapped,
>>         acting [5,1,0,3,2147483647,2147483647], 1 unfound
>>     pg 15.58 is active+recovery_unfound+undersized+degraded+remapped,
>>         acting [10,2147483647,2,3,1,5], 1 unfound
>>     pg 19.5 is active+recovery_unfound+degraded+repair,
>>         acting [3,2,5,1,8,10], 1 unfound
>> PG_DEGRADED Degraded data redundancy: 95807/24738783 objects degraded
>> (0.387%), 4 pgs degraded, 3 pgs undersized
>>     pg 15.2f is stuck undersized for 635305.932075, current state
>>         active+recovery_unfound+undersized+degraded+remapped,
>>         last acting [5,1,0,3,2147483647,7]
>>     pg 15.41 is stuck undersized for 364298.836902, current state
>>         active+recovery_unfound+undersized+degraded+remapped,
>>         last acting [5,1,0,3,2147483647,2147483647]
>>     pg 15.58 is stuck undersized for 384461.110229, current state
>>         active+recovery_unfound+undersized+degraded+remapped,
>>         last acting [10,2147483647,2,3,1,5]
>>     pg 19.5 is active+recovery_unfound+degraded+repair,
>>         acting [3,2,5,1,8,10], 1 unfound
>> PG_NOT_DEEP_SCRUBBED 7 pgs not deep-scrubbed in time
>>     pg 15.76 not deep-scrubbed since 2020-10-21 14:30:03.935228
>>     pg 15.71 not deep-scrubbed since 2020-10-21 12:20:46.235792
>>     pg 15.6a not deep-scrubbed since 2020-10-21 07:52:33.914083
>>     pg 15.10 not deep-scrubbed since 2020-10-22 03:24:40.465367
>>     pg 15.1e not deep-scrubbed since 2020-10-22 10:37:36.169959
>>     pg 15.40 not deep-scrubbed since 2020-10-23 05:33:35.208748
>>     pg 15.4a not deep-scrubbed since 2020-10-22 05:14:06.981035
>> PG_NOT_SCRUBBED 7 pgs not scrubbed in time
>>     pg 15.76 not scrubbed since 2020-10-24 08:12:40.090831
>>     pg 15.71 not scrubbed since 2020-10-25 05:22:40.573572
>>     pg 15.6a not scrubbed since 2020-10-24 15:03:09.189964
>>     pg 15.10 not scrubbed since 2020-10-24 16:25:08.826981
>>     pg 15.1e not scrubbed since 2020-10-24 16:05:03.080127
>>     pg 15.40 not scrubbed since 2020-10-24 11:58:04.290488
>>     pg 15.4a not scrubbed since 2020-10-24 11:32:44.573551
>>
>
> --
> Jeremy Austin
> jhaustin@xxxxxxxxx
>

--
Jeremy Austin
jhaustin@xxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx