Hi Igor,

Are you referring to these bug reports:

- https://tracker.ceph.com/issues/48276 | OSD Crash with ceph_assert(is_valid_io(off, len))
- https://tracker.ceph.com/issues/46800 | Octopus OSD died and fails to start with FAILED ceph_assert(is_valid_io(off, len))

If that is the case, do you think it is wise to revert to the bitmap allocator for up-to-date production clusters? For example, for Jeremy's remaining OSDs (https://tracker.ceph.com/issues/48276#note-6).

Kind regards,

Wout
42on

________________________________________
From: Igor Fedotov <ifedotov@xxxxxxx>
Sent: Monday, 14 December 2020 12:09
To: Jeremy Austin
Cc: ceph-users@xxxxxxx
Subject: Re: PGs down

Hi Jeremy,

I think you lost the data for OSD.11 & .12. I'm not aware of any reliable enough way to recover RocksDB from this sort of error.

Theoretically you might want to disable auto compaction for RocksDB on these daemons, try to bring them up, and then attempt to drain the data out of them to different OSDs. Since the log you shared shows an error during compaction, there is some chance that during regular operation the OSD wouldn't need the broken data (at least for some time). In fact I've never heard of anyone trying this approach, so this would be a pretty cutting-edge investigation... Honestly, the chance of 100% success is pretty low, but some additional data might be saved.

Back to the root causes of the DB corruption itself. As it looks like we have some data consistency issues with RocksDB in the latest Octopus and Nautilus releases, I'm currently trying to collect stats on the known cases. Hence I'd highly appreciate it if you could answer the following questions:

1) Have I understood correctly that the hardware issue happened on the same node where OSD.11 & .12 are located? Or are they on a different node, but crashed after the hardware failure happened on that node and have been unable to start since then?

2) If they're on the same node - do they have standalone DB/WAL volumes? If so, have you checked those for hardware failures as well?

3) Not sure if it makes sense, but just in case - have you checked the dmesg output for any disk errors as well?

4) Have you performed a Ceph upgrade recently? Or, more generally, was the cluster deployed with the current Ceph version, or was it an earlier one?

Thanks,
Igor

On 12/14/2020 5:05 AM, Jeremy Austin wrote:
> OSD 12 looks much the same. I don't have logs back to the original
> date, but this looks very similar — db/sst corruption. The standard
> fsck approaches couldn't fix it. I believe it was a form of ATA
> failure — OSD 11 and 12, if I recall correctly, did not actually
> experience SMARTD-reportable errors. (Essentially, fans died on an
> internal SATA enclosure. As the enclosure had no sensor mechanism, I
> didn't realize it until drive temps started to climb. I believe most
> of the drives survived OK, but the enclosure itself I ultimately had
> to completely bypass, even after replacing fans.)
>
> My assumption, once ceph fsck approaches failed, was that I'd need to
> mark 11 and 12 (and maybe 4) as lost, but I was reluctant to do so
> until I confirmed that I had absolutely lost data beyond recall.
>
> On Sat, Dec 12, 2020 at 10:24 PM Igor Fedotov <ifedotov@xxxxxxx
> <mailto:ifedotov@xxxxxxx>> wrote:
>
> Hi Jeremy,
>
> wondering what the OSDs' logs showed when they crashed for the first
> time?
>
> And does OSD.12 report a similar problem right now:
>
> 3> 2020-12-12 20:23:45.756 7f2d21404700 -1 rocksdb: submit_common
> error: Corruption: block checksum mismatch: expected 3113305400,
> got 1242690251 in db/000348.sst offset 47935290 size 4704 code = 2
> Rocksdb transaction:
>
> ?
>
> Thanks,
> Igor
>
> On 12/13/2020 8:48 AM, Jeremy Austin wrote:
>> I could use some input from more experienced folks…
>>
>> First time seeing this behavior. I've been running ceph in production
>> (replicated) since 2016 or earlier.
>>
>> This, however, is a small 3-node cluster for testing EC. Crush map rules
>> should sustain the loss of an entire node.
>> Here's the EC rule:
>>
>> rule cephfs425 { id 6 type erasure min_size 3 max_size 6 step
>> set_chooseleaf_tries 40 step set_choose_tries 400 step take default step
>> choose indep 3 type host step choose indep 2 type osd step emit }
>>
>> I had actual hardware failure on one node. Interestingly, this appears to
>> have resulted in data loss. OSDs began to crash in a cascade on other nodes
>> (i.e., nodes with no known hardware failure). Not a low-RAM problem.
>>
>> I could use some pointers about how to get the down PGs back up — I *think*
>> there are enough EC shards, even disregarding the OSDs that crash on start.
>>
>> nautilus 14.2.15
>>
>> ceph osd tree
>> ID  CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
>>  -1       54.75960 root default
>> -10       16.81067     host sumia
>>   1   hdd  5.57719         osd.1      up  1.00000 1.00000
>>   5   hdd  5.58469         osd.5      up  1.00000 1.00000
>>   6   hdd  5.64879         osd.6      up  1.00000 1.00000
>>  -7       16.73048     host sumib
>>   0   hdd  5.57899         osd.0      up  1.00000 1.00000
>>   2   hdd  5.56549         osd.2      up  1.00000 1.00000
>>   3   hdd  5.58600         osd.3      up  1.00000 1.00000
>>  -3       21.21844     host tower1
>>   4   hdd  3.71680         osd.4      up        0 1.00000
>>   7   hdd  1.84799         osd.7      up  1.00000 1.00000
>>   8   hdd  3.71680         osd.8      up  1.00000 1.00000
>>   9   hdd  1.84929         osd.9      up  1.00000 1.00000
>>  10   hdd  2.72899         osd.10     up  1.00000 1.00000
>>  11   hdd  3.71989         osd.11   down        0 1.00000
>>  12   hdd  3.63869         osd.12   down        0 1.00000
>>
>>   cluster:
>>     id:     d0b4c175-02ba-4a64-8040-eb163002cba6
>>     health: HEALTH_ERR
>>             1 MDSs report slow requests
>>             4/4239345 objects unfound (0.000%)
>>             Too many repaired reads on 3 OSDs
>>             Reduced data availability: 7 pgs inactive, 7 pgs down
>>             Possible data damage: 4 pgs recovery_unfound
>>             Degraded data redundancy: 95807/24738783 objects degraded
>> (0.387%), 4 pgs degraded, 3 pgs undersized
>>             7 pgs not deep-scrubbed in time
>>             7 pgs not scrubbed in time
>>
>>   services:
>>     mon: 3 daemons, quorum sumib,tower1,sumia (age 4d)
>>     mgr: sumib(active, since 7d), standbys: sumia, tower1
>>     mds: cephfs:1 {0=sumib=up:active} 2 up:standby
>>     osd: 13 osds: 11 up (since 3d), 10 in (since 4d); 3 remapped pgs
>>
>>   data:
>>     pools:   5 pools, 256 pgs
>>     objects: 4.24M objects, 15 TiB
>>     usage:   24 TiB used, 24 TiB / 47 TiB avail
>>     pgs:     2.734% pgs not active
>>              95807/24738783 objects degraded (0.387%)
>>              47910/24738783 objects misplaced (0.194%)
>>              4/4239345 objects unfound (0.000%)
>>              245 active+clean
>>              7   down
>>              3   active+recovery_unfound+undersized+degraded+remapped
>>              1   active+recovery_unfound+degraded+repair
>>
>>   progress:
>>     Rebalancing after osd.12 marked out
>>       [============================..]
>>     Rebalancing after osd.4 marked out
>>       [=============================.]
>>
>> A snippet from an example down pg:
>>
>>     "up": [
>>         3,
>>         2,
>>         5,
>>         1,
>>         8,
>>         9
>>     ],
>>     "acting": [
>>         3,
>>         2,
>>         5,
>>         1,
>>         8,
>>         9
>>     ],
>>     <snip>
>>     ],
>>     "blocked": "peering is blocked due to down osds",
>>     "down_osds_we_would_probe": [
>>         11,
>>         12
>>     ],
>>     "peering_blocked_by": [
>>         {
>>             "osd": 11,
>>             "current_lost_at": 0,
>>             "comment": "starting or marking this osd lost may let
>> us proceed"
>>         },
>>         {
>>             "osd": 12,
>>             "current_lost_at": 0,
>>             "comment": "starting or marking this osd lost may let
>> us proceed"
>>         }
>>     ]
>> },
>> {
>>
>> Oddly, these OSDs possibly did NOT experience hardware failure. However,
>> they won't start -- see the pastebin for ceph-osd.11.log:
>>
>> https://pastebin.com/6U6sQJuJ
>>
>> HEALTH_ERR 1 MDSs report slow requests; 4/4239345 objects unfound (0.000%);
>> Too many repaired reads on 3 OSDs; Reduced data availability:
>> 7 pgs inactive, 7 pgs down; Possible data damage: 4 pgs recovery_unfound;
>> Degraded data redundancy: 95807/24738783 objects degraded (0.387%),
>> 4 pgs degraded, 3 pgs undersized; 7 pgs not deep-scrubbed in time;
>> 7 pgs not scrubbed in time
>> MDS_SLOW_REQUEST 1 MDSs report slow requests
>>     mdssumib(mds.0): 42 slow requests are blocked > 30 secs
>> OBJECT_UNFOUND 4/4239345 objects unfound (0.000%)
>>     pg 19.5 has 1 unfound objects
>>     pg 15.2f has 1 unfound objects
>>     pg 15.41 has 1 unfound objects
>>     pg 15.58 has 1 unfound objects
>> OSD_TOO_MANY_REPAIRS Too many repaired reads on 3 OSDs
>>     osd.9 had 9664 reads repaired
>>     osd.7 had 9665 reads repaired
>>     osd.4 had 12 reads repaired
>> PG_AVAILABILITY Reduced data availability: 7 pgs inactive, 7 pgs down
>>     pg 15.10 is down, acting [3,2,5,1,8,9]
>>     pg 15.1e is down, acting [5,1,9,8,2,3]
>>     pg 15.40 is down, acting [7,10,1,5,3,2]
>>     pg 15.4a is down, acting [0,3,5,6,9,10]
>>     pg 15.6a is down, acting [3,2,6,1,10,8]
>>     pg 15.71 is down, acting [3,2,1,6,8,10]
>>     pg 15.76 is down, acting [2,0,6,5,10,9]
>> PG_DAMAGED Possible data damage: 4 pgs recovery_unfound
>>     pg 15.2f is active+recovery_unfound+undersized+degraded+remapped,
>> acting [5,1,0,3,2147483647,7], 1 unfound
>>     pg 15.41 is active+recovery_unfound+undersized+degraded+remapped,
>> acting [5,1,0,3,2147483647,2147483647], 1 unfound
>>     pg 15.58 is active+recovery_unfound+undersized+degraded+remapped,
>> acting [10,2147483647,2,3,1,5], 1 unfound
>>     pg 19.5 is active+recovery_unfound+degraded+repair, acting
>> [3,2,5,1,8,10], 1 unfound
>> PG_DEGRADED Degraded data redundancy: 95807/24738783 objects degraded
>> (0.387%), 4 pgs degraded, 3 pgs undersized
>>     pg 15.2f is stuck undersized for 635305.932075, current state
>> active+recovery_unfound+undersized+degraded+remapped, last acting
>> [5,1,0,3,2147483647,7]
>>     pg 15.41 is stuck undersized for 364298.836902, current state
>> active+recovery_unfound+undersized+degraded+remapped, last acting
>> [5,1,0,3,2147483647,2147483647]
>>     pg 15.58 is stuck undersized for 384461.110229, current state
>> active+recovery_unfound+undersized+degraded+remapped, last acting
>> [10,2147483647,2,3,1,5]
>>     pg 19.5 is active+recovery_unfound+degraded+repair, acting
>> [3,2,5,1,8,10], 1 unfound
>> PG_NOT_DEEP_SCRUBBED 7 pgs not deep-scrubbed in time
>>     pg 15.76 not deep-scrubbed since 2020-10-21 14:30:03.935228
>>     pg 15.71 not deep-scrubbed since 2020-10-21 12:20:46.235792
>>     pg 15.6a not deep-scrubbed since 2020-10-21 07:52:33.914083
>>     pg 15.10 not deep-scrubbed since 2020-10-22 03:24:40.465367
>>     pg 15.1e not deep-scrubbed since 2020-10-22 10:37:36.169959
>>     pg 15.40 not deep-scrubbed since 2020-10-23 05:33:35.208748
>>     pg 15.4a not deep-scrubbed since 2020-10-22 05:14:06.981035
>> PG_NOT_SCRUBBED 7 pgs not scrubbed in time
>>     pg 15.76 not scrubbed since 2020-10-24 08:12:40.090831
>>     pg 15.71 not scrubbed since 2020-10-25 05:22:40.573572
>>     pg 15.6a not scrubbed since 2020-10-24 15:03:09.189964
>>     pg 15.10 not scrubbed since 2020-10-24 16:25:08.826981
>>     pg 15.1e not scrubbed since 2020-10-24 16:05:03.080127
>>     pg 15.40 not scrubbed since 2020-10-24 11:58:04.290488
>>     pg 15.4a not scrubbed since 2020-10-24 11:32:44.573551
>
> --
> Jeremy Austin
> jhaustin@xxxxxxxxx <mailto:jhaustin@xxxxxxxxx>

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
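[Editor's note] For readers following the thread, below is a minimal sketch of the commands implied by Wout's allocator question and Igor's compaction/drain suggestion. It is not a verified recovery procedure from the thread: the OSD and pg ids are taken from Jeremy's output above, "<existing options>" is a placeholder for the real default value of bluestore_rocksdb_options on your release, and every step should be weighed against the risk of making things worse.

# (Wout) Switch BlueStore to the bitmap allocator cluster-wide; takes effect
# only after each OSD is restarted.
ceph config set osd bluestore_allocator bitmap

# (Igor) Try to bring a broken OSD up with RocksDB auto-compaction disabled.
# Check the current option string first so the defaults are preserved;
# "<existing options>" is a placeholder, not a literal value.
ceph config get osd.11 bluestore_rocksdb_options
ceph config set osd.11 bluestore_rocksdb_options "<existing options>,disable_auto_compactions=true"
systemctl restart ceph-osd@11        # run on the OSD's host

# If the OSD stays up, mark it out so recovery drains its data to other OSDs.
ceph osd out 11

# Only as a last resort, once the data is confirmed unrecoverable: mark the
# OSD lost so the down PGs can peer, then deal with any unfound objects
# (for EC pools only "delete" is supported, which discards those objects).
ceph osd lost 11 --yes-i-really-mean-it
ceph pg 15.10 query                      # confirm peering is no longer blocked
ceph pg 15.2f mark_unfound_lost delete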