Re: PGs down

Igor Fedotov <ifedotov@xxxxxxx> · Tue, 15 Dec 2020 14:44:44 +0300

Hi Wout,

On 12/15/2020 1:18 PM, Wout van Heeswijk wrote:
Hi Igor,

Are you referring to the bug reports:

- https://tracker.ceph.com/issues/48276 | OSD Crash with ceph_assert(is_valid_io(off, len))
- https://tracker.ceph.com/issues/46800 | Octopus OSD died and fails to start with FAILED ceph_assert(is_valid_io(off, len))

Not exactly. The cause for the above tickets is more or less clear - 
invalid target physical extent attached to write I/O. Most probably the 
target physical offset is out of bounds. The exact bug is still unclear 
though. But given this happens for recent upgrades/deployments only it 
rather relates to new hybrid allocator which became a default recently.

What I meant originally are issues like that:

https://tracker.ceph.com/issues/48443

IIRC there were some similar reports in this mailing list.

Additionally we've had some suspicious reports from our customers saying 
upgrade and/or non-graceful shutdown might result in a broken DB.

But this is pretty unsorted so far will appreciate if users share more 
cases...

General symptoms so far - RocksDB starts to report checksum failures 
with no evident physical media errors after recent upgrade and/or power 
shutdown.

Most of the time such a failure results in OSD unable to start which is 
unfortunately an unrecoverable state AFAIK.

If that is the case, do you think it is wise to revert to the bitmap allocator for up-to-date production clusters? For example for Jeremy's remaining OSDs. (https://tracker.ceph.com/issues/48276#note-6)

As the working hypothesis for "is_valid_io()" assertion assumes that 
hybrid allocator is the culprit it might make sense to revert back to 
bitmap allocator. Particularly for ones who has been suffered from this 
multiple time.

But one should realize that it's still an unverified hypothesis hence I 
wouldn't recommend to unconditionally follow the above advise for 
everybody. And this will definitely complicate the root cause 
investigation.. ;)

Kind regards,

Wout
42on

________________________________________
From: Igor Fedotov <ifedotov@xxxxxxx>
Sent: Monday, 14 December 2020 12:09
To: Jeremy Austin
Cc: ceph-users@xxxxxxx
Subject:  Re: PGs down

Hi Jeremy,

I think you lost the data for OSD.11 & .12  I'm not aware of any
reliable enough way to recover RocksDB from this sort of errors.

Theoretically you might want to disable auto compaction for RocksDB for
these daemons and try to bring then up and attempt to drain the data out
of them to different OSDs then. As currently the log you shared shows
error during compaction there is some chance that during regular
operation OSD wouldn't need broken data (at least for some time). In
fact I've never heard someone tried this approach so this would be a
pretty cutting edge investigation...

Honestly the chance of 100% success is pretty low but some additional
data might be saved.

Back to DB corruption root causes itself.

As it looks like we have some data consistency issues with RocksDB in
the latest Octopus and Nautilus releases I'm currently trying to collect
the stats for the known cases. Hence I'd highly appreciate if you answer
the following questions

1) Have I got this properly that hardware issue happened to the same
node where OSD.11 & .12 are located? Or they are at a different one but
crashed after hardware failure happened to that node and were unable to
start since then?

2) If they're at the same node - do they have standalone DB/WAL volumes?
If so have you checked them for hardware failures as well?

3) Not sure if it makes sense but just in case - have you checked dmesg
output for any disk errors as well?

4) Haven't you performed Ceph upgrade recently. Or more generally - was
the cluster deployed with the current Ceph version or it was an earlier one?

Thanks,

Igor

On 12/14/2020 5:05 AM, Jeremy Austin wrote:
OSD 12 looks much the same.I don't have logs back to the original
date, but this looks very similar — db/sst corruption. The standard
fsck approaches couldn't fix it. I believe it was a form of ATA
failure — OSD 11 and 12, if I recall correctly, did not actually
experience SMARTD-reportable errors. (Essentially, fans died on an
internal SATA enclosure. As the enclosure had no sensor mechanism, I
didn't realize it until drive temps started to climb. I believe most
of the drives survived OK, but the enclosure itself I ultimately had
to completely bypass, even after replacing fans.)

My assumption, once ceph fsck approaches failed, was that I'd need to
mark 11 and 12 (and maybe 4) as lost, but I was reluctant to do so
until I confirmed that I had absolutely lost data beyond recall.

On Sat, Dec 12, 2020 at 10:24 PM Igor Fedotov <ifedotov@xxxxxxx
<mailto:ifedotov@xxxxxxx>> wrote:

     Hi Jeremy,

     wondering what were the OSDs' logs when they crashed for the first
     time?

     And does OSD.12 reports the similar problem for now:

     3> 2020-12-12 20:23:45.756 7f2d21404700 -1 rocksdb: submit_common
     error: Corruption: block checksum mismatch: expected 3113305400,
     got 1242690251 in db/000348.sst offset 47935290 size 4704 code = 2
     Rocksdb transaction:

     ?

     Thanks,
     Igor
     On 12/13/2020 8:48 AM, Jeremy Austin wrote:
     I could use some input from more experienced folks…

     First time seeing this behavior. I've been running ceph in production
     (replicated) since 2016 or earlier.

     This, however, is a small 3-node cluster for testing EC. Crush map rules
     should sustain the loss of an entire node.
     Here's the EC rule:

     rule cephfs425 { id 6 type erasure min_size 3 max_size 6 step
     set_chooseleaf_tries 40 step set_choose_tries 400 step take default step
     choose indep 3 type host step choose indep 2 type osd step emit }

     I had actual hardware failure on one node. Interestingly, this appears to
     have resulted in data loss. OSDs began to crash in a cascade on other nodes
     (i.e., nodes with no known hardware failure). Not a low RAM problem.

     I could use some pointers about how to get the down PGs back up — I *think*
     there are enough EC shards, even disregarding the OSDs that crash on start.

     nautilus 14.2.15

       ceph osd tree
     ID  CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
       -1       54.75960 root default
     -10       16.81067     host sumia
        1   hdd  5.57719         osd.1       up  1.00000 1.00000
        5   hdd  5.58469         osd.5       up  1.00000 1.00000
        6   hdd  5.64879         osd.6       up  1.00000 1.00000
       -7       16.73048     host sumib
        0   hdd  5.57899         osd.0       up  1.00000 1.00000
        2   hdd  5.56549         osd.2       up  1.00000 1.00000
        3   hdd  5.58600         osd.3       up  1.00000 1.00000
       -3       21.21844     host tower1
        4   hdd  3.71680         osd.4       up        0 1.00000
        7   hdd  1.84799         osd.7       up  1.00000 1.00000
        8   hdd  3.71680         osd.8       up  1.00000 1.00000
        9   hdd  1.84929         osd.9       up  1.00000 1.00000
       10   hdd  2.72899         osd.10      up  1.00000 1.00000
       11   hdd  3.71989         osd.11    down        0 1.00000
       12   hdd  3.63869         osd.12    down        0 1.00000

        cluster:
          id:     d0b4c175-02ba-4a64-8040-eb163002cba6
          health: HEALTH_ERR
                  1 MDSs report slow requests
                  4/4239345 objects unfound (0.000%)
                  Too many repaired reads on 3 OSDs
                  Reduced data availability: 7 pgs inactive, 7 pgs down
                  Possible data damage: 4 pgs recovery_unfound
                  Degraded data redundancy: 95807/24738783 objects degraded
     (0.387%), 4 pgs degraded, 3 pgs undersized
                  7 pgs not deep-scrubbed in time
                  7 pgs not scrubbed in time

        services:
          mon: 3 daemons, quorum sumib,tower1,sumia (age 4d)
          mgr: sumib(active, since 7d), standbys: sumia, tower1
          mds: cephfs:1 {0=sumib=up:active} 2 up:standby
          osd: 13 osds: 11 up (since 3d), 10 in (since 4d); 3 remapped pgs

        data:
          pools:   5 pools, 256 pgs
          objects: 4.24M objects, 15 TiB
          usage:   24 TiB used, 24 TiB / 47 TiB avail
          pgs:     2.734% pgs not active
                   95807/24738783 objects degraded (0.387%)
                   47910/24738783 objects misplaced (0.194%)
                   4/4239345 objects unfound (0.000%)
                   245 active+clean
                   7   down
                   3   active+recovery_unfound+undersized+degraded+remapped
                   1   active+recovery_unfound+degraded+repair

        progress:
          Rebalancing after osd.12 marked out
            [============================..]
          Rebalancing after osd.4 marked out
            [=============================.]

     An snipped from an example down pg:
          "up": [
              3,
              2,
              5,
              1,
              8,
              9
          ],
          "acting": [
              3,
              2,
              5,
              1,
              8,
              9
          ],
     <snip>
               ],
                  "blocked": "peering is blocked due to down osds",
                  "down_osds_we_would_probe": [
                      11,
                      12
                  ],
                  "peering_blocked_by": [
                      {
                          "osd": 11,
                          "current_lost_at": 0,
                          "comment": "starting or marking this osd lost may let
     us proceed"
                      },
                      {
                          "osd": 12,
                          "current_lost_at": 0,
                          "comment": "starting or marking this osd lost may let
     us proceed"
                      }
                  ]
              },
              {

     Oddly, these OSDs possibly did NOT experience hardware failure. However,
     they won't start -- see pastebin for ceph-osd.11.log

     https://pastebin.com/6U6sQJuJ  <https://pastebin.com/6U6sQJuJ>

     HEALTH_ERR 1 MDSs report slow requests; 4/4239345 objects unfound (0.000%);
     Too many repaired reads on 3 OSDs; Reduced data availability
     : 7 pgs inactive, 7 pgs down; Possible data damage: 4 pgs recovery_unfound;
     Degraded data redundancy: 95807/24738783 objects degraded (0
     .387%), 4 pgs degraded, 3 pgs undersized; 7 pgs not deep-scrubbed in time;
     7 pgs not scrubbed in time
     MDS_SLOW_REQUEST 1 MDSs report slow requests
          mdssumib(mds.0): 42 slow requests are blocked > 30 secs
     OBJECT_UNFOUND 4/4239345 objects unfound (0.000%)
          pg 19.5 has 1 unfound objects
          pg 15.2f has 1 unfound objects
          pg 15.41 has 1 unfound objects
          pg 15.58 has 1 unfound objects
     OSD_TOO_MANY_REPAIRS Too many repaired reads on 3 OSDs
          osd.9 had 9664 reads repaired
          osd.7 had 9665 reads repaired
          osd.4 had 12 reads repaired
     PG_AVAILABILITY Reduced data availability: 7 pgs inactive, 7 pgs down
          pg 15.10 is down, acting [3,2,5,1,8,9]
          pg 15.1e is down, acting [5,1,9,8,2,3]
          pg 15.40 is down, acting [7,10,1,5,3,2]
          pg 15.4a is down, acting [0,3,5,6,9,10]
          pg 15.6a is down, acting [3,2,6,1,10,8]
          pg 15.71 is down, acting [3,2,1,6,8,10]
          pg 15.76 is down, acting [2,0,6,5,10,9]
     PG_DAMAGED Possible data damage: 4 pgs recovery_unfound
          pg 15.2f is active+recovery_unfound+undersized+degraded+remapped,
     acting [5,1,0,3,2147483647,7], 1 unfound
          pg 15.41 is active+recovery_unfound+undersized+degraded+remapped,
     acting [5,1,0,3,2147483647,2147483647], 1 unfound
          pg 15.58 is active+recovery_unfound+undersized+degraded+remapped,
     acting [10,2147483647,2,3,1,5], 1 unfound
          pg 19.5 is active+recovery_unfound+degraded+repair, acting
     [3,2,5,1,8,10], 1 unfound
     PG_DEGRADED Degraded data redundancy: 95807/24738783 objects degraded
     (0.387%), 4 pgs degraded, 3 pgs undersized
          pg 15.2f is stuck undersized for 635305.932075, current state
     active+recovery_unfound+undersized+degraded+remapped, last acting
     [5,1,0,3,2147483647,7]
          pg 15.41 is stuck undersized for 364298.836902, current state
     active+recovery_unfound+undersized+degraded+remapped, last acting
     [5,1,0,3,2147483647,2147483647]
          pg 15.58 is stuck undersized for 384461.110229, current state
     active+recovery_unfound+undersized+degraded+remapped, last acting
     [10,2147483647,2,3,1,5]
          pg 19.5 is active+recovery_unfound+degraded+repair, acting
     [3,2,5,1,8,10], 1 unfound
     PG_NOT_DEEP_SCRUBBED 7 pgs not deep-scrubbed in time
          pg 15.76 not deep-scrubbed since 2020-10-21 14:30:03.935228
          pg 15.71 not deep-scrubbed since 2020-10-21 12:20:46.235792
          pg 15.6a not deep-scrubbed since 2020-10-21 07:52:33.914083
          pg 15.10 not deep-scrubbed since 2020-10-22 03:24:40.465367
          pg 15.1e not deep-scrubbed since 2020-10-22 10:37:36.169959
          pg 15.40 not deep-scrubbed since 2020-10-23 05:33:35.208748
          pg 15.4a not deep-scrubbed since 2020-10-22 05:14:06.981035
     PG_NOT_SCRUBBED 7 pgs not scrubbed in time
          pg 15.76 not scrubbed since 2020-10-24 08:12:40.090831
          pg 15.71 not scrubbed since 2020-10-25 05:22:40.573572
          pg 15.6a not scrubbed since 2020-10-24 15:03:09.189964
          pg 15.10 not scrubbed since 2020-10-24 16:25:08.826981
          pg 15.1e not scrubbed since 2020-10-24 16:05:03.080127
          pg 15.40 not scrubbed since 2020-10-24 11:58:04.290488
          pg 15.4a not scrubbed since 2020-10-24 11:32:44.573551

--
Jeremy Austin
jhaustin@xxxxxxxxx <mailto:jhaustin@xxxxxxxxx>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

Thanks,

Igor
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx