On 3/13/20 5:44 PM, Peter Eisch wrote:
> On 3/13/20, 11:38 AM, "Wido den Hollander" <wido@xxxxxxxx> wrote:
>
> On 3/13/20 4:09 PM, Peter Eisch wrote:
>> Full cluster is 14.2.8.
>>
>> I had some OSDs drop overnight, which now results in 4 inactive PGs.
>> The pools had three participating OSDs (2 ssd, 1 sas). In each pool at
>> least 1 ssd and 1 sas OSD is working without issue. I’ve run ‘ceph pg
>> repair <pg>’ but it doesn’t seem to make any changes.
>>
>> PG_AVAILABILITY Reduced data availability: 4 pgs inactive, 4 pgs
>> incomplete
>> pg 10.2e is incomplete, acting [59,67]
>> pg 10.c3 is incomplete, acting [62,105]
>> pg 10.f3 is incomplete, acting [62,59]
>> pg 10.1d5 is incomplete, acting [87,106]
>>
>> Using `ceph pg <pg> query` I can see, for each PG, the OSDs involved,
>> including the ones which failed. Respectively they are:
>> pg 10.2e participants: 59, 68, 77, 143
>> pg 10.c3 participants: 60, 62, 85, 102, 105, 106
>> pg 10.f3 participants: 59, 64, 75, 107
>> pg 10.1d5 participants: 64, 77, 87, 106
>>
>> The OSDs which are now down/out, and which have been removed from the
>> crush map and from auth, are:
>> 62, 64, 68
>>
>> Of course I now have lots of slow-op reports from OSDs blocked on the
>> inactive PGs.
>>
>> How do I properly kick these PGs to have them drop their usage of the
>> OSDs which no longer exist?
>
> You don't. Because those OSDs hold the data you need.
>
> Why did you remove them from the CRUSHMap, OSDMap and auth? You need
> these to rebuild the PGs.
>
> Wido
>
> The drives failed at a hardware level. I've replaced OSDs this way in
> previous instances, after either planned migrations or failures, without
> issue. I didn't realize all the replicated copies were on just one drive
> in each pool.
>
> What should my actions have been in this case?
Try to get those OSDs online again. Maybe try to rescue the disks, or
see whether the OSDs can be made to start.
A tool like dd_rescue can help in getting such a thing done.
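
For example, a rough sketch of that approach (device names, OSD ids and
file paths below are placeholders, not taken from this thread):

  # clone the failing disk onto a healthy one with GNU ddrescue:
  # a fast first pass, then retries on the bad areas
  ddrescue -f -n /dev/sdX /dev/sdY /root/osd62-rescue.map
  ddrescue -f -d -r3 /dev/sdX /dev/sdY /root/osd62-rescue.map

  # with the OSD stopped, export an affected PG from the cloned disk ...
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-62 \
      --pgid 10.2e --op export --file /root/pg10.2e.export

  # ... and import it into another stopped OSD (one that does not
  # already hold this PG), then start that OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
      --op import --file /root/pg10.2e.export

Whether the export/import step is possible at all depends on how badly
the disks failed; if the OSD data can't be read, none of this will help.
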
>
> pool 10 'volumes' replicated size 2 min_size 1 crush_rule 1 object_hash
> rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 47570
> lfor 0/0/40781 flags hashpspool,selfmanaged_snaps stripe_width 0
> application rbd
I see you use 2x replication with min_size=1; that's dangerous and can
easily lead to data loss.
I wouldn't say it's impossible to get the data back, but something like
this can take a while (many hours) to be brought back online.
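
If you do get the data back, raising the pool's replication settings
would help avoid a repeat. A minimal sketch, using the pool name from
the dump above:

  ceph osd pool set volumes size 3
  ceph osd pool set volumes min_size 2

With min_size 2 the pool stops serving I/O when only one copy is left,
rather than carrying on and risking exactly this situation.
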
The three NVMe drives which failed within 10 minutes of each other spent the last day at Kroll/OnTrack for recovery. They can't do anything with them. Apparently they fell to a bug in the NVMe firmware which had been fixed, but the fix never got applied. (It might be worth noting that three more NVMe drives died within 48 hours before I could get them all 'out', but they staggered themselves so things could backfill.)
I'm willing to accept the data loss at this point for these four PGs. What can I do to zero these out, or even just tag them as complete, so we can get our filesystems back into service (and do due diligence with fsck/chkdsk/etc.)?
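
(For reference, the usual last-resort options for incomplete PGs look
roughly like the sketch below; the PG and OSD ids are just examples
taken from the listings that follow. Both options discard whatever data
only lived on the lost OSDs, so they are only worth running once
recovery of the dead drives is truly off the table.)

  # recreate a PG as empty, discarding its contents entirely
  ceph osd force-create-pg 10.2e --yes-i-really-mean-it

  # or, with the acting primary stopped, mark its copy of the PG
  # complete so peering can finish with whatever objects it still holds
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-84 \
      --pgid 10.c3 --op mark-complete
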
[@cephmon]# ceph pg ls incomplete
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP
10.c3 0 0 0 0 0 0 0 0 incomplete 16s 0'0 67570:16611 [84,119]p84 [84,119]p84 2020-03-13 00:06:12.356259 2020-03-11 13:04:17.124901
10.1d5 13882 0 0 0 58201653248 0 0 3063 incomplete 16s 48617'19136670 67570:76106823 [87,77]p87 [87,77]p87 2020-03-12 21:00:43.540659 2020-03-12 21:00:43.540659
* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilisation. See http://docs.ceph.com/docs/master/dev/placement-group/#omap-statistics for further details.
[@cephmon]# ceph pg ls down
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP
10.2e 88 0 0 0 373293056 0 0 3001 down 20s 49315'16087499 67570:33657 [77,143]p77 [77,143]p77 2020-03-12 07:55:04.030384 2020-03-05 10:26:43.183563
10.f3 244 0 0 0 1027604480 0 0 3015 down 20s 48741'18343076 67570:34213 [75,139]p75 [75,139]p75 2020-03-13 02:32:20.026885 2020-03-13 02:32:20.026885
* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilisation. See http://docs.ceph.com/docs/master/dev/placement-group/#omap-statistics for further details.
[@cephmon]#
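
For each of the four pg ids, the recovery_state section of `ceph pg <pg> query` should confirm which dead OSDs peering is still waiting on, roughly:

  ceph pg 10.2e query | grep -A3 -E 'down_osds_we_would_probe|peering_blocked_by'
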
Again, 62, 64 and 68 were the OSDs which died, and the cluster is clearly now trying to use others. And yes, I can bump the size to 3 going forward, but we need to get past these guys first.
What should be my next step?
Thanks!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx