Re: Inactive PGs

On 3/13/20, 11:38 AM, "Wido den Hollander" <wido@xxxxxxxx> wrote:



On 3/13/20 4:09 PM, Peter Eisch wrote:
> Full cluster is 14.2.8.
>
> I had some OSDs drop overnight, which now leaves 4 inactive PGs. The
> pools had three participating OSDs each (2 ssd, 1 sas). In each pool at
> least 1 ssd and 1 sas OSD is still working without issue. I've run
> 'ceph pg repair <pg>' but it doesn't seem to make any changes.
>
> PG_AVAILABILITY Reduced data availability: 4 pgs inactive, 4 pgs incomplete
> pg 10.2e is incomplete, acting [59,67]
> pg 10.c3 is incomplete, acting [62,105]
> pg 10.f3 is incomplete, acting [62,59]
> pg 10.1d5 is incomplete, acting [87,106]
>
> Using `ceph pg <pg> query` I can see the full set of participating OSDs
> for each PG, including the ones which failed. Respectively they are:
> pg 10.2e participants: 59, 68, 77, 143
> pg 10.c3 participants: 60, 62, 85, 102, 105, 106
> pg 10.f3 participants: 59, 64, 75, 107
> pg 10.1d5 participants: 64, 77, 87, 106
>
> The OSDs which are now down/out and have been removed from the CRUSH map
> and from auth are:
> 62, 64, 68
>
> Of course I now have lots of slow request reports from the OSDs stuck
> waiting on the inactive PGs.
>
> How do I properly kick these PGs to have them drop their usage of the
> OSDs which no longer exist?

You don't, because those OSDs hold the data you need.

Why did you remove them from the CRUSH map, OSD map and auth? You need
them to rebuild the PGs.
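
If the failed disks are still readable at all, ceph-objectstore-tool may
let you export the PG data from the old OSDs and import it into a
surviving one. A rough sketch only, with example paths and IDs (the OSDs
involved must be stopped while the tool runs):

    # on the host with the failed-but-readable disk, OSD stopped:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-62 \
        --pgid 10.c3 --op export --file /tmp/10.c3.export

    # on the host with a surviving acting OSD, also stopped:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-105 \
        --op import --file /tmp/10.c3.export

Start the OSDs again afterwards and let peering and recovery run.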

Wido

The drives failed at the hardware level. I've replaced OSDs this way in previous instances, either as a planned migration or after a failure, without issue. I didn't realize all the replicated copies were on just one drive in each pool.

What should my actions have been in this case?

pool 10 volumes' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 47570 lfor 0/0/40781 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd

Crush rule 1:
rule ssd_by_host {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 0 type host
        step emit
}
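
Assuming I get these PGs healthy again, I'm guessing the follow-up is to
raise the replication on this pool so one bad drive can't take a PG
inactive again, something along the lines of (not yet run here):

    ceph osd pool set volumes size 3
    ceph osd pool set volumes min_size 2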

peter

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
