On 3/13/20 5:44 PM, Peter Eisch wrote:
>
> On 3/13/20, 11:38 AM, "Wido den Hollander" <wido@xxxxxxxx> wrote:
>
> On 3/13/20 4:09 PM, Peter Eisch wrote:
>> Full cluster is 14.2.8.
>>
>> I had some OSDs drop overnight, which now leaves 4 inactive PGs. The
>> pools had three participating OSDs each (2 ssd, 1 sas). In each pool at
>> least 1 ssd and 1 sas OSD is working without issue. I've run
>> 'ceph pg repair <pg>' but it doesn't seem to make any changes.
>>
>> PG_AVAILABILITY Reduced data availability: 4 pgs inactive, 4 pgs incomplete
>>     pg 10.2e is incomplete, acting [59,67]
>>     pg 10.c3 is incomplete, acting [62,105]
>>     pg 10.f3 is incomplete, acting [62,59]
>>     pg 10.1d5 is incomplete, acting [87,106]
>>
>> Using 'ceph pg <pg> query' I can see, in each case, the OSDs involved,
>> including the ones which failed. Respectively they are:
>>     pg 10.2e  participants: 59, 68, 77, 143
>>     pg 10.c3  participants: 60, 62, 85, 102, 105, 106
>>     pg 10.f3  participants: 59, 64, 75, 107
>>     pg 10.1d5 participants: 64, 77, 87, 106
>>
>> The OSDs which are now down/out, and which have been removed from the
>> CRUSH map and from auth, are: 62, 64, 68.
>>
>> Of course I now have lots of reports of slow ops from OSDs worried
>> about the inactive PGs.
>>
>> How do I properly kick these PGs to have them drop their usage of the
>> OSDs which no longer exist?
>
> You don't, because those OSDs hold the data you need.
>
> Why did you remove them from the CRUSH map, OSD map and auth? You need
> these to rebuild the PGs.
>
> Wido
>
> The drives failed at a hardware level. I've replaced OSDs like this in
> previous instances, after either planned migration or failure, without
> issue. I didn't realize all the replicated copies were on just one drive
> in each pool.
>
> What should my actions have been in this case?

Try to get those OSDs online again. Maybe try a rescue of the disks, or
see whether the OSDs can be made to start. A tool like dd_rescue can help
in getting such a thing done.
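A minimal sketch of that rescue path, assuming GNU ddrescue (the syntax
below is ddrescue's; dd_rescue takes different options), that the failed
disk behind osd.62 shows up as /dev/sdX, a spare disk of at least the same
size is /dev/sdY, and that the cloned OSD can afterwards be mounted or
activated at /var/lib/ceph/osd/ceph-62. Device names, the PG id 10.c3 and
the file paths are placeholders; FileStore OSDs would also need
--journal-path.

  # 1. Clone the failing disk onto a healthy one, skipping bad sectors
  #    (-f: allow writing to a block device, -n: skip the scraping pass).
  ddrescue -f -n /dev/sdX /dev/sdY /root/osd62-rescue.map

  # 2. With no OSD process running against it, try to export the
  #    incomplete PG from the rescued OSD's data directory...
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-62 \
      --pgid 10.c3 --op export --file /root/pg-10.c3.export

  # 3. ...and import it into a surviving OSD that is acting for that PG,
  #    then start that OSD again and let peering finish.
  systemctl stop ceph-osd@105
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-105 \
      --op import --file /root/pg-10.c3.export
  systemctl start ceph-osd@105

Whether this works depends on how much of the disk ddrescue can actually
recover; the export fails if the PG's metadata is unreadable.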
> pool 10 'volumes' replicated size 2 min_size 1 crush_rule 1 object_hash
> rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 47570
> lfor 0/0/40781 flags hashpspool,selfmanaged_snaps stripe_width 0
> application rbd

I see you use 2x replication with min_size=1; that's dangerous and can
easily lead to data loss. I wouldn't say it's impossible to get the data
back, but something like this can take a while (a lot of hours) to be
brought back online.

Wido

> Crush rule 1:
> rule ssd_by_host {
>         id 1
>         type replicated
>         min_size 1
>         max_size 10
>         step take default class ssd
>         step chooseleaf firstn 0 type host
>         step emit
> }
>
> peter
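As a follow-up to the min_size point above, a minimal sketch of how the
pool could be tightened once the cluster is healthy again. The pool name
'volumes' is taken from the dump above; run this only when enough OSDs are
up and in to hold a third copy under the ssd_by_host rule.

  # Move the pool to 3 copies and require at least 2 to accept I/O.
  ceph osd pool set volumes size 3
  ceph osd pool set volumes min_size 2

  # Verify the new settings.
  ceph osd pool get volumes size
  ceph osd pool get volumes min_size

Raising size triggers backfill to create the extra replicas, so expect
additional recovery I/O until that completes.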