On 3/13/20 5:44 PM, Peter Eisch wrote:
> On 3/13/20, 11:38 AM, "Wido den Hollander" <wido@xxxxxxxx> wrote:
>
> On 3/13/20 4:09 PM, Peter Eisch wrote:
>> Full cluster is 14.2.8.
>>
>> I had some OSDs drop overnight, which now results in 4 inactive PGs.
>> The pools had three participating OSDs (2 ssd, 1 sas). In each pool at
>> least 1 ssd and 1 sas OSD is working without issue. I’ve run ‘ceph pg
>> repair <pg>’ but it doesn’t seem to make any changes.
>>
>> PG_AVAILABILITY Reduced data availability: 4 pgs inactive, 4 pgs
>> incomplete
>> pg 10.2e is incomplete, acting [59,67]
>> pg 10.c3 is incomplete, acting [62,105]
>> pg 10.f3 is incomplete, acting [62,59]
>> pg 10.1d5 is incomplete, acting [87,106]
>>
>> Using `ceph pg <pg> query` I can see, for each PG, the OSDs involved,
>> including the ones which failed. Respectively they are:
>> pg 10.2e participants: 59, 68, 77, 143
>> pg 10.c3 participants: 60, 62, 85, 102, 105, 106
>> pg 10.f3 participants: 59, 64, 75, 107
>> pg 10.1d5 participants: 64, 77, 87, 106
>>
>> The OSDs which are now down/out, and which have been removed from the
>> crush map and from auth, are:
>> 62, 64, 68
>>
>> Of course I now have lots of slow-op reports from OSDs blocked on the
>> inactive PGs.
>>
>> How do I properly kick these PGs to have them drop their usage of the
>> OSDs which no longer exist?
>
> You don't. Because those OSDs hold the data you need.
>
> Why did you remove them from the CRUSHMap, OSDMap and auth? You need
> these to rebuild the PGs.
>
> Wido
>
> The drives failed at a hardware level. I've replaced OSDs this way in
> previous instances, after either planned migrations or failures, without
> issue. I didn't realize all the replicated copies were on just one drive
> in each pool.
>
> What should my actions have been in this case?
Try to get those OSDs online again. Maybe try to rescue the disks, or
see whether the OSDs can be made to start.
A tool like dd_rescue can help in getting such a thing done.
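
For example, a rough sketch of that approach (device names, OSD ids and
file paths below are placeholders, not taken from this thread):

  # clone the failing disk onto a healthy one with GNU ddrescue:
  # a fast first pass, then retries on the bad areas
  ddrescue -f -n /dev/sdX /dev/sdY /root/osd62-rescue.map
  ddrescue -f -d -r3 /dev/sdX /dev/sdY /root/osd62-rescue.map

  # with the OSD stopped, export an affected PG from the cloned disk ...
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-62 \
      --pgid 10.2e --op export --file /root/pg10.2e.export

  # ... and import it into another stopped OSD (one that does not
  # already hold this PG), then start that OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
      --op import --file /root/pg10.2e.export

Whether the export/import step is possible at all depends on how badly
the disks failed; if the OSD data can't be read, none of this will help.
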
>
> pool 10 'volumes' replicated size 2 min_size 1 crush_rule 1 object_hash
> rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 47570
> lfor 0/0/40781 flags hashpspool,selfmanaged_snaps stripe_width 0
> application rbd
I see you use 2x replication with min_size=1; that's dangerous and can
easily lead to data loss.
I wouldn't say it's impossible to get the data back, but something like
this can take a while (many hours) to be brought back online.
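
If you do get the data back, raising the pool's replication settings
would help avoid a repeat. A minimal sketch, using the pool name from
the dump above:

  ceph osd pool set volumes size 3
  ceph osd pool set volumes min_size 2

With min_size 2 the pool stops serving I/O when only one copy is left,
rather than carrying on and risking exactly this situation.
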
The three NVMe drives which failed within 10 minutes of each other spent the last day at Kroll/OnTrack for recovery. They can't do anything with them. Apparently they fell to a bug in the NVMe firmware which had been fixed, but the fix never got applied. (It might be worth noting that three more NVMe drives died within 48 hours before I could get them all 'out', but they staggered themselves so things could backfill.)
I'm willing to accept the data loss at this point for these four PGs. What can I do to zero these out, or even just tag them as complete, so we can get our filesystems back into service (and do due diligence with fsck/chkdsk/etc.)?
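
(For reference, the usual last-resort options for incomplete PGs look
roughly like the sketch below; the PG and OSD ids are just examples
taken from the listings that follow. Both options discard whatever data
only lived on the lost OSDs, so they are only worth running once
recovery of the dead drives is truly off the table.)

  # recreate a PG as empty, discarding its contents entirely
  ceph osd force-create-pg 10.2e --yes-i-really-mean-it

  # or, with the acting primary stopped, mark its copy of the PG
  # complete so peering can finish with whatever objects it still holds
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-84 \
      --pgid 10.c3 --op mark-complete
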
[@cephmon]# ceph pg ls incomplete
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP
10.c3 0 0 0 0 0 0 0 0 incomplete 16s 0'0 67570:16611 [84,119]p84 [84,119]p84 2020-03-13 00:06:12.356259 2020-03-11 13:04:17.124901
10.1d5 13882 0 0 0 58201653248 0 0 3063 incomplete 16s 48617'19136670 67570:76106823 [87,77]p87 [87,77]p87 2020-03-12 21:00:43.540659 2020-03-12 21:00:43.540659
* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilisation. See http://docs.ceph.com/docs/master/dev/placement-group/#omap-statistics for further details.
[@cephmon]# ceph pg ls down
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP
10.2e 88 0 0 0 373293056 0 0 3001 down 20s 49315'16087499 67570:33657 [77,143]p77 [77,143]p77 2020-03-12 07:55:04.030384 2020-03-05 10:26:43.183563
10.f3 244 0 0 0 1027604480 0 0 3015 down 20s 48741'18343076 67570:34213 [75,139]p75 [75,139]p75 2020-03-13 02:32:20.026885 2020-03-13 02:32:20.026885
* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilisation. See http://docs.ceph.com/docs/master/dev/placement-group/#omap-statistics for further details.
[@cephmon]#
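
For each of the four pg ids, the recovery_state section of `ceph pg <pg> query` should confirm which dead OSDs peering is still waiting on, roughly:

  ceph pg 10.2e query | grep -A3 -E 'down_osds_we_would_probe|peering_blocked_by'
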
Again, 62, 64 and 68 were the OSDs which died, and the cluster is clearly now trying to use others. And yes, I can bump the size to 3 going forward, but we need to get past these guys first.
What should be my next step?
Thanks!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx