Re: fixing another remapped+incomplete EC 4+2 pg

On Thu, Oct 18, 2018 at 2:28 PM Graham Allan <gta@xxxxxxx> wrote:
Thanks Greg,

This did get resolved, though I'm not 100% certain why!

For one of the suspect shards which caused a crash on backfill, I
attempted to delete the associated object via s3 late last week. I then
examined the filestore OSDs and the file shards were still present, at
least for the hour or so following (after which I stopped looking).

I left the cluster set to nobackfill over the weekend, during which time
all osds kept running; then on Monday morning I re-enabled backfill. I
expected the osd to crash again, after which I could look into moving or
deleting the implicated backfill shards out of the way. Instead, it
happily backfilled its way to cleanliness.
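
(For the record, the flag toggling was nothing exotic, just the standard
cluster flags, something like:

    ceph osd set nobackfill       # pause backfill while investigating
    ceph osd unset nobackfill     # allow backfill to resume

in case anyone wants to follow the same sequence.)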

I suppose it's possible the shards got deleted later in some kind of rgw
gc operation, and this could have cleared the problem? Unfortunately I
didn't look for them again before re-enabling backfill. I'm not sure if
that's how s3 object deletion works - does it make any sense?

Yes, RGW generates garbage collection logs which are processed later to perform the actual object deletes, separately from when objects are marked deleted in the S3 protocol. I don't know the details of the process, but it's entirely plausible that it simply went through and deleted all the bad objects during that time period.
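
If you want to check next time, I believe something along these lines (going from memory, so treat it as a sketch) will show and drain the pending gc entries:

    radosgw-admin gc list        # entries queued for deletion
    radosgw-admin gc process     # run garbage collection now rather than waiting
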
-Greg
 

The only other thing I did late last week was notice that one of the
active osds for the pg seemed very slow to respond - the drive was
clearly failing. I was never getting any actual i/o errors at the user
or osd level, though it did trigger a 24-hour deathwatch SMART warning a
bit later.

I exported the pg shard from the failing osd, and re-imported it to
another otherwise-evacuated osd. This was just for data safety; it seems
really unlikely this could be causing the other osds in the pg to crash...
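
(For anyone following along, that kind of shard export/import is the
ceph-objectstore-tool workflow, roughly as below with both the source and
destination osds stopped; the paths and pgid here are placeholders rather
than my actual values:

    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<failing-id> \
        --pgid <pgid>s<shard> --op export --file /tmp/pg-shard.export
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<spare-id> \
        --op import --file /tmp/pg-shard.export
)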

Graham

On 10/15/2018 01:44 PM, Gregory Farnum wrote:
>
>
> On Thu, Oct 11, 2018 at 3:22 PM Graham Allan <gta@xxxxxxx> wrote:
>
>     As the osd crash implies, setting "nobackfill" appears to let all the
>     osds keep running and the pg stays active and can apparently serve data.
>
>     If I track down the object referenced below in the object store, I can
>     download it without error via s3... though as I can't generate a
>     matching etag, it may well be corrupt.
>
>     Still I do wonder if deleting this object - either via s3, or maybe
>     more
>     likely directly within filestore, might permit backfill to continue.
>
>
> Yes, that is very likely! (...unless there are a bunch of other objects
> with the same issue.)
>
> I'm not immediately familiar with the crash asserts you're seeing, but
> it certainly looks like somehow the object data didn't quite get stored
> correctly as the metadata understands it. Perhaps a write got
> lost/missed on m+1 of the PG shards; setting
> osd_find_best_info_ignore_history_les then caused it to try and recover
> from what it had rather than following normal recovery procedures, and
> now it's not working.
> -Greg


--
Graham Allan
Minnesota Supercomputing Institute - gta@xxxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
