Re: fixing another remapped+incomplete EC 4+2 pg

Thanks Greg,

This did get resolved though I'm not 100% certain why!

For one of the suspect shards which caused the crash on backfill, I attempted to delete the associated object via s3 late last week. I then examined the filestore OSDs and the file shards were still present, for maybe an hour afterwards (after which I stopped looking).

I left the cluster set to nobackfill over the weekend, during which time all osds kept running; then on Monday morning I re-enabled backfill. I expected the osd to crash again, after which I could look into moving or deleting the implicated backfill shards out of the way. Instead, it happily backfilled its way to cleanliness.

I suppose it's possible the shards got deleted later in some kind of rgw gc operation, and this could have cleared the problem? Unfortunately I didn't look for them again before re-enabling backfill. I'm not sure if that's how s3 object deletion works - does it make any sense?

The only other thing I did late last week was notice that one of the active osds for the pg seemed very slow to respond - the drive was clearly failing. I was never getting any actual i/o errors at the user or osd level, though it did trigger a 24-hour deathwatch SMART warning a bit later.

I exported the pg shard from the failing osd, and re-imported it to another otherwise-evacuated osd. This was just for data safety; it seems really unlikely this could be causing the other osds in the pg to crash...

Graham

On 10/15/2018 01:44 PM, Gregory Farnum wrote:


On Thu, Oct 11, 2018 at 3:22 PM Graham Allan <gta@xxxxxxx> wrote:
    As the osd crash implies, setting "nobackfill" appears to let all the
    osds keep running and the pg stays active and can apparently serve data.

    If I track down the object referenced below in the object store, I can
    download it without error via s3... though as I can't generate a
    matching etag, it may well be corrupt.

    Still I do wonder if deleting this object - either via s3, or maybe more
    likely directly within filestore, might permit backfill to continue.


Yes, that is very likely! (...unless there are a bunch of other objects with the same issue.)

I'm not immediately familiar with the crash asserts you're seeing, but it certainly looks like somehow the object data didn't quite get stored correctly as the metadata understands it. Perhaps a write got lost/missed on m+1 of the PG shards, setting osd_find_best_info_ignore_history_les caused it to try to recover from what it had rather than following normal recovery procedures, and now it's not working.
-Greg
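
On the etag comparison mentioned in the quoted message, a minimal sketch of how an S3-style ETag is usually derived may help when checking a downloaded object. It assumes the common convention: a single-part upload's ETag is the plain MD5 of the body, and a multipart upload's ETag is the MD5 of the concatenated per-part MD5 digests followed by "-<part count>". The function name s3_etag and the part_size parameter are hypothetical; the part size the original uploader used is not known from this thread, and a wrong guess alone is enough to produce a mismatch on an intact object.

```python
import hashlib
from typing import Optional


def s3_etag(data: bytes, part_size: Optional[int] = None) -> str:
    """Compute an S3-style ETag for `data` (illustrative helper, not a Ceph API).

    part_size=None  -> treat as a single-part upload (plain MD5 hex digest).
    part_size=N     -> treat as a multipart upload made of N-byte parts.
    """
    if part_size is None:
        return hashlib.md5(data).hexdigest()
    if part_size <= 0:
        raise ValueError("part_size must be a positive number of bytes")

    # MD5 each part separately, keeping the raw 16-byte digests.
    part_digests = [
        hashlib.md5(data[offset:offset + part_size]).digest()
        for offset in range(0, len(data), part_size)
    ]
    # The multipart ETag is the MD5 of those digests joined together,
    # suffixed with the number of parts.
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

If the ETag reported for the object ends in "-N", the upload was multipart with N parts, so it is worth trying the part sizes the uploading client is likely to have used before concluding the data is corrupt.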


--
Graham Allan
Minnesota Supercomputing Institute - gta@xxxxxxx