Re: unable to repair PG

Just to update this issue.

I stopped OSD.6, removed the PG from its disk, and restarted the OSD. Ceph rebuilt the PG (including the bad object) and the cluster went back to HEALTH_OK.
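
For the record, the steps were roughly these (FileStore layout; the exact service command depends on the init system, and the backup path is just illustrative):

# stop the OSD holding the bad copy
service ceph stop osd.6
# move the PG directory aside rather than deleting it outright
mv /var/lib/ceph/osd/ceph-6/current/9.180_head /root/9.180_head.bak
# restart the OSD and let it backfill the PG
service ceph start osd.6
# once backfill finishes, re-check
ceph pg deep-scrub 9.180
ceph health detail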

Over the weekend the disk behind OSD.6 started reporting SMART errors, so it will be replaced.
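
The replacement will follow roughly the usual sequence (device name below is illustrative):

smartctl -a /dev/sdX          # confirm the SMART errors on the disk behind osd.6
ceph osd out 6                # let the data drain off it
service ceph stop osd.6       # once rebalancing has finished
ceph osd crush remove osd.6
ceph auth del osd.6
ceph osd rm 6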

Thanks for your help, Greg. I've opened a bug report in the tracker.

On Fri, Dec 12, 2014 at 9:53 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
[Re-adding the list]

Yeah, so "shard 6" means that it's osd.6 which has the bad data.
Apparently pg repair doesn't recover from this class of failures; if
you could file a bug, that would be appreciated.
But anyway, if you delete the object in question from OSD 6 and run a
repair on the pg again, it should recover just fine.
-Greg
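
Something along these lines on the osd.6 host should do it (FileStore layout; paths and the service command depend on the setup, and since the object name on disk is escaped, it's easiest to search for the un-truncated part of the name):

service ceph stop osd.6
find /var/lib/ceph/osd/ceph-6/current/9.180_head -name '*29145.4*'
rm <file found above>            # the shard 6 copy that is missing its xattrs
service ceph start osd.6
ceph pg repair 9.180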

On Fri, Dec 12, 2014 at 1:45 PM, Luis Periquito <periquito@xxxxxxxxx> wrote:
> Running firefly 0.80.7 with replicated pools, with 4 copies.
>
> On 12 Dec 2014 19:20, "Gregory Farnum" <greg@xxxxxxxxxxx> wrote:
>>
>> What version of Ceph are you running? Is this a replicated or
>> erasure-coded pool?
>>
>> On Fri, Dec 12, 2014 at 1:11 AM, Luis Periquito <periquito@xxxxxxxxx>
>> wrote:
>> > Hi Greg,
>> >
>> > thanks for your help. It's always highly appreciated. :)
>> >
>> > On Thu, Dec 11, 2014 at 6:41 PM, Gregory Farnum <greg@xxxxxxxxxxx>
>> > wrote:
>> >>
>> >> On Thu, Dec 11, 2014 at 2:57 AM, Luis Periquito <periquito@xxxxxxxxx>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > I've stopped OSD.16, removed the PG from the local filesystem, and
>> >> > started the OSD again. After Ceph rebuilt the PG on that OSD, I ran a
>> >> > deep-scrub and the PG is still inconsistent.
>> >>
>> >> What led you to remove it from osd 16? Is that the one hosting the log
>> >> you snipped from? Is osd 16 the one hosting shard 6 of that PG, or was
>> >> it the primary?
>> >
>> > OSD 16 is both the primary for this PG and the one with the snipped log.
>> > The other 3 OSDs don't have any mention of this PG in their logs, just some
>> > messages about slow requests and the backfill from when I removed the object.
>> > The shard actually came from OSD.6 - we currently don't have an OSD.3.
>> >
>> > This is the output of pg dump for this PG:
>> > 9.180    25614    0    0    0    23306482348    3001    3001    active+clean+inconsistent    2014-12-10 17:29:01.937929    40242'1108124    40242:23305321    [16,10,27,6]    16    [16,10,27,6]    16    40242'1071363    2014-12-10 17:29:01.937881    40242'1071363    2014-12-10 17:29:01.937881
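>> >
>> > (For reference, that came from something like:
>> > ceph pg dump | grep '^9\.180'
>> > and ceph pg 9.180 query shows a more detailed view of the same PG.)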
>> >
>> >>
>> >> Anyway, the message means that shard 6 (which I think is the seventh
>> >> OSD in the list) of PG 9.180 is missing a bunch of xattrs on object
>> >> 370cbf80/29145.4_xxx/head//9. I'm actually a little surprised it
>> >> didn't crash if it's missing the "_" attr....
>> >> -Greg
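>> >>
>> >> For what it's worth, those attrs are stored as filesystem xattrs (with a
>> >> user.ceph. prefix; larger ones can spill into the OSD's omap) on the
>> >> FileStore object file, so on the osd.6 host something like this should
>> >> show what is actually present (path is illustrative):
>> >>
>> >> cd /var/lib/ceph/osd/ceph-6/current/9.180_head
>> >> getfattr -d <the object's file>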
>> >
>> >
>> > Any idea on how to fix it?
>> >
>> >>
>> >>
>> >> >
>> >> > I'm running out of ideas on how to solve this. Does this mean that all
>> >> > copies of the object should also be inconsistent? Should I just try to
>> >> > figure out which object/bucket this belongs to and delete it/copy it
>> >> > again to the ceph cluster?
>> >> >
>> >> > Also, do you know what the error message means? Is it just some sort of
>> >> > metadata for this object that isn't correct, rather than the object itself?
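>> >> >
>> >> > (On the bucket question: the 29145.4 part of the object name should be
>> >> > the bucket marker, so something like this ought to map it back to a bucket:
>> >> > radosgw-admin bucket stats | grep -C 5 '29145.4'
>> >> > and the rest of the name is the object key within that bucket.)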
>> >> >
>> >> > On Wed, Dec 10, 2014 at 11:11 AM, Luis Periquito
>> >> > <periquito@xxxxxxxxx>
>> >> > wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> In the last few days this PG (pool .rgw.buckets) has gone into an error
>> >> >> state after the scrub process runs.
>> >> >>
>> >> >> After getting the error, and trying to see what the issue might be (and
>> >> >> finding none), I issued a ceph repair followed by a ceph deep-scrub.
>> >> >> However, that doesn't seem to have fixed anything and the error remains.
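>> >> >>
>> >> >> (The commands issued were along the lines of:
>> >> >> ceph pg repair 9.180
>> >> >> ceph pg deep-scrub 9.180
>> >> >> with ceph pg 9.180 query to keep an eye on the scrub state.)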
>> >> >>
>> >> >> The relevant log from the OSD is as follows.
>> >> >>
>> >> >> 2014-12-10 09:38:09.348110 7f8f618be700  0 log [ERR] : 9.180 deep-scrub 0 missing, 1 inconsistent objects
>> >> >> 2014-12-10 09:38:09.348116 7f8f618be700  0 log [ERR] : 9.180 deep-scrub 1 errors
>> >> >> 2014-12-10 10:13:15.922065 7f8f618be700  0 log [INF] : 9.180 repair ok, 0 fixed
>> >> >> 2014-12-10 10:55:27.556358 7f8f618be700  0 log [ERR] : 9.180 shard 6: soid 370cbf80/29145.4_xxx/head//9 missing attr _, missing attr _user.rgw.acl, missing attr _user.rgw.content_type, missing attr _user.rgw.etag, missing attr _user.rgw.idtag, missing attr _user.rgw.manifest, missing attr _user.rgw.x-amz-meta-md5sum, missing attr _user.rgw.x-amz-meta-stat, missing attr snapset
>> >> >> 2014-12-10 10:56:50.597952 7f8f618be700  0 log [ERR] : 9.180 deep-scrub 0 missing, 1 inconsistent objects
>> >> >> 2014-12-10 10:56:50.597957 7f8f618be700  0 log [ERR] : 9.180 deep-scrub 1 errors
>> >> >>
>> >> >> I'm running firefly, version 0.80.7.
>> >> >
>> >> >
>> >> >
>> >
>> >
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
