Re: unable to repair PG

Just to update this issue.

I stopped OSD.6, removed the PG from its disk, and restarted the OSD. Ceph rebuilt the PG (including the bad object) and the cluster went back to HEALTH_OK.
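
For the record, the steps were roughly these (FileStore layout; the exact service command depends on the init system, and the backup path is just illustrative):

# stop the OSD holding the bad copy
service ceph stop osd.6
# move the PG directory aside rather than deleting it outright
mv /var/lib/ceph/osd/ceph-6/current/9.180_head /root/9.180_head.bak
# restart the OSD and let it backfill the PG
service ceph start osd.6
# once backfill finishes, re-check
ceph pg deep-scrub 9.180
ceph health detail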

Over the weekend the disk behind OSD.6 started reporting SMART errors, so it will be replaced.
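
The replacement will follow roughly the usual sequence (device name below is illustrative):

smartctl -a /dev/sdX          # confirm the SMART errors on the disk behind osd.6
ceph osd out 6                # let the data drain off it
service ceph stop osd.6       # once rebalancing has finished
ceph osd crush remove osd.6
ceph auth del osd.6
ceph osd rm 6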

Thanks for your help, Greg. I've opened a bug report in the tracker.

On Fri, Dec 12, 2014 at 9:53 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
[Re-adding the list]

Yeah, so "shard 6" means that it's osd.6 which has the bad data.
Apparently pg repair doesn't recover from this class of failures; if
you could file a bug, that would be appreciated.
But anyway, if you delete the object in question from OSD 6 and run a
repair on the pg again, it should recover just fine.
-Greg
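
Something along these lines on the osd.6 host should do it (FileStore layout; paths and the service command depend on the setup, and since the object name on disk is escaped, it's easiest to search for the un-truncated part of the name):

service ceph stop osd.6
find /var/lib/ceph/osd/ceph-6/current/9.180_head -name '*29145.4*'
rm <file found above>            # the shard 6 copy that is missing its xattrs
service ceph start osd.6
ceph pg repair 9.180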

On Fri, Dec 12, 2014 at 1:45 PM, Luis Periquito <periquito@xxxxxxxxx> wrote:
> Running firefly 0.80.7 with replicated pools, with 4 copies.
>
> On 12 Dec 2014 19:20, "Gregory Farnum" <greg@xxxxxxxxxxx> wrote:
>>
>> What version of Ceph are you running? Is this a replicated or
>> erasure-coded pool?
>>
>> On Fri, Dec 12, 2014 at 1:11 AM, Luis Periquito <periquito@xxxxxxxxx>
>> wrote:
>> > Hi Greg,
>> >
>> > thanks for your help. It's always highly appreciated. :)
>> >
>> > On Thu, Dec 11, 2014 at 6:41 PM, Gregory Farnum <greg@xxxxxxxxxxx>
>> > wrote:
>> >>
>> >> On Thu, Dec 11, 2014 at 2:57 AM, Luis Periquito <periquito@xxxxxxxxx>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > I've stopped OSD.16, removed the PG from the local filesystem, and
>> >> > started the OSD again. After Ceph rebuilt the PG on that OSD, I ran a
>> >> > deep-scrub and the PG is still inconsistent.
>> >>
>> >> What led you to remove it from osd 16? Is that the one hosting the log
>> >> you snipped from? Is osd 16 the one hosting shard 6 of that PG, or was
>> >> it the primary?
>> >
>> > OSD 16 is both the primary for this PG and the one with the snipped log.
>> > The other 3 OSDs don't have any mention of this PG in their logs, just some
>> > messages about slow requests and the backfill from when I removed the object.
>> > The shard actually came from OSD.6 - we currently don't have an OSD.3.
>> >
>> > This is the output of pg dump for this PG:
>> > 9.180    25614    0    0    0    23306482348    3001    3001    active+clean+inconsistent    2014-12-10 17:29:01.937929    40242'1108124    40242:23305321    [16,10,27,6]    16    [16,10,27,6]    16    40242'1071363    2014-12-10 17:29:01.937881    40242'1071363    2014-12-10 17:29:01.937881
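>> >
>> > (For reference, that came from something like:
>> > ceph pg dump | grep '^9\.180'
>> > and ceph pg 9.180 query shows a more detailed view of the same PG.)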
>> >
>> >>
>> >> Anyway, the message means that shard 6 (which I think is the seventh
>> >> OSD in the list) of PG 9.180 is missing a bunch of xattrs on object
>> >> 370cbf80/29145.4_xxx/head//9. I'm actually a little surprised it
>> >> didn't crash if it's missing the "_" attr....
>> >> -Greg
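>> >>
>> >> For what it's worth, those attrs are stored as filesystem xattrs (with a
>> >> user.ceph. prefix; larger ones can spill into the OSD's omap) on the
>> >> FileStore object file, so on the osd.6 host something like this should
>> >> show what is actually present (path is illustrative):
>> >>
>> >> cd /var/lib/ceph/osd/ceph-6/current/9.180_head
>> >> getfattr -d <the object's file>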
>> >
>> >
>> > Any idea on how to fix it?
>> >
>> >>
>> >>
>> >> >
>> >> > I'm running out of ideas on how to solve this. Does this mean that all
>> >> > copies of the object should also be inconsistent? Should I just try to
>> >> > figure out which object/bucket this belongs to and delete it/copy it
>> >> > again to the ceph cluster?
>> >> >
>> >> > Also, do you know what the error message means? Is it just some sort of
>> >> > metadata for this object that isn't correct, rather than the object itself?
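>> >> >
>> >> > (On the bucket question: the 29145.4 part of the object name should be
>> >> > the bucket marker, so something like this ought to map it back to a bucket:
>> >> > radosgw-admin bucket stats | grep -C 5 '29145.4'
>> >> > and the rest of the name is the object key within that bucket.)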
>> >> >
>> >> > On Wed, Dec 10, 2014 at 11:11 AM, Luis Periquito
>> >> > <periquito@xxxxxxxxx>
>> >> > wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> In the last few days this PG (pool .rgw.buckets) has gone into an error
>> >> >> state after the scrub process runs.
>> >> >>
>> >> >> After getting the error, and trying to see what the issue might be (and
>> >> >> finding none), I issued a ceph repair followed by a ceph deep-scrub.
>> >> >> However, that doesn't seem to have fixed anything and the error remains.
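>> >> >>
>> >> >> (The commands issued were along the lines of:
>> >> >> ceph pg repair 9.180
>> >> >> ceph pg deep-scrub 9.180
>> >> >> with ceph pg 9.180 query to keep an eye on the scrub state.)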
>> >> >>
>> >> >> The relevant log from the OSD is as follows.
>> >> >>
>> >> >> 2014-12-10 09:38:09.348110 7f8f618be700  0 log [ERR] : 9.180 deep-scrub 0 missing, 1 inconsistent objects
>> >> >> 2014-12-10 09:38:09.348116 7f8f618be700  0 log [ERR] : 9.180 deep-scrub 1 errors
>> >> >> 2014-12-10 10:13:15.922065 7f8f618be700  0 log [INF] : 9.180 repair ok, 0 fixed
>> >> >> 2014-12-10 10:55:27.556358 7f8f618be700  0 log [ERR] : 9.180 shard 6: soid 370cbf80/29145.4_xxx/head//9 missing attr _, missing attr _user.rgw.acl, missing attr _user.rgw.content_type, missing attr _user.rgw.etag, missing attr _user.rgw.idtag, missing attr _user.rgw.manifest, missing attr _user.rgw.x-amz-meta-md5sum, missing attr _user.rgw.x-amz-meta-stat, missing attr snapset
>> >> >> 2014-12-10 10:56:50.597952 7f8f618be700  0 log [ERR] : 9.180 deep-scrub 0 missing, 1 inconsistent objects
>> >> >> 2014-12-10 10:56:50.597957 7f8f618be700  0 log [ERR] : 9.180 deep-scrub 1 errors
>> >> >>
>> >> >> I'm running firefly, version 0.80.7.
>> >> >
>> >> >
>> >> >
>> >
>> >
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
