Re: Repairing PG inconsistencies — Ceph Documentation - where's the text?

On 18/5/19 11:34 am, huang jun wrote:
> Stuart Longland <stuartl@xxxxxxxxxxxxxxxxxx> wrote on Sat, 18 May 2019 at 9:26 am:
>>
>> On 16/5/19 8:55 pm, Stuart Longland wrote:
>>> As this is Bluestore, it's not clear what I should do to resolve that,
>>> so I thought I'd "RTFM" before asking here:
>>> http://docs.ceph.com/docs/luminous/rados/operations/pg-repair/
>>>
>>> Maybe there's a secret hand-shake my web browser doesn't know about or
>>> maybe the page is written in invisible ink, but that page appears blank
>>> to me.
>>
>> Does anyone know why that page shows up blank?  I still have a placement
>> group that is "inconsistent".  (A different one this time, but still!)
>>
> That may be something wrong on ceph.com; it's a blank page for me as well.

Ahh okay, so I'm not going crazy … yet. :-)

>> Some pages I've researched suggest going to the OSD's mount-point and
>> moving the offending object away; however, Linux kernel 4.19.17 does not
>> have a 'bluestore' driver, so I can't mount the file system to get at
>> the offending object.
>>
>> Running `ceph pg repair <ID>` tells me it has "instructed" the OSD to do
>> a repair.  The OSD's logs show nothing at all, not even an
>> acknowledgement of the command, and the problem persists.  The only log
>> messages I have of the issue are from yesterday:
>>
>>> 2019-05-17 05:59:53.170552 7f009b0be700 -1 log_channel(cluster) log [ERR] : 7.1a shard 3 soid 7:581d78de:::rbd_data.b48c7238e1f29.0000000000001b34:head : candidate had a read error
>>> 2019-05-17 07:07:20.723999 7f009b0be700 -1 log_channel(cluster) log [ERR] : 7.1a shard 3 soid 7:5b335293:::rbd_data.8c9e1238e1f29.0000000000001438:head : candidate had a read error
>>> 2019-05-17 07:29:16.537539 7f009b0be700 -1 log_channel(cluster) log [ERR] : 7.1a deep-scrub 0 missing, 2 inconsistent objects
>>> 2019-05-17 07:29:16.537557 7f009b0be700 -1 log_channel(cluster) log [ERR] : 7.1a deep-scrub 2 errors
>>
>> … not from just now when I issued the command.  Why is my `ceph pg
>> repair` command being ignored?
> ceph pg repair will make the PG scrub and repair the inconsistencies.
> Do you still see these warning messages after 'pg repair'?

Yes, I've been running `ceph pg repair 7.1a` repeatedly for the past 4
hours.  No new log messages, and still `ceph health detail` shows this:

> carbon ~ # ceph pg repair 7.1a
> instructing pg 7.1a on osd.2 to repair
> carbon ~ # ceph health detail
> HEALTH_ERR 2 scrub errors; Possible data damage: 1 pg inconsistent
> OSD_SCRUB_ERRORS 2 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
>     pg 7.1a is active+clean+inconsistent, acting [2,3]
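
For reference, my understanding is that the scrub findings can be dumped
with rados (list-inconsistent-obj has been around since Jewel, I believe);
I'd expect it to show a read_error against the copy on osd.3:

  # show which objects/shards the scrub flagged (read-only, nothing to stop)
  rados list-inconsistent-obj 7.1a --format=json-pretty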

I've also tried `ceph pg deep-scrub 7.1a` to no effect.
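
If I understand correctly, the repair is just queued as a deep-scrub on the
primary (osd.2 here), so if scrubs aren't being scheduled it will sit there
silently.  A rough way to check (the daemon command has to be run on the
node hosting osd.2):

  # the stamps should move once the repair scrub actually runs
  ceph pg 7.1a query | grep -E 'scrub_stamp'
  # noscrub / nodeep-scrub flags would stop it from being scheduled
  ceph osd dump | grep flags
  # how many concurrent scrubs the OSD will accept (default is 1)
  ceph daemon osd.2 config get osd_max_scrubs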

I may shut the cluster down later to do some power infrastructure work
(I need to add a new power distribution box to power two new nodes) and
possibly even install a new 48-port Ethernet switch, but right now I'd
like to get my storage cluster back to health.
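
Coming back to the point above about moving the offending object away:
since there's no kernel driver to mount a Bluestore OSD, I gather the
closest equivalent is ceph-objectstore-tool run against the stopped OSD.
A rough sketch of what I have in mind (untested; the data path is the
usual default, and the object spec is whatever --op list prints for the
rbd_data object in question):

  # stop osd.3 first (however your init system does that), then:
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --pgid 7.1a --op list
  # feed the JSON object spec from the list output back in to drop that copy:
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 '<json-from-list-output>' remove
  # then restart osd.3 and re-run the repair so a good copy gets pulled across

Does that sound about right, or is there a less drastic way?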

Regards,
-- 
Stuart Longland (aka Redhatter, VK4MSL)

I haven't lost my mind...
  ...it's backed up on a tape somewhere.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



