Re: PG inconsistency

Thanks Sage!

----------------------------------------
> Date: Fri, 7 Nov 2014 02:19:06 -0800
> From: sage@xxxxxxxxxxxx
> To: yguang11@xxxxxxxxxxx
> CC: ceph-devel@xxxxxxxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
> Subject: Re: PG inconsistency
>
> On Thu, 6 Nov 2014, GuangYang wrote:
>> Hello Cephers,
>> Recently we observed a couple of inconsistencies in our Ceph cluster.
>> There were two major patterns leading to inconsistency: 1) EIO when
>> reading the file, and 2) an inconsistent digest (for EC) even though
>> there was no read error.
>>
>> While Ceph has built-in tools to repair the inconsistencies, I would
>> also like to check with the community on the best way to handle such
>> issues (e.g. should we run fsck / xfs_repair when such an issue
>> happens?).
>>
>> In more details, I have the following questions:
>> 1. When an inconsistency is detected, what is the chance that there is
>> a hardware issue that needs to be physically repaired, or should I
>> run some disk/filesystem tools to check further?
>
> I'm not really an operator so I'm not as familiar with these tools as I
> should be :(, but I suspect the prudent route is to check the SMART info
> on the disk, and/or trigger a scrub of everything else on the OSD (ceph
> osd scrub N). For DreamObjects, I think they usually just fail the OSD
> once it starts throwing bad sectors (most of the hardware is already
> reasonably aged).
Google's data also shows a strong correlation between scan errors (and several other SMART parameters) and disk failure: https://www.usenix.org/legacy/event/fast07/tech/full_papers/pinheiro/pinheiro.pdf.
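For reference, here is a rough sketch of what that check could look like on our side (the device path and OSD id below are placeholders, and the SMART output of course needs interpreting per drive model):

    # check overall SMART health and the error/reallocation counters on the suspect drive
    smartctl -H /dev/sdX
    smartctl -A /dev/sdX

    # then scrub (or deep-scrub) everything else on that OSD, e.g. osd.12
    ceph osd scrub 12
    ceph osd deep-scrub 12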
>
>> 2. Should we use fsck / xfs_repair to fix the inconsistencies, or should
>> we rely solely on Ceph's repair tools?
>
> That might not be a bad idea, but I would urge caution if xfs_repair finds
> any issues or makes any changes, as subtle changes to the fs contents can
> confuse ceph-osd. At an absolute minimum, do a full scrub after, but
> even better would be to fail the OSD.
>
> (FWIW I think we should document a recommended "safe" process for
> failing/replacing an OSD that takes the suspect data offline but waits for
> the cluster to heal before destroying any data. Simply marking the OSD
> out will work, but then when a fresh drive is added there will be a second
> repair/rebalance event, which isn't ideal.)
Yeah, that would be very helpful. I think the first decision to make is whether we should replace the disk. In our clusters we have seen data corruption (EIO) along with SMART warnings, which is an indicator of a bad disk; we have also observed xattrs being lost (http://tracker.ceph.com/issues/10018) without any SMART warnings. After talking to Sam, we suspect the latter might be due to an unexpected host reboot (or a mis-configured RAID controller), in which case we probably don't need to replace the disk and can repair with Ceph only.
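For the repair-by-Ceph-only case, a minimal sketch of what we would run (the pgid below is just a placeholder; the real ids come from the health output):

    # list the PGs that scrub flagged as inconsistent
    ceph health detail | grep inconsistent

    # ask the primary OSD of a specific PG to repair it
    ceph pg repair 3.0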

In terms of disk replacement, to avoid migrating data back and forth, are the two approaches below reasonable?
 1. Keep the OSD in, do an ad-hoc disk replacement and provision a new OSD (keeping the same OSD id), and then trigger data migration. This way the data migration only happens once; however, it requires operators to replace the disk very quickly (a rough sketch follows below).
 2. Move the data on the broken disk to a new disk completely and use Ceph to repair the bad objects.
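For approach 1, a rough sketch of the sequence I have in mind (the OSD id, device, and service commands are placeholders; the exact init commands depend on the distro, and the OSD keyring would need to be preserved or re-registered with ceph auth):

    # avoid re-balancing while the disk is being swapped
    ceph osd set noout

    # stop the OSD, swap the disk, and re-create the filesystem and data directory
    service ceph stop osd.12
    mkfs.xfs /dev/sdX
    mount /dev/sdX /var/lib/ceph/osd/ceph-12
    ceph-osd -i 12 --mkfs --mkjournal

    # bring the OSD back and let backfill run, then re-enable re-balancing
    service ceph start osd.12
    ceph osd unset noout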

Thanks,
Guang

>
> sage
>
>>
>> It would be great to hear your experience and suggestions.
>>
>> BTW, we are using XFS in the cluster.
>>
>> Thanks,
>> Guang