Re: PG inconsistency

Dan van der Ster <daniel.vanderster@xxxxxxx> · Thu, 6 Nov 2014 13:53:43 +0000

IIRC, the EIO we had also correlated with a SMART status that showed the disk was bad enough for a warranty replacement -- so yes, I replaced the disk in these cases.
Cheers, Dan

On Thu Nov 06 2014 at 2:44:08 PM GuangYang <yguang11@xxxxxxxxxxx> wrote:
Thanks Dan. By "killed/formatted/replaced the OSD", did you replace the disk? I'm not a filesystem expert, but I would like to understand what actually happened behind the EIO and whether it reveals something (e.g. a hardware issue).

In our case, we are using 6TB drives, so there is a lot of data to migrate, and since backfilling/recovery increases latency, we hope to avoid that as much as we can.

Thanks,

Guang

________________________________

> From: daniel.vanderster@xxxxxxx
> Date: Thu, 6 Nov 2014 13:36:46 +0000
> Subject: Re: PG inconsistency
> To: yguang11@xxxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx

>

> Hi,
> I've only ever seen (1), EIO to read a file. In this case I've always
> just killed / formatted / replaced that OSD completely -- that moves
> the PG to a new master and the new replication "fixes" the
> inconsistency. This way, I've never had to pg repair. I don't know if
> this is a best or even good practise, but it works for us.
> Cheers, Dan

>

> On Thu Nov 06 2014 at 2:24:32 PM GuangYang <yguang11@xxxxxxxxxxx> wrote:

> Hello Cephers,
> Recently we observed a couple of inconsistencies in our Ceph cluster.
> I observed two major patterns leading to inconsistency:
> 1) EIO when reading the file, 2) the digest is inconsistent (for EC)
> even though there is no read error.

>

> While Ceph has built-in tools to repair inconsistencies, I would also
> like to check with the community on the best way to handle such
> issues (e.g. should we run fsck / xfs_repair when such an issue
> happens).

>

> In more detail, I have the following questions:
> 1. When an inconsistency is detected, what is the chance that there is
> a hardware issue that needs to be repaired physically, or should I
> run some disk/filesystem tools to check further?
> 2. Should we use fsck / xfs_repair to fix the inconsistencies, or
> should we rely solely on Ceph's repair tools?

>

> It would be great to hear your experience and suggestions.

>

> BTW, we are using XFS in the cluster.

>

> Thanks,

> Guang
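
The first step discussed in this thread, finding which PGs are inconsistent and which OSDs hold them so that the underlying disks can be checked (e.g. with SMART), could be sketched roughly as follows. This is only an illustration: it assumes the output of `ceph health detail` has been captured as text and that inconsistent PGs appear on lines of the form `pg <pgid> is <state>, acting [<osd ids>]`. The function name `find_inconsistent_pgs` and the `SAMPLE` text are hypothetical, not taken from the thread or from Ceph itself.

```python
import re

# Assumed line format (may differ between Ceph releases):
#   pg 2.5 is active+clean+inconsistent, acting [1,0,2]
PG_LINE = re.compile(r"^\s*pg (\S+) is (\S+), acting \[([\d,]*)\]")


def find_inconsistent_pgs(health_detail):
    """Return (pgid, acting_osd_ids) for each PG whose state includes 'inconsistent'."""
    found = []
    for line in health_detail.splitlines():
        match = PG_LINE.match(line)
        if match is None:
            continue
        pgid, state, acting = match.groups()
        # PG states are '+'-joined flags, e.g. active+clean+inconsistent.
        if "inconsistent" in state.split("+"):
            osd_ids = [int(osd) for osd in acting.split(",") if osd]
            found.append((pgid, osd_ids))
    return found


# Hypothetical sample text, for illustration only (not output from a real cluster).
SAMPLE = """HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 2.5 is active+clean+inconsistent, acting [1,0,2]
pg 3.1a is active+clean, acting [4,5,6]
2 scrub errors
"""

if __name__ == "__main__":
    for pgid, osds in find_inconsistent_pgs(SAMPLE):
        print(f"PG {pgid} is inconsistent; check the disks behind OSDs {osds}")
```

The OSD ids returned for each PG are the ones whose disks would be worth checking before deciding between replacing the OSD, as Dan describes, and running a repair.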

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com