Ceph can't seem to forget

For your RBD volumes, you've lost random 4MiB chunks from your virtual
disks.  Think of it as unrecoverable bad sectors on an HDD.  It was only a
few unfound objects though (ceph status said 23 out of 5128982), so you can
probably recover from that.
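If you haven't already, "ceph health detail" will tell you which pgs the
unfound objects are (or were) in; something like this, from memory:

    ceph health detail | grep unfound    # e.g. "pg 3.1ab is active+recovering, 23 unfound"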

I'd fsck all of the volumes, and perform any application-level checks you
can on top of that (table checks for MySQL, stuff like that).
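If a volume isn't attached to a running VM, you can map it on a host and run
a read-only fsck directly.  Roughly like this (the pool and volume names are
just placeholders):

    rbd map volumes/volume-1234abcd            # appears as /dev/rbd/volumes/volume-1234abcd
    fsck -n /dev/rbd/volumes/volume-1234abcd   # -n = check only, don't repair yet
    rbd unmap /dev/rbd/volumes/volume-1234abcd

If the image has a partition table or LVM inside, point fsck at the right
partition instead of the whole device.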

If you still have the list of unfound objects, you might be able to trace
them back to the specific RBD volumes.  That would give you a short list of
volumes to check, instead of doing them all.
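Each data object's name starts with the image's block name prefix
(rbd_data.<image id> for format 2 images), and "rbd info" prints that prefix
as block_name_prefix.  So, untested and with placeholder pool/pg names,
something like:

    ceph pg 3.1ab list_missing | grep oid    # names of the unfound objects in that pg
    for img in $(rbd -p volumes ls); do
        echo "$img  $(rbd -p volumes info "$img" | grep block_name_prefix)"
    done

Then match the prefixes from the first command against the second.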




On Thu, Aug 7, 2014 at 3:54 PM, Sean Sullivan <lookcrabs at gmail.com> wrote:

> Thanks Craig! I think I got it back up. The odd thing is that only 2 of
> the pgs using the osds on the downed nodes were corrupted.
>
> I ended up forcing all of the osds in those placement groups down, rebooting
> the hosts, then restarting the osds and bringing them back up to get it
> working.
>
> I had previously restarted the osds in those pgs, but something must have
> been stuck.
>
> Now I am seeing corrupt data like you mentioned and am beginning to
> question the integrity of the pool.
>
> So far the cinder volume for our main login node had some corruption but
> no complaints yet. It repaired without issue.
>

