Re: Recover unfound objects from crashed OSD's underlying filesystem

Gregory Farnum <gfarnum@xxxxxxxxxx> · Wed, 17 Feb 2016 15:22:30 -0800

On Wed, Feb 17, 2016 at 3:05 PM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
> Hello cephers,
> due to an unfortunate sequence of events (disk crashes, network
> problems), we are currently in a situation with one PG that reports
> unfound objects. There is also an OSD which cannot start-up and
> crashes with the following:
>
> 2016-02-17 18:40:01.919546 7fecb0692700 -1 os/FileStore.cc: In
> function 'virtual int FileStore::read(coll_t, const ghobject_t&,
> uint64_t, size_t, ceph::bufferlist&, bool)
> ' thread 7fecb0692700 time 2016-02-17 18:40:01.889980
> os/FileStore.cc: 2650: FAILED assert(allow_eio ||
> !m_filestore_fail_eio || got != -5)
>
> (There is probably a problem with the OSD's underlying disk storage)
>
> By querying the PG that is stuck in
> active+recovering+degraded+remapped state due to the unfound objects,
> I understand that all possible OSDs have been probed except for the
> crashed one:
>
> "might_have_unfound": [
>   { "osd": "30",
>    "status": "already probed"},
>   { "osd": "102",
>    "status": "already probed"},
>   { "osd": "104",
>    "status": "osd is down"},
>   { "osd": "105",
>    "status": "already probed"},
>   { "osd": "145",
>     "status": "already probed"}],
>
> so I understand that the crashed OSD may have the latest version of
> the objects. I can also verify that I can find the 4MB objects in
> the underlying filesystem of the crashed OSD.
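To pick the not-yet-probed sources out of the PG query programmatically, a minimal Python sketch could walk the query output as below. The JSON is a hypothetical, heavily trimmed stand-in for ceph pg 3.5a9 query; it assumes (as far as we know from typical output) that might_have_unfound sits inside one of the recovery_state entries, and it treats any status other than "already probed" as a source still worth checking.

```python
import json

# Hypothetical, heavily trimmed `ceph pg 3.5a9 query` output. Assumption:
# might_have_unfound lives inside one of the recovery_state entries.
QUERY_JSON = """
{"state": "active+recovering+degraded+remapped",
 "recovery_state": [
   {"name": "Started/Primary/Active",
    "might_have_unfound": [
      {"osd": "30",  "status": "already probed"},
      {"osd": "102", "status": "already probed"},
      {"osd": "104", "status": "osd is down"},
      {"osd": "105", "status": "already probed"},
      {"osd": "145", "status": "already probed"}]},
   {"name": "Started"}]}
"""


def unprobed_sources(query: dict) -> dict[int, str]:
    """Map OSD id -> status for every might_have_unfound entry not yet probed."""
    result: dict[int, str] = {}
    for state in query.get("recovery_state", []):
        for entry in state.get("might_have_unfound", []):
            if entry.get("status") != "already probed":
                result[int(entry["osd"])] = entry["status"]
    return result
```

On the excerpt above, unprobed_sources(json.loads(QUERY_JSON)) singles out OSD 104 with status "osd is down", matching the reading that only the crashed OSD is left unprobed.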
>
> By issuing ceph pg 3.5a9 list_missing, I get information like this
> for every unfound object:
>
>         { "oid": { "oid":
> "829d5be29cd6e231e7e951ba58ad3d0baf7fba65aad40083cef39bb03d5ec0fd",
>               "key": "",
>               "snapid": -2,
>               "hash": 3880052137,
>               "max": 0,
>               "pool": 3,
>               "namespace": ""},
>           "need": "255658'37078125",
>           "have": "255651'37077081",
>           "locations": []}
>
>
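The "need" and "have" fields are epoch'version pairs, and comparing them shows how far back a revert would go. The sketch below is a hypothetical Python helper, not part of any Ceph tooling: it parses the pair and orders it epoch first, then version, which is how we understand Ceph orders these values. Note that the version counter is per PG, so the numeric gap counts PG-wide log entries, not writes to this one object.

```python
from typing import NamedTuple


class EVersion(NamedTuple):
    """An epoch'version pair as printed by Ceph, e.g. "255658'37078125"."""
    epoch: int
    version: int

    @classmethod
    def parse(cls, text: str) -> "EVersion":
        epoch, version = text.split("'")
        return cls(int(epoch), int(version))


def revert_gap(entry: dict) -> tuple[EVersion, EVersion]:
    """Return (need, have) for one list_missing entry, checking need is newer."""
    need = EVersion.parse(entry["need"])
    have = EVersion.parse(entry["have"])
    # NamedTuple compares field by field: epoch first, then version.
    if not have < need:
        raise ValueError(f"unexpected: have {have} is not older than need {need}")
    return need, have
```

For the object quoted above, need is (255658, 37078125) and have is (255651, 37077081), so the surviving copies are older than the one the PG is waiting for.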
> My question is what is the best solution that I should follow?
> a. Is there any way to export the objects from the crashed OSD's
> filesystem and reimport them into the cluster? How could that be done?

Look at ceph-objectstore-tool, e.g.:
http://ceph-users.ceph.narkive.com/lwDkR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool
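As a rough illustration of the export/import round trip, here is a hypothetical Python helper that only builds the ceph-objectstore-tool command lines; it does not run them. It assumes the default /var/lib/ceph/osd/ceph-<id> layout, a FileStore journal at <data-path>/journal, and that the OSD being read or written is stopped. The file path and target OSD are placeholders. Since the export reads through the same failing disk, it may hit the same EIO the OSD crashed on.

```python
import shlex

# Assumption: default OSD layout (/var/lib/ceph/osd/ceph-<id>) and a FileStore
# journal at <data-path>/journal; adjust both for your deployment.
DEFAULT_OSD_ROOT = "/var/lib/ceph/osd"


def _paths(osd_id: int, osd_root: str) -> list[str]:
    """Return the --data-path/--journal-path arguments for one FileStore OSD."""
    data_path = f"{osd_root}/ceph-{osd_id}"
    return ["--data-path", data_path, "--journal-path", f"{data_path}/journal"]


def export_pg_cmd(osd_id: int, pgid: str, out_file: str,
                  osd_root: str = DEFAULT_OSD_ROOT) -> list[str]:
    """Build (but do not run) an export of one PG from a stopped OSD."""
    return ["ceph-objectstore-tool", *_paths(osd_id, osd_root),
            "--pgid", pgid, "--op", "export", "--file", out_file]


def import_pg_cmd(osd_id: int, in_file: str,
                  osd_root: str = DEFAULT_OSD_ROOT) -> list[str]:
    """Build (but do not run) an import of a previously exported PG file."""
    return ["ceph-objectstore-tool", *_paths(osd_id, osd_root),
            "--op", "import", "--file", in_file]


def as_shell(argv: list[str]) -> str:
    """Render an argv list as a copy-pasteable shell command."""
    return shlex.join(argv)
```

For example, as_shell(export_pg_cmd(104, "3.5a9", "/root/3.5a9.export")) yields a command exporting PG 3.5a9 from the crashed OSD 104. The resulting file would then be imported on another stopped OSD with import_pg_cmd. Building argv lists rather than strings keeps quoting out of the picture until the final render.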

> b. If I issue ceph pg {pg-id} mark_unfound_lost revert, should I
> expect that the "have" version of this object (thus an older version
> of the object) will become enabled?

It should, although I gather this sometimes takes some contortions for
reasons I've never worked out.
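Question b can be made concrete with a small, hypothetical Python wrapper around ceph pg <pgid> mark_unfound_lost revert, which per the answer above should leave each object at its "have" version. The guard is our own addition and not ceph CLI behaviour: it refuses to build the command while a might_have_unfound entry still reads "osd is down", since that OSD (104 here) may hold the newer "need" copies that a revert would give up on. The mode check covers the revert|delete choices listed in the Ceph docs.

```python
VALID_MODES = ("revert", "delete")  # the two modes listed in the Ceph docs


def still_down(might_have_unfound: list[dict]) -> list[str]:
    """Return OSD ids whose might_have_unfound status is 'osd is down'."""
    return [e["osd"] for e in might_have_unfound if e.get("status") == "osd is down"]


def mark_unfound_lost_cmd(pgid: str, might_have_unfound: list[dict],
                          mode: str = "revert") -> list[str]:
    """Build 'ceph pg <pgid> mark_unfound_lost <mode>' behind a local guard.

    Hypothetical safety check (ours, not the ceph CLI's): refuse while any
    candidate source is still down, since it may hold the newer copies.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    down = still_down(might_have_unfound)
    if down:
        raise RuntimeError("candidate sources still down: OSD " + ", ".join(down))
    return ["ceph", "pg", pgid, "mark_unfound_lost", mode]
```

With the list quoted earlier, the wrapper raises because OSD 104 is down; once every entry reads "already probed" it returns the plain revert command for PG 3.5a9.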
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com