Recover unfound objects from crashed OSD's underlying filesystem

Kostis Fardelas <dante1234@xxxxxxxxx> · Thu, 18 Feb 2016 01:05:24 +0200

Hello cephers,
due to an unfortunate sequence of events (disk crashes, network
problems), we currently have one PG that reports unfound objects.
There is also an OSD that cannot start up and crashes with the
following assertion failure:

2016-02-17 18:40:01.919546 7fecb0692700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread 7fecb0692700 time 2016-02-17 18:40:01.889980
os/FileStore.cc: 2650: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)

(The -5 in the assert is -EIO, an I/O error returned by the read, so
there is probably a problem with the OSD's underlying disk storage.)

Querying the PG, which is stuck in the
active+recovering+degraded+remapped state because of the unfound
objects, shows that every candidate OSD has been probed except the
crashed one (osd.104):

"might_have_unfound": [
  { "osd": "30",
    "status": "already probed"},
  { "osd": "102",
    "status": "already probed"},
  { "osd": "104",
    "status": "osd is down"},
  { "osd": "105",
    "status": "already probed"},
  { "osd": "145",
    "status": "already probed"}],

so I understand that the crashed OSD may hold the latest version of
the objects. I can also confirm that the 4MB objects are present in
the underlying filesystem of the crashed OSD.
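
The check on the "might_have_unfound" list can be sketched in Python. This is only a minimal illustration, assuming the fragment quoted above is pasted into a string and wrapped in braces so that it parses as JSON; the function name unprobed_osds is hypothetical and not part of any Ceph tooling.

```python
import json

# Fragment copied from the "might_have_unfound" section quoted above,
# wrapped in braces so that it parses as standalone JSON.
PG_QUERY_FRAGMENT = """
{"might_have_unfound": [
  {"osd": "30",  "status": "already probed"},
  {"osd": "102", "status": "already probed"},
  {"osd": "104", "status": "osd is down"},
  {"osd": "105", "status": "already probed"},
  {"osd": "145", "status": "already probed"}
]}
"""


def unprobed_osds(fragment):
    """Return (osd, status) pairs whose status is not 'already probed'.

    Hypothetical helper: it only filters the JSON text it is given and
    does not talk to a cluster.
    """
    entries = json.loads(fragment)["might_have_unfound"]
    return [
        (entry["osd"], entry["status"])
        for entry in entries
        if entry["status"] != "already probed"
    ]


if __name__ == "__main__":
    for osd, status in unprobed_osds(PG_QUERY_FRAGMENT):
        print(f"osd.{osd}: {status}")
```

With the data from this post, the only OSD left in the result is the one that is down, which is the OSD that may still hold the newer object versions.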

Running ceph pg 3.5a9 list_missing returns an entry like this for
each unfound object:

        { "oid": { "oid": "829d5be29cd6e231e7e951ba58ad3d0baf7fba65aad40083cef39bb03d5ec0fd",
              "key": "",
              "snapid": -2,
              "hash": 3880052137,
              "max": 0,
              "pool": 3,
              "namespace": ""},
          "need": "255658'37078125",
          "have": "255651'37077081",
          "locations": []}

My question is: what is the best way to proceed?
a. Is there any way to export the objects from the crashed OSD's
filesystem and reimport them into the cluster? How could that be done?
b. If I issue ceph pg {pg-id} mark_unfound_lost revert, should I
expect the "have" version of each object (that is, an older version
of the object) to become the current one?
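
To make the difference between the "need" and "have" versions concrete, here is a small Python sketch that splits the two version strings from the list_missing entry above into their epoch and version parts. It assumes the strings have the form epoch'version; the helper names parse_eversion and summarize are hypothetical, and the sketch only reads the pasted JSON, it does not query or change the cluster.

```python
import json

# Entry copied from the list_missing output quoted above.
ENTRY = """
{"oid": {"oid": "829d5be29cd6e231e7e951ba58ad3d0baf7fba65aad40083cef39bb03d5ec0fd",
         "key": "",
         "snapid": -2,
         "hash": 3880052137,
         "max": 0,
         "pool": 3,
         "namespace": ""},
 "need": "255658'37078125",
 "have": "255651'37077081",
 "locations": []}
"""


def parse_eversion(text):
    """Split an epoch'version string into an (epoch, version) tuple of ints."""
    epoch, version = text.split("'")
    return int(epoch), int(version)


def summarize(entry):
    """Compare the 'need' and 'have' versions of one list_missing entry.

    Assumption: the version part is a PG-level counter, so the gap is
    measured in PG versions, not in writes to this particular object.
    """
    need = parse_eversion(entry["need"])
    have = parse_eversion(entry["have"])
    return {
        "object": entry["oid"]["oid"],
        "need": need,
        "have": have,
        "epoch_gap": need[0] - have[0],
        "pg_version_gap": need[1] - have[1],
        "has_known_location": bool(entry["locations"]),
    }


if __name__ == "__main__":
    for key, value in summarize(json.loads(ENTRY)).items():
        print(f"{key}: {value}")
```

For this entry the "have" version is older than the "need" version and "locations" is empty, which matches the situation described: no OSD that has been probed can supply the needed version.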

Best regards,
Kostis
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com