Hi,

Following up on this issue, I've identified that almost all unfound
objects belong to a single RBD volume (with the help of the script
below [1]).

Now, what's the best way to try to recover the filesystem stored on
this RBD volume? 'mark_unfound_lost revert' or 'mark_unfound_lost lost'
and then running fsck? A rough sketch of what I have in mind is at the
bottom of this mail, after the quoted thread [2].

By the way, I'm also still interested to know whether the procedure
I tried with ceph_objectstore_tool was correct.

Thanks!

François

[1] ceph-list-unfound.sh

#!/bin/sh
for pg in $(ceph health detail | awk '/unfound$/ { print $2; }'); do
    ceph pg $pg list_missing | jq .objects
done | jq -s add | jq '.[] | .oid.oid'

On 11. 09. 14 11:05, Francois Deppierraz wrote:
> Hi Greg,
>
> An attempt to recover pg 3.3ef by copying it from broken osd.6 to
> working osd.32 resulted in one more broken osd :(
>
> Here's what was actually done:
>
> root@storage1:~# ceph pg 3.3ef list_missing | head
> { "offset": { "oid": "",
>       "key": "",
>       "snapid": 0,
>       "hash": 0,
>       "max": 0,
>       "pool": -1,
>       "namespace": ""},
>   "num_missing": 219,
>   "num_unfound": 219,
>   "objects": [
> [...]
> root@storage1:~# ceph pg 3.3ef query
> [...]
>   "might_have_unfound": [
>         { "osd": 6,
>           "status": "osd is down"},
>         { "osd": 19,
>           "status": "already probed"},
>         { "osd": 32,
>           "status": "already probed"},
>         { "osd": 42,
>           "status": "already probed"}],
> [...]
>
> # Exporting pg 3.3ef from broken osd.6
>
> root@storage2:~# ceph_objectstore_tool --data-path
> /var/lib/ceph/osd/ceph-6/ --journal-path
> /var/lib/ceph/osd/ssd0/6.journal --pgid 3.3ef --op export --file
> ~/backup/osd-6.pg-3.3ef.export
>
> # Remove an empty pg 3.3ef which was already present on this OSD
>
> root@storage2:~# service ceph stop osd.32
> root@storage2:~# ceph_objectstore_tool --data-path
> /var/lib/ceph/osd/ceph-32/ --journal-path
> /var/lib/ceph/osd/ssd0/32.journal --pgid 3.3ef --op remove
>
> # Import pg 3.3ef from dump
>
> root@storage2:~# ceph_objectstore_tool --data-path
> /var/lib/ceph/osd/ceph-32/ --journal-path
> /var/lib/ceph/osd/ssd0/32.journal --op import --file
> ~/backup/osd-6.pg-3.3ef.export
> root@storage2:~# service ceph start osd.32
>
>     -1> 2014-09-10 18:53:37.196262 7f13fdd7d780  5 osd.32 pg_epoch:
> 48366 pg[3.3ef(unlocked)] enter Initial
>      0> 2014-09-10 18:53:37.239479 7f13fdd7d780 -1 *** Caught signal
> (Aborted) **
>  in thread 7f13fdd7d780
>
>  ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
>  1: /usr/bin/ceph-osd() [0x8843da]
>  2: (()+0xfcb0) [0x7f13fcfabcb0]
>  3: (gsignal()+0x35) [0x7f13fb98a0d5]
>  4: (abort()+0x17b) [0x7f13fb98d83b]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f13fc2dc69d]
>  6: (()+0xb5846) [0x7f13fc2da846]
>  7: (()+0xb5873) [0x7f13fc2da873]
>  8: (()+0xb596e) [0x7f13fc2da96e]
>  9: /usr/bin/ceph-osd() [0x94b34f]
>  10: (pg_log_entry_t::decode_with_checksum(ceph::buffer::list::iterator&)+0x12c)
> [0x691b6c]
>  11: (PGLog::read_log(ObjectStore*, coll_t, hobject_t, pg_info_t const&,
> std::map<eversion_t, hobject_t, std::less<eversion_t>,
> std::allocator<std::pair<eversion_t const,
> hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&,
> std::basic_ostringstream<char, std::char_traits<char>,
> std::allocator<char> >&, std::set<std::string, std::less<std::string>,
> std::allocator<std::string> >*)+0x16d4) [0x7d3ef4]
>  12: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x2c1) [0x7951b1]
>  13: (OSD::load_pgs()+0x18f3) [0x61e143]
>  14: (OSD::init()+0x1b9a) [0x62726a]
>  15: (main()+0x1e8d) [0x5d2d0d]
>  16: (__libc_start_main()+0xed) [0x7f13fb97576d]
>  17: /usr/bin/ceph-osd()
> [0x5d69d9]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
> Fortunately, it was possible to bring osd.32 back into a working state
> simply by removing this pg:
>
> root@storage2:~# ceph_objectstore_tool --data-path
> /var/lib/ceph/osd/ceph-32/ --journal-path
> /var/lib/ceph/osd/ssd0/32.journal --pgid 3.3ef --op remove
>
> Did I miss something in this procedure, or does it mean that this pg is
> definitely lost?
>
> Thanks!
>
> François
>
> On 09. 09. 14 00:23, Gregory Farnum wrote:
>> On Mon, Sep 8, 2014 at 2:53 PM, Francois Deppierraz
>> <francois@ctrlaltdel.ch> wrote:
>>> Hi Greg,
>>>
>>> Thanks for your support!
>>>
>>> On 08. 09. 14 20:20, Gregory Farnum wrote:
>>>
>>>> The first one is not caused by the same thing as the ticket you
>>>> reference (it was fixed well before emperor), so it appears to be some
>>>> kind of disk corruption.
>>>> The second one is definitely corruption of some kind as it's missing
>>>> an OSDMap it thinks it should have. It's possible that you're running
>>>> into bugs in emperor that were fixed after we stopped doing regular
>>>> support releases of it, but I'm more concerned that you've got disk
>>>> corruption in the stores. What kind of crashes did you see previously;
>>>> are there any relevant messages in dmesg, etc?
>>>
>>> Nothing special in dmesg except probably irrelevant XFS warnings:
>>>
>>> XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
>>
>> Hmm, I'm not sure what the outcome of that could be. Googling for the
>> error message returns this as the first result, though:
>> http://comments.gmane.org/gmane.comp.file-systems.xfs.general/58429
>> Which indicates that it's a real deadlock and capable of messing up
>> your OSDs pretty good.
>>
>>>
>>> All logs from before the disaster are still there, do you have any
>>> advice on what would be relevant?
>>>
>>>> Given these issues, you might be best off identifying exactly which
>>>> PGs are missing, carefully copying them to working OSDs (use the osd
>>>> store tool), and killing these OSDs. Do lots of backups at each
>>>> stage...
>>>
>>> This sounds scary, I'll keep my fingers crossed and will do a bunch of
>>> backups. There are 17 pgs with missing objects.
>>>
>>> What do you exactly mean by the osd store tool? Is it the
>>> 'ceph_filestore_tool' binary?
>>
>> Yeah, that one.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
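
[2] recover-rbd-volume.sh -- a rough, untested sketch of the order of
operations I have in mind for this RBD volume, in case anyone can
confirm or correct it. The pool name ("rbd"), the image name
("vm-disk-1"), and the assumption that the image holds a single
filesystem with no partition table are placeholders, not my real setup.

#!/bin/sh
# Sketch only: take full backups before running anything destructive.

# 1. Give up on the unfound objects; 'revert' rolls each object back to
#    its last known version (or forgets it if none exists).
for pg in $(ceph health detail | awk '/unfound$/ { print $2; }'); do
    ceph pg $pg mark_unfound_lost revert
done

# 2. Work on a copy of the image, never on the original, so that a bad
#    fsck run cannot destroy more data.
rbd export rbd/vm-disk-1 /mnt/backup/vm-disk-1.img

# 3. Check the copy through a loop device first, and only repair the
#    live image once the result on the copy looks sane.
losetup /dev/loop0 /mnt/backup/vm-disk-1.img
fsck -f /dev/loop0
losetup -d /dev/loop0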