Re: Cannot mount CephFS after irreversible OSD lost

On Tue, Nov 17, 2015 at 10:08 AM, Mykola Dvornik
<mykola.dvornik@xxxxxxxxx> wrote:
> However, when I brought the mds back online, CephFS could no longer be
> mounted; the client fails with 'mount error 5 = Input/output error'.
> Since the mds was running just fine without any suspicious messages in its
> log, I decided that something had happened to its journal and that CephFS
> disaster recovery was needed. I stopped the mds and tried to make a backup
> of the journal. Unfortunately, the tool crashed with the following output:

A journal corruption would be more likely to make the MDS fail to go
active.  It sounds like on your system the MDS is making it into an
active state, and staying up through a client mount, but the client
itself is failing to mount.  You should investigate the client mount
failure more closely to work out what's going wrong.

Run the MDS with "debug mds = 20" and a fuse client with "debug client
= 20", to gather evidence of how/why the client mount is actually
failing.

> cephfs-journal-tool journal export backup.bin
> journal is 1841503004303~12076
> *** buffer overflow detected ***: cephfs-journal-tool terminated
> ======= Backtrace: =========
> /lib64/libc.so.6(__fortify_fail+0x37)[0x7f175ef12a57]
> /lib64/libc.so.6(+0x10bc10)[0x7f175ef10c10]
> /lib64/libc.so.6(+0x10b119)[0x7f175ef10119]
> /lib64/libc.so.6(_IO_vfprintf+0x2f00)[0x7f175ee4f430]
> /lib64/libc.so.6(__vsprintf_chk+0x88)[0x7f175ef101a8]
> /lib64/libc.so.6(__sprintf_chk+0x7d)[0x7f175ef100fd]
> cephfs-journal-tool(_ZN6Dumper4dumpEPKc+0x630)[0x7f1763374720]
> cephfs-journal-tool(_ZN11JournalTool14journal_exportERKSsb+0x294)[0x7f1763357874]
> cephfs-journal-tool(_ZN11JournalTool12main_journalERSt6vectorIPKcSaIS2_EE+0x105)[0x7f17633580c5]
> cephfs-journal-tool(_ZN11JournalTool4mainERSt6vectorIPKcSaIS2_EE+0x56e)[0x7f17633514de]
> cephfs-journal-tool(main+0x1de)[0x7f1763350d4e]
> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f175ee26af5]
> cephfs-journal-tool(+0x1ccae9)[0x7f1763356ae9]
> ...
> -3> 2015-11-17 10:43:00.874529 7f174db4b700  1 --
> xxx.xxx.xxx.xxx:6802/3019233561 <== osd.9 xxx.xxx.xxx.xxx:6808/13662 1 ====
> osd_op_reply(4 200.0006b309 [stat] v0'0 uv0 ack = -2 ((2) No such file or
> directory)) v6 ==== 179+0+0 (2303160312 0 0) 0x7f1767c719c0 con
> 0x7f1767d194a0

Oops, that's a bug.  http://tracker.ceph.com/issues/13816

> ...
>
> So I've used the rados tool to export the cephfs_metadata pool, and then
> proceeded with
>
> cephfs-journal-tool event recover_dentries summary
> cephfs-journal-tool journal reset
> cephfs-table-tool all reset session
> ceph fs reset home --yes-i-really-mean-it
>
> After this manipulation, the cephfs-journal-tool journal export backup.rec
> worked, but wrote 48 bytes at around 1.8TB offset!

That's expected behaviour -- the journal has been reset to nothing,
but the write position is still where it was (the export function
writes a sparse file with location based on the journal offset in
ceph).
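
If you want to confirm the export really is a sparse file, compare the
apparent size with the allocated size:

ls -lh backup.rec   # apparent size, reflects the ~1.8TB journal offset
du -h backup.rec    # actual allocated space, should only be a few KB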

> Then I've brought the mds back online, but CephFS is still not mountable.
> I've tried to flush the journal with:
>
> ceph daemon mds.000-s-ragnarok flush journal

Yep, that's not going to do anything because there's little or nothing
in the journal after its reset.
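
If you want to sanity-check the journal state after the reset, the journal
tool can show you the header and run a consistency pass, e.g.:

cephfs-journal-tool header get
cephfs-journal-tool journal inspect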

> No luck. Then I've stopped mds and relaunched with
>
> ceph-mds -i 000-s-ragnarok --journal_check 0 --debug_mds=10 --debug_ms=100
>
> It persistently outputs this snippet for a couple of hours:
>
> 7faf0bd58700  7 mds.0.cache trim max=100000  cur=17
> 7faf0bd58700 10 mds.0.cache trim_client_leases
> 7faf0bd58700  2 mds.0.cache check_memory_usage total 256288, rss 19116, heap
> 48056, malloc 1791 mmap 0, baseline 48056, buffers 0, 0 / 19 inodes have
> caps, 0 caps, 0 caps per inode
> 7faf0bd58700 10 mds.0.log trim 1 / 30 segments, 8 / -1 events, 0 (0)
> expiring, 0 (0) expired
> 7faf0bd58700 10 mds.0.log _trim_expired_segments waiting for
> 1841488226436/1841503004303 to expire
> 7faf0bd58700 10 mds.0.server find_idle_sessions.  laggy until 0.000000
> 7faf0bd58700 10 mds.0.locker scatter_tick
> 7faf0bd58700 10 mds.0.cache find_stale_fragment_freeze
> 7faf0bd58700 10 mds.0.snap check_osd_map - version unchanged
> 7faf0b557700 10 mds.beacon.000-s-ragnarok _send up:active seq 12

The key bit of MDS debug output is the part from the moment a client tries to mount.
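
With "debug mds = 20" the mount attempt shows up as client session handling
in the MDS log, so something like this should pull out the relevant lines
(the log path and the message text are assumptions on my part):

grep -i client_session /var/log/ceph/ceph-mds.000-s-ragnarok.log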

> So it appears to me that despite 'cephfs-journal-tool journal reset', the
> journal was not wiped and its corruption blocks CephFS from being
> mounted.

Nope, I don't think that's the right conclusion: we don't know what's
broken yet.

> At the moment I am running 'cephfs-data-scan scan_extents cephfs_data'. I
> guess it won't help me much to bring CephFS back online, but it might fix
> some corrupted metadata.

It could help (although it's the subsequent scan_inodes step that
actually updates the metadata pool), but only if the client mount
failure is being caused by missing metadata for e.g. the root inode.
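
For reference, the data scan is a two-step pass over the same pool, run
back to back (pool name taken from your command above):

cephfs-data-scan scan_extents cephfs_data
cephfs-data-scan scan_inodes cephfs_data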

> So my question is how to identify what really blocks CephFS from being
> mounted. Is it possible to start with a fresh journal by doing 'fs remove;
> fs new' reusing the data pool, and using the rados backup of the metadata
> pool?

I would expect restoring from your backup to bring you back to the
initial state of the system (i.e. before you tried any repairs):
whatever was causing issues in the metadata pool would still be there
because it will be in your backup too.
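
For completeness, dumping and restoring the metadata pool with the rados
tool would look roughly like this (the file name is just a placeholder),
but as above it would only bring back the same broken state:

rados -p cephfs_metadata export metadata.dump
rados -p cephfs_metadata import metadata.dump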

So: the key thing here is to capture whatever error is happening on
the MDS side at the moment the client tries to mount.  Then we know
what's broken and can talk about how to repair it.

If you can share your initial metadata pool dump (i.e. if it's not
ridiculously large and/or confidential) then I can inspect it and
possibly make use of it as a test case.

John
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



