Cannot mount CephFS after irreversible OSD loss

Dear ceph experts,

I've built and have been administering a 12-OSD Ceph cluster (spanning 3 nodes) with a replication count of 2. The Ceph version is

ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)

The cluster hosts two pools (data and metadata) that are exported over CephFS.

At some point the OSDs approached the 'full' state and one of them got corrupted. The easiest solution was to remove the corrupted OSD, wipe it, and re-add it to the cluster.

That went fine, and the cluster was recovering without issues. When only 39 degraded objects were left, another OSD got corrupted (its replication peer, actually). I was not able to recover it, so I made the hard decision to remove it, wipe it, and re-add it to the cluster as well (roughly the procedure sketched below). Since no backups had been made, some data loss was expected.
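
For clarity, the remove/wipe/re-add of each broken OSD followed what I understand to be the standard manual procedure, roughly the lines below; the OSD id and device are placeholders, not the exact ones I used:

ceph osd out osd.<id>
systemctl stop ceph-osd@<id>        # on the OSD's node
ceph osd crush remove osd.<id>
ceph auth del osd.<id>
ceph osd rm <id>
ceph-disk zap /dev/<device>         # wipe the disk
ceph-disk prepare /dev/<device>     # re-provision; the OSD then rejoins the cluster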

To my surprise, when all OSDs were back online and the cluster started to recover, only one incomplete PG was reported. I worked around it by ssh'ing to the node that holds that PG's primary OSD, exporting the corrupted PG with 'ceph-objectstore-tool --op export' and marking it 'complete' afterwards (a rough sketch of the commands follows the status output below). Once the cluster had recovered, I imported the PG's data back into its primary OSD. The recovery then fully completed, and at the moment 'ceph -s' gives me:

    cluster 7972d1e9-2843-41a3-a4e7-9889d9c75850
     health HEALTH_WARN
            1 near full osd(s)
     monmap e1: 1 mons at {000-s-ragnarok=xxx.xxx.xxx.xxx:6789/0}
            election epoch 1, quorum 0 000-s-ragnarok
     mdsmap e9393: 1/1/0 up {0=000-s-ragnarok=up:active}
     osdmap e185363: 12 osds: 12 up, 12 in
      pgmap v5599327: 1024 pgs, 2 pools, 7758 GB data, 22316 kobjects
            15804 GB used, 6540 GB / 22345 GB avail
                1020 active+clean
                   4 active+clean+scrubbing+deep
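
For reference, the incomplete-PG workaround was roughly the following, run with the owning OSD stopped; the OSD id, PG id and file name are placeholders, and the exact flags may differ slightly from what I typed at the time:

OSD=/var/lib/ceph/osd/ceph-<id>
ceph-objectstore-tool --data-path $OSD --journal-path $OSD/journal --pgid <pgid> --op export --file /tmp/<pgid>.export
ceph-objectstore-tool --data-path $OSD --journal-path $OSD/journal --pgid <pgid> --op mark-complete
# after the cluster had recovered:
ceph-objectstore-tool --data-path $OSD --journal-path $OSD/journal --op import --file /tmp/<pgid>.export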

However, when I brought the MDS back online, CephFS could no longer be mounted; the client complains with 'mount error 5 = Input/output error'. Since the MDS was running just fine without any suspicious messages in its log, I concluded that something had happened to its journal and that CephFS disaster recovery was needed. I stopped the MDS and tried to make a backup of the journal. Unfortunately, the tool crashed with the following output:

cephfs-journal-tool journal export backup.bin
journal is 1841503004303~12076
*** buffer overflow detected ***: cephfs-journal-tool terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x7f175ef12a57]
/lib64/libc.so.6(+0x10bc10)[0x7f175ef10c10]
/lib64/libc.so.6(+0x10b119)[0x7f175ef10119]
/lib64/libc.so.6(_IO_vfprintf+0x2f00)[0x7f175ee4f430]
/lib64/libc.so.6(__vsprintf_chk+0x88)[0x7f175ef101a8]
/lib64/libc.so.6(__sprintf_chk+0x7d)[0x7f175ef100fd]
cephfs-journal-tool(_ZN6Dumper4dumpEPKc+0x630)[0x7f1763374720]
cephfs-journal-tool(_ZN11JournalTool14journal_exportERKSsb+0x294)[0x7f1763357874]
cephfs-journal-tool(_ZN11JournalTool12main_journalERSt6vectorIPKcSaIS2_EE+0x105)[0x7f17633580c5]
cephfs-journal-tool(_ZN11JournalTool4mainERSt6vectorIPKcSaIS2_EE+0x56e)[0x7f17633514de]
cephfs-journal-tool(main+0x1de)[0x7f1763350d4e]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f175ee26af5]
cephfs-journal-tool(+0x1ccae9)[0x7f1763356ae9]
...
-3> 2015-11-17 10:43:00.874529 7f174db4b700  1 -- xxx.xxx.xxx.xxx:6802/3019233561 <== osd.9 xxx.xxx.xxx.xxx:6808/13662 1 ==== osd_op_reply(4 200.0006b309 [stat] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6 ==== 179+0+0 (2303160312 0 0) 0x7f1767c719c0 con 0x7f1767d194a0
...

So I used the rados tool to export the cephfs_metadata pool as a backup (the export command is sketched after the steps below), and then proceeded with:

cephfs-journal-tool event recover_dentries summary
cephfs-journal-tool journal reset
cephfs-table-tool all reset session
ceph fs reset home --yes-i-really-mean-it
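
For completeness, the metadata pool backup mentioned above was just a plain pool export, something like the following (if I remember the syntax right; the output file name is only an example):

rados -p cephfs_metadata export cephfs_metadata.backup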

After these manipulations, 'cephfs-journal-tool journal export backup.rec' worked, but it wrote only 48 bytes, at an offset of around 1.8 TB!

Then I brought the MDS back online, but CephFS still cannot be mounted.

I've tried to flush the journal with:

ceph daemon mds.000-s-ragnarok flush journal

No luck. Then I stopped the MDS and relaunched it with

ceph-mds -i 000-s-ragnarok --journal_check 0 --debug_mds=10 --debug_ms=100

It has been persistently printing this snippet for a couple of hours:

7faf0bd58700  7 mds.0.cache trim max=100000  cur=17
7faf0bd58700 10 mds.0.cache trim_client_leases
7faf0bd58700  2 mds.0.cache check_memory_usage total 256288, rss 19116, heap 48056, malloc 1791 mmap 0, baseline 48056, buffers 0, 0 / 19 inodes have caps, 0 caps, 0 caps per inode
7faf0bd58700 10 mds.0.log trim 1 / 30 segments, 8 / -1 events, 0 (0) expiring, 0 (0) expired
7faf0bd58700 10 mds.0.log _trim_expired_segments waiting for 1841488226436/1841503004303 to expire
7faf0bd58700 10 mds.0.server find_idle_sessions.  laggy until 0.000000
7faf0bd58700 10 mds.0.locker scatter_tick
7faf0bd58700 10 mds.0.cache find_stale_fragment_freeze
7faf0bd58700 10 mds.0.snap check_osd_map - version unchanged
7faf0b557700 10 mds.beacon.000-s-ragnarok _send up:active seq 12

So it appears to me that, despite 'cephfs-journal-tool journal reset', the journal has not actually been wiped, and its corruption blocks CephFS from being mounted.

The output of 'cephfs-journal-tool event get list' is

0x1acc221e68f SUBTREEMAP:  ()
0x1acc221e9ab UPDATE:  (scatter_writebehind)
  stray7
0x1acc221f05e UPDATE:  (scatter_writebehind)
  stray8
0x1acc221f711 UPDATE:  (scatter_writebehind)
  stray7
0x1acc221fdc4 UPDATE:  (scatter_writebehind)
  stray8
0x1acc2220477 UPDATE:  (scatter_writebehind)
  stray9
0x1acc2220b2a UPDATE:  (scatter_writebehind)
  stray9
0x1acc22211dd UPDATE:  (scatter_writebehind)

The output of 'cephfs-journal-tool header get' is

{
    "magic": "ceph fs volume v011",
    "write_pos": 1841503016379,
    "expire_pos": 1841503004303,
    "trimmed_pos": 1841488199680,
    "stream_format": 1,
    "layout": {
        "stripe_unit": 4194304,
        "stripe_count": 1,
        "object_size": 4194304,
        "cas_hash": 0,
        "object_stripe_unit": 0,
        "pg_pool": 2
    }
}

The output of 'cephfs-journal-tool journal inspect' is 

Overall journal integrity: OK

At the moment I am running 'cephfs-data-scan scan_extents cephfs_data'. I guess it won't help me much in bringing CephFS back online, but it might fix some corrupted metadata.
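
As far as I understand the documented workflow of this tool, the extent scan is meant to be followed by an inode scan over the same data pool, i.e. roughly:

cephfs-data-scan scan_extents cephfs_data
cephfs-data-scan scan_inodes cephfs_data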

So my question is how to identify what really blocks CephFS from being mounted. Is it possible to start over with a fresh journal by doing 'fs remove; fs new', reusing the existing data pool and restoring the metadata pool from the rados backup?
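
In case it helps to make the question concrete, what I have in mind is roughly the sequence below; 'home', 'cephfs_metadata' and 'cephfs_data' are the names from my setup, and whether restoring the metadata pool from the rados backup is even valid here is exactly what I am unsure about:

ceph mds fail 0                                          # stop/fail the active MDS first
ceph fs rm home --yes-i-really-mean-it
rados -p cephfs_metadata import cephfs_metadata.backup   # restore the earlier backup (assumption)
ceph fs new home cephfs_metadata cephfs_data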


