We have been using a CephFS pool to store machine data; the data is not overly critical at this time. But it has grown to around 8TB, and we started to see kernel panics on the hosts that had the mounts in place.

Now when we try to start the MDSs, they cycle through active, replay and clientreplay about 10 times and then just fail in an active (laggy) state. So I deleted the MDSs.

(docker-croit)@us-croit-enc-deploy01 ~ $ ceph fs dump
dumped fsmap epoch 5307
e5307
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'cephfs' (1)
fs_name cephfs
epoch   5307
flags   12
created 2019-10-26 20:43:02.087584
modified        2019-10-26 21:35:17.285598
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
min_compat_client       -1 (unspecified)
last_failure    0
last_failure_osd_epoch  2122066
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0
up      {0=267576193}
failed
damaged
stopped
data_pools      [5,14]
metadata_pool   3
inline_data     disabled
balancer
standby_count_wanted    1
267576193:      v1:100.129.255.186:6800/1355970155 'us-ceph-enc-svc02' mds.0.5301 up:active seq 16 laggy since 2019-10-26 21:12:08.027863

That looks OK. Then I ran:

(docker-croit)@us-croit-enc-deploy01 ~ $ ceph fs ls
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data us_enc_datarefuge_001a ]

(docker-croit)@us-croit-enc-deploy01 ~ $ cephfs-journal-tool --rank cephfs:all journal export lee.bak
journal is 523855986688~99768
wrote 99768 bytes at offset 523855986688 to lee.bak
NOTE: this is a _sparse_ file; you can
        $ tar cSzf lee.bak.tgz lee.bak
to efficiently compress it while preserving sparseness.

(docker-croit)@us-croit-enc-deploy01 ~ $ cephfs-journal-tool --rank cephfs:all event recover_dentries summary
Events by type:
  RESETJOURNAL: 1
  SESSION: 363
  SESSIONS: 17
  UPDATE: 14
Errors: 0

(docker-croit)@us-croit-enc-deploy01 ~ $ cephfs-journal-tool --rank cephfs:all journal reset
old journal was 523855986688~99768
new journal start will be 523860180992 (4094536 bytes past old end)
writing journal head
writing EResetJournal entry
done

(docker-croit)@us-croit-enc-deploy01 ~ $ cephfs-table-tool all reset session
{
    "0": {
        "data": {},
        "result": 0
    }
}

(docker-croit)@us-croit-enc-deploy01 ~ $ cephfs-table-tool all reset snap
{
    "result": 0
}

(docker-croit)@us-croit-enc-deploy01 ~ $ cephfs-table-tool all reset inode
{
    "0": {
        "data": {},
        "result": 0
    }
}

Then I re-add the MDSs and we go back round in the same circle.

Am I missing something? Do I need to drop the metadata and recreate it, maybe? If it comes to it I can drop all the data and start over, but I don't really want to.
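If rebuilding the metadata is the way to go, my reading of the disaster-recovery docs is that the next step after the journal/table resets above would be roughly the following. I haven't run any of this yet, so treat it as a sketch of what I think the procedure is rather than something I've verified; the pool name is just our first data pool from the ceph fs ls output above, and I'm not sure how the second data pool (us_enc_datarefuge_001a) fits in:

# with all MDSs stopped
ceph fs reset cephfs --yes-i-really-mean-it   # reset the MDS map back to a single rank
cephfs-data-scan init                         # recreate the root/initial metadata objects
cephfs-data-scan scan_extents cephfs_data     # pass 1: recover file sizes/mtimes from data objects
cephfs-data-scan scan_inodes cephfs_data      # pass 2: inject recovered inodes back into the metadata pool
cephfs-data-scan scan_links                   # check and fix link counts

If that's the wrong direction, or there's something less destructive I should try first, I'd appreciate a pointer before I go down that road.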