Re: ceph mds crashing constantly : ceph_assert fail … prepare_new_inode

On Sat, Aug 11, 2018 at 1:21 PM Amit Handa <amit.handa@xxxxxxxxx> wrote:
>
> Thanks for the response, Gregory.
>
> We need to support a couple of production services that we have migrated to Ceph, so we are in a bit of a soup.
>
> The cluster is as follows:
> ```
> ceph osd tree
> ID  CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
>  -1       11.06848 root default
>  -7        5.45799     host master
>   5   hdd  5.45799         osd.5       up  1.00000 1.00000
>  -5        1.81940     host node2
>   7   hdd  1.81940         osd.7       up  1.00000 1.00000
>  -3        1.81940     host node3
>   8   hdd  1.81940         osd.8       up  1.00000 1.00000
>  -9        1.81940     host node4
>   6   hdd  1.81940         osd.6       up  1.00000 1.00000
> -11        0.15230     host node5
>   9   hdd  0.15230         osd.9       up  1.00000 1.00000
> ```
>
> We have installed the Ceph cluster and a Kubernetes cluster on the same nodes (CentOS 7).
> We were seeing low performance from the Ceph cluster, ~10.5 MB/s with ```dd if=/dev/zero of=./here bs=1M count=1024 oflag=direct```.
> So we were in the process of adding an additional NIC to each node, rebooting the nodes one by one and making sure each rebooted node works well before proceeding further.
> After every few (a couple of) node reboots, the MDS would go down and report data damage.
> We would then follow the disaster recovery documentation and all would be well again.
>
> For a couple of days now, the MDS hasn't come up, and disaster recovery no longer works.
>

Try the following steps; a rough sketch of the corresponding commands follows the list.

1. Unmount all CephFS clients first (kill ceph-fuse, 'umount -f' for kernel mounts).
2. Start the MDS and run 'ceph daemon mds.x flush journal'.
3. Stop the MDS.
4. Run "cephfs-data-scan scan_links".
5. Use "cephfs-table-tool cephfs:0 take_inos ..." to take some free
inode numbers (10k should be enough).
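
A minimal sketch of those steps as shell commands, assuming the filesystem is named cephfs and the MDS id is x as above; the mount point, the systemd unit name and the take_inos value are placeholders you have to adapt to your cluster:

```
# 1. make sure no client has the filesystem mounted
umount -f /mnt/cephfs        # hypothetical kernel mount point
pkill ceph-fuse              # stop any FUSE clients

# 2./3. start the MDS, flush its journal into the backing pools, then stop it again
systemctl start ceph-mds@x
ceph daemon mds.x flush journal
systemctl stop ceph-mds@x

# 4. with the MDS offline, scan dirfrags and repair dentry/link information
cephfs-data-scan scan_links

# 5. mark a range of inode numbers as used so the MDS will not hand them out again;
#    pick a value comfortably above the highest inode number currently in use
cephfs-table-tool cephfs:0 take_inos <highest_used_ino_plus_margin>
```

After the last step, start the MDS again and watch 'ceph -s' until rank 0 reports up:active.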

> cluster conf:
> ```
> [global]
> fsid = 2ed909ef-e3d7-4081-b01a-d04d12a1155d
> mon_initial_members = master, node3, node2
> mon_host = 10.10.73.45,10.10.73.44,10.10.73.43
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
>
> public_network= 10.10.73.0/24
> osd pool default size = 2  # Write an object 2 times.
> osd pool default min size = 2
> mon allow pool delete = true
>
> cluster network = 10.10.73.0/24
>
> max open files = 131072
>
> [mon]
> mon data = /var/lib/ceph/mon/ceph-$id
>
> [osd]
> osd data = /var/lib/ceph/osd/ceph-$id
> osd journal size = 20000
> osd mkfs type = xfs
> osd mkfs options xfs = -f
>
> filestore xattr use omap = true
> filestore min sync interval = 10
> filestore max sync interval = 15
> filestore queue max ops = 25000
> filestore queue max bytes = 10485760
> filestore queue committing max ops = 5000
> filestore queue committing max bytes = 10485760000
>
> journal max write bytes = 1073714824
> journal max write entries = 10000
> journal queue max ops = 50000
> journal queue max bytes = 10485760000
>
> osd max write size = 512
> osd client message size cap = 2147483648
> osd deep scrub stride = 131072
> osd op threads = 8
> osd disk threads = 4
> osd map cache size = 1024
> osd map cache bl size = 128
> osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"
> osd recovery op priority = 4
> osd recovery max active = 10
> osd max backfills = 4
> osd skip data digest = true
>
> [client]
> rbd cache = true
> rbd cache size = 268435456
> rbd cache max dirty = 134217728
> rbd cache max dirty age = 5
> ```
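
One observation on the config above (not strictly related to the MDS crash): with osd pool default size = 2 and min size = 2, a PG blocks I/O as soon as either of its two replicas is offline, which is exactly what happens during the node-by-node reboots described earlier. The values actually applied to the existing pools can be checked with:

```
ceph osd pool ls detail                  # shows size/min_size per pool, independent of the defaults above
ceph osd pool get cephfs_metadata size   # assuming the default CephFS pool names
```
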
>
> ceph health:
> ```
> master@~/ ceph -s
>   cluster:
>     id:     2ed909ef-e3d7-4081-b01a-d04d12a1155d
>     health: HEALTH_ERR
>             4 scrub errors
>             Possible data damage: 1 pg inconsistent
>
>   services:
>     mon: 3 daemons, quorum node2,node3,master
>     mgr: master(active)
>     mds: cephfs-1/1/1 up  {0=master=up:active(laggy or crashed)}
>     osd: 5 osds: 5 up, 5 in
>
>   data:
>     pools:   2 pools, 300 pgs
>     objects: 194.1 k objects, 33 GiB
>     usage:   131 GiB used, 11 TiB / 11 TiB avail
>     pgs:     299 active+clean
>              1   active+clean+inconsistent
> ```
>
> ceph health detail
> ```
> ceph health detail
> HEALTH_ERR 4 scrub errors; Possible data damage: 1 pg inconsistent
> OSD_SCRUB_ERRORS 4 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
>     pg 1.43 is active+clean+inconsistent, acting [5,8,7]
> ```
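
As a side note, separate from the MDS recovery steps above: the inconsistent PG can usually be inspected and, if the output makes sense, repaired on its own; a sketch, using the PG id and acting set shown above:

```
rados list-inconsistent-obj 1.43 --format=json-pretty   # show which objects/shards failed scrub
ceph pg repair 1.43                                     # ask the primary (osd.5) to repair the PG
```
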
>
> The MDS logs have already been provided. I sincerely appreciate you reading through it all.
>
> Thanks,
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


