Maybe this requires some attention. I have a default CentOS 7 setup (maybe not the most recent kernel though) running Ceph Luminous, i.e. no unusual kernels. This is the second or third time that a VM has gone into a very high load (151) and its services have stopped. I have two VMs that both mount the same two CephFS 'shares'. After the last incident I unmounted the shares on the second server (we are migrating to a new environment, so this second server is not doing anything). Last time I thought this might be related to my work on switching from the stupid allocator to the bitmap allocator. Anyway, yesterday I thought I would mount the two shares on the second server again and see what happens, and this morning the high load was back. As far as I know the second server only runs a cron job on the CephFS mounts that creates snapshots.

1) I still have increased load on the OSD nodes, coming from CephFS. How can I see which client is doing this? I don't seem to get this from 'ceph daemon mds.c session ls', but 'ceph osd pool stats | grep client -B 1' indicates it is CephFS.
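What I could still try here, as a rough sketch (mds.c is the MDS name used elsewhere in this mail; osd.25 and the 'osd-node' prompt are only examples, and each 'ceph daemon' command has to be run on the host of that daemon), is to compare the client ids in the in-flight op dumps with the sessions listed by the MDS:

[@c03 ~]# ceph daemon mds.c dump_ops_in_flight        # MDS requests in progress, each tagged with a client.<id>
[@c03 ~]# ceph daemon mds.c session ls                # 'inst' maps a client.<id> to its IP address
[@osd-node ~]# ceph daemon osd.25 dump_ops_in_flight  # ops currently in flight on a busy OSD
[@osd-node ~]# ceph daemon osd.25 dump_historic_ops   # recently completed (slow) ops, also tagged with client.<id>

Matching the client.<id>/IP from those dumps against the two VMs should show which mount is generating the traffic, assuming the client shows up in 'session ls' at all.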
2) 'ceph osd blacklist ls' reports: No blacklist entries

3) The first server keeps generating messages like the following, while there is no issue with connectivity:

[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: osd25 192.168.10.114:6804 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: osd18 192.168.10.112:6802 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: osd22 192.168.10.111:6811 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session established

PS: 'dmesg -T' gives me strange times; as you can see they are in the future. The OS time is 2 minutes behind that, and the OS time is the correct one (ntpd is in sync).

[@ ]# uptime
 10:39:17 up 50 days, 13:31,  2 users,  load average: 3.60, 3.02, 2.57

4) Unmounting the filesystem on the first server fails.

5) Evicting the CephFS sessions of the first server does not change the CephFS load on the OSD nodes.

6) Unmounting all CephFS clients still leaves me with CephFS activity on the data pool and on the OSD nodes:

[@c03 ~]# ceph daemon mds.c session ls
[]

7) On the first server:

[@ ~]# ps -auxf| grep D
USER       PID %CPU %MEM    VSZ   RSS TTY   STAT START  TIME COMMAND
root      6716  3.0  0.0      0     0 ?     D    10:18  0:59  \_ [kworker/0:2]
root     20039  0.0  0.0 123520  1212 pts/0 D+   10:28  0:00  |   \_ umount /home/mail-archive/

[@ ~]# cat /proc/6716/stack
[<ffffffff8385e110>] __wait_on_freeing_inode+0xb0/0xf0
[<ffffffff8385e1e9>] find_inode+0x99/0xc0
[<ffffffff8385e281>] ilookup5_nowait+0x71/0x90
[<ffffffff8385f09f>] ilookup5+0xf/0x60
[<ffffffffc060fb35>] remove_session_caps+0xf5/0x1d0 [ceph]
[<ffffffffc06158fc>] dispatch+0x39c/0xb00 [ceph]
[<ffffffffc052afb4>] try_read+0x514/0x12c0 [libceph]
[<ffffffffc052bf64>] ceph_con_workfn+0xe4/0x1530 [libceph]
[<ffffffff836b9e3f>] process_one_work+0x17f/0x440
[<ffffffff836baed6>] worker_thread+0x126/0x3c0
[<ffffffff836c1d21>] kthread+0xd1/0xe0
[<ffffffff83d75c37>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff

[@ ~]# cat /proc/20039/stack
[<ffffffff837b5e14>] __lock_page+0x74/0x90
[<ffffffff837c744c>] truncate_inode_pages_range+0x6cc/0x700
[<ffffffff837c74ef>] truncate_inode_pages_final+0x4f/0x60
[<ffffffff8385f02c>] evict+0x16c/0x180
[<ffffffff8385f87c>] iput+0xfc/0x190
[<ffffffff8385aa18>] shrink_dcache_for_umount_subtree+0x158/0x1e0
[<ffffffff8385c3bf>] shrink_dcache_for_umount+0x2f/0x60
[<ffffffff8384426f>] generic_shutdown_super+0x1f/0x100
[<ffffffff838446b2>] kill_anon_super+0x12/0x20
[<ffffffffc05ea130>] ceph_kill_sb+0x30/0x80 [ceph]
[<ffffffff83844a6e>] deactivate_locked_super+0x4e/0x70
[<ffffffff838451f6>] deactivate_super+0x46/0x60
[<ffffffff8386373f>] cleanup_mnt+0x3f/0x80
[<ffffffff838637d2>] __cleanup_mnt+0x12/0x20
[<ffffffff836be88b>] task_work_run+0xbb/0xe0
[<ffffffff8362bc65>] do_notify_resume+0xa5/0xc0
[<ffffffff83d76134>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff

What should I do now? In ceph.conf I only have these entries, and I am not sure whether I should keep them:

# 100k+ files in 2 folders
mds bal fragment size max = 120000
mds_session_blacklist_on_timeout = false
mds_session_blacklist_on_evict = false
mds_cache_memory_limit = 8000000000
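For what it is worth, one way to double-check what the running MDS actually uses for these settings before deciding whether to drop them (again just a sketch; mds.c is the daemon name used above and the commands have to be run on its host):

[@c03 ~]# ceph daemon mds.c config get mds_cache_memory_limit      # value the daemon is actually running with
[@c03 ~]# ceph daemon mds.c config show | grep -E 'mds_bal_fragment_size_max|mds_session_blacklist'   # the other entries listed above
[@c03 ~]# ceph daemon mds.c cache status                           # current cache memory usage, if this Luminous build supports it

Comparing the cache usage with mds_cache_memory_limit should at least show whether the 8 GB limit is being respected.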