I decided to restart osd.0, then the load of the cephfs and on all osd
nodes dropped. After this I still have on the first server

[@~]# cat /sys/kernel/debug/ceph/0f1701f5-453a-4a3b-928d-f652a2bbbcb0.client3574310/osdc
REQUESTS 0 homeless 0
LINGER REQUESTS
BACKOFFS

[@~]# cat /sys/kernel/debug/ceph/0f1701f5-453a-4a3b-928d-f652a2bbbcb0.client3584224/osdc
REQUESTS 2 homeless 0
317841 osd0 20.d6ec44c1 20.1 [0,28,5]/0 [0,28,5]/0 e65040 10001b44a70.00000000 0x40001c 102023 read
317853 osd0 20.5956d31b 20.1b [0,5,10]/0 [0,5,10]/0 e65040 10001ad8962.00000000 0x40001c 40731 read
LINGER REQUESTS
BACKOFFS

And dmesg -T keeps giving me these (again with wrong timestamps)

[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 session established
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 session lost, hunting for new mon
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 session established
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 session lost, hunting for new mon
[Thu Jul 11 11:23:21 2019] libceph: mon2 192.168.10.113:6789 session established
[Thu Jul 11 11:23:21 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 11:23:21 2019] libceph: mon2 192.168.10.113:6789 session lost, hunting for new mon
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 session established
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 session lost, hunting for new mon
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 session established
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 session lost, hunting for new mon

What to do now? Restarting the monitor did not help.
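
To compare the osdc listings before and after the osd.0 restart, the outstanding requests can be tallied per OSD. This is only a rough sketch: it assumes each request row starts with a numeric tid followed by an osdN token, as in the output in this thread, and requests_per_osd is a helper name made up for illustration, not part of any Ceph tooling.

```python
import re
from collections import Counter

# Assumption: a request row begins with a numeric tid, then an "osdN" token.
# Section headers such as "REQUESTS 2 homeless 0" or "BACKOFFS" do not match.
REQUEST_ROW = re.compile(r"^\s*(\d+)\s+(osd\d+)\s")


def requests_per_osd(osdc_text):
    """Count outstanding request rows per OSD in the text of an osdc file."""
    counts = Counter()
    for line in osdc_text.splitlines():
        match = REQUEST_ROW.match(line)
        if match:
            counts[match.group(2)] += 1
    return counts
```

Feeding it the contents of the osdc file of each client under /sys/kernel/debug/ceph/ and printing the result of most_common() would show which OSDs hold the most pending reads at that moment.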

-----Original Message-----
Subject: Re: Luminous cephfs maybe not to stable as expected?

Forgot to add these

[@ ~]# cat /sys/kernel/debug/ceph/0f1701f5-453a-4a3b-928d-f652a2bbbcb0.client3574310/osdc
REQUESTS 0 homeless 0
LINGER REQUESTS
BACKOFFS

[@~]# cat /sys/kernel/debug/ceph/0f1701f5-453a-4a3b-928d-f652a2bbbcb0.client3584224/osdc
REQUESTS 38 homeless 0
317841 osd0 20.d6ec44c1 20.1 [0,28,5]/0 [0,28,5]/0 e65040 10001b44a70.00000000 0x40001c 101139 read
317853 osd0 20.5956d31b 20.1b [0,5,10]/0 [0,5,10]/0 e65040 10001ad8962.00000000 0x40001c 39847 read
317835 osd3 20.ede889de 20.1e [3,12,27]/3 [3,12,27]/3 e65040 10001ad80f6.00000000 0x40001c 87758 read
317838 osd3 20.7b730a4e 20.e [3,31,9]/3 [3,31,9]/3 e65040 10001ad89d8.00000000 0x40001c 83444 read
317844 osd3 20.feead84c 20.c [3,13,18]/3 [3,13,18]/3 e65040 10001ad8733.00000000 0x40001c 77267 read
317852 osd3 20.bd2658e 20.e [3,31,9]/3 [3,31,9]/3 e65040 10001ad7e00.00000000 0x40001c 39331 read
317830 osd4 20.922e6d04 20.4 [4,16,27]/4 [4,16,27]/4 e65040 10001ad80f2.00000000 0x40001c 86326 read
317837 osd4 20.fe93d4ab 20.2b [4,14,25]/4 [4,14,25]/4 e65040 10001ad80fb.00000000 0x40001c 78951 read
317839 osd4 20.d7af926b 20.2b [4,14,25]/4 [4,14,25]/4 e65040 10001ad80ee.00000000 0x40001c 77556 read
317849 osd5 20.5fcb95c5 20.5 [5,18,29]/5 [5,18,29]/5 e65040 10001ad7f75.00000000 0x40001c 61147 read
317857 osd5 20.28764e9a 20.1a [5,7,28]/5 [5,7,28]/5 e65040 10001ad8a10.00000000 0x40001c 30369 read
317859 osd5 20.7bb79985 20.5 [5,18,29]/5 [5,18,29]/5 e65040 10001ad7fe8.00000000 0x40001c 27942 read
317836 osd8 20.e7bf5cf4 20.34 [8,5,10]/8 [8,5,10]/8 e65040 10001ad7d79.00000000 0x40001c 133699 read
317842 osd8 20.abbb9df4 20.34 [8,5,10]/8 [8,5,10]/8 e65040 10001d5903f.00000000 0x40001c 125308 read
317850 osd8 20.ecd0034 20.34 [8,5,10]/8 [8,5,10]/8 e65040 10001ad89b2.00000000 0x40001c 68348 read
317854 osd8 20.cef50134 20.34 [8,5,10]/8 [8,5,10]/8 e65040 10001ad8728.00000000 0x40001c 57431 read
317861 osd8 20.3e859bb4 20.34 [8,5,10]/8 [8,5,10]/8 e65040 10001ad8108.00000000 0x40001c 50642 read
317847 osd9 20.fc9e9f43 20.3 [9,29,17]/9 [9,29,17]/9 e65040 10001ad8101.00000000 0x40001c 88464 read
317848 osd9 20.d32b6ac3 20.3 [9,29,17]/9 [9,29,17]/9 e65040 10001ad8100.00000000 0x40001c 85929 read
317862 osd11 20.ee6cc689 20.9 [11,0,12]/11 [11,0,12]/11 e65040 10001ad7d64.00000000 0x40001c 40266 read
317843 osd12 20.a801f0e9 20.29 [12,26,8]/12 [12,26,8]/12 e65040 10001ad7f07.00000000 0x40001c 86610 read
317851 osd12 20.8bb48de9 20.29 [12,26,8]/12 [12,26,8]/12 e65040 10001ad7e4f.00000000 0x40001c 46746 read
317860 osd12 20.47815f36 20.36 [12,0,28]/12 [12,0,28]/12 e65040 10001ad8035.00000000 0x40001c 35249 read
317831 osd15 20.9e3acb53 20.13 [15,0,1]/15 [15,0,1]/15 e65040 10001ad8978.00000000 0x40001c 85329 read
317840 osd15 20.2a40efdf 20.1f [15,4,17]/15 [15,4,17]/15 e65040 10001ad7ef8.00000000 0x40001c 76282 read
317846 osd15 20.8143f15f 20.1f [15,4,17]/15 [15,4,17]/15 e65040 10001ad89d1.00000000 0x40001c 61297 read
317864 osd15 20.c889a49c 20.1c [15,0,31]/15 [15,0,31]/15 e65040 10001ad89fb.00000000 0x40001c 24385 read
317832 osd18 20.f76227a 20.3a [18,6,15]/18 [18,6,15]/18 e65040 10001ad8020.00000000 0x40001c 82852 read
317833 osd18 20.d8edab31 20.31 [18,29,14]/18 [18,29,14]/18 e65040 10001ad8952.00000000 0x40001c 82852 read
317858 osd18 20.8f69d231 20.31 [18,29,14]/18 [18,29,14]/18 e65040 10001ad8176.00000000 0x40001c 32400 read
317855 osd22 20.b3342c0f 20.f [22,18,31]/22 [22,18,31]/22 e65040 10001ad8146.00000000 0x40001c 51024 read
317863 osd23 20.cde0ce7b 20.3b [23,1,6]/23 [23,1,6]/23 e65040 10001ad856c.00000000 0x40001c 34521 read
317865 osd23 20.702d2dfe 20.3e [23,9,22]/23 [23,9,22]/23 e65040 10001ad8a5e.00000000 0x40001c 30664 read
317866 osd23 20.cb4a32fe 20.3e [23,9,22]/23 [23,9,22]/23 e65040 10001ad8575.00000000 0x40001c 29683 read
317867 osd23 20.9a008910 20.10 [23,12,6]/23 [23,12,6]/23 e65040 10001ad7d24.00000000 0x40001c 29683 read
317834 osd25 20.6efd4911 20.11 [25,4,0]/25 [25,4,0]/25 e65040 10001ad8023.00000000 0x40001c 147589 read
317856 osd26 20.febb382a 20.2a [26,0,18]/26 [26,0,18]/26 e65040 10001ad8145.00000000 0x40001c 65169 read
317845 osd27 20.5b433067 20.27 [27,7,14]/27 [27,7,14]/27 e65040 10001ad8965.00000000 0x40001c 124461 read
LINGER REQUESTS
BACKOFFS

-----Original Message-----
Subject: Luminous cephfs maybe not to stable as expected?

Maybe this requires some attention. I have a default CentOS 7 (maybe not
the most recent kernel though), Ceph Luminous setup, i.e. no different
kernels.

This is the 2nd or 3rd time that a VM has gone into a high load (151) and
stopped its services. I have two VMs, both mounting the same 2 cephfs
'shares'. After the last incident I unmounted the shares on the 2nd
server. (We are migrating to a new environment, so this 2nd server is not
doing anything.) Last time I thought maybe this could be related to my
work on the switch from the stupid allocator to the bitmap allocator.

Anyway, yesterday I thought: let's mount the 2 shares on the 2nd server
again and see what happens. And this morning the high load was back.
Afaik the 2nd server is only running a cron job on the cephfs mounts,
creating snapshots.

1) I still have increased load on the osd nodes now, from cephfs. How can
I see which client is doing this? I don't seem to get this from
'ceph daemon mds.c session ls', however 'ceph osd pool stats | grep
client -B 1' indicates it is cephfs.

2) ceph osd blacklist ls
No blacklist entries

3) The first server keeps generating messages like these, while there is
no issue with connectivity.

[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: osd25 192.168.10.114:6804 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: osd18 192.168.10.112:6802 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: osd22 192.168.10.111:6811 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session established

PS: dmesg -T gives me strange times. As you can see these are in the
future; the OS time is 2 min behind (and that is the correct one, synced
by ntpd).

[@ ]# uptime
 10:39:17 up 50 days, 13:31, 2 users, load average: 3.60, 3.02, 2.57

4) Unmounting the filesystem on the first server fails.

5) Evicting the cephfs sessions of the first server does not change the
load of the cephfs on the osd nodes.

6) Unmounting all cephfs clients still leaves me with cephfs activity on
the data pool and on the osd nodes.

[@c03 ~]# ceph daemon mds.c session ls
[]

7) On the first server

[@~]# ps -auxf| grep D
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      6716  3.0  0.0      0     0 ?        D    10:18   0:59  \_ [kworker/0:2]
root     20039  0.0  0.0 123520  1212 pts/0    D+   10:28   0:00  |   \_ umount /home/mail-archive/

[@ ~]# cat /proc/6716/stack
[<ffffffff8385e110>] __wait_on_freeing_inode+0xb0/0xf0
[<ffffffff8385e1e9>] find_inode+0x99/0xc0
[<ffffffff8385e281>] ilookup5_nowait+0x71/0x90
[<ffffffff8385f09f>] ilookup5+0xf/0x60
[<ffffffffc060fb35>] remove_session_caps+0xf5/0x1d0 [ceph]
[<ffffffffc06158fc>] dispatch+0x39c/0xb00 [ceph]
[<ffffffffc052afb4>] try_read+0x514/0x12c0 [libceph]
[<ffffffffc052bf64>] ceph_con_workfn+0xe4/0x1530 [libceph]
[<ffffffff836b9e3f>] process_one_work+0x17f/0x440
[<ffffffff836baed6>] worker_thread+0x126/0x3c0
[<ffffffff836c1d21>] kthread+0xd1/0xe0
[<ffffffff83d75c37>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff

[@ ~]# cat /proc/20039/stack
[<ffffffff837b5e14>] __lock_page+0x74/0x90
[<ffffffff837c744c>] truncate_inode_pages_range+0x6cc/0x700
[<ffffffff837c74ef>] truncate_inode_pages_final+0x4f/0x60
[<ffffffff8385f02c>] evict+0x16c/0x180
[<ffffffff8385f87c>] iput+0xfc/0x190
[<ffffffff8385aa18>] shrink_dcache_for_umount_subtree+0x158/0x1e0
[<ffffffff8385c3bf>] shrink_dcache_for_umount+0x2f/0x60
[<ffffffff8384426f>] generic_shutdown_super+0x1f/0x100
[<ffffffff838446b2>] kill_anon_super+0x12/0x20
[<ffffffffc05ea130>] ceph_kill_sb+0x30/0x80 [ceph]
[<ffffffff83844a6e>] deactivate_locked_super+0x4e/0x70
[<ffffffff838451f6>] deactivate_super+0x46/0x60
[<ffffffff8386373f>] cleanup_mnt+0x3f/0x80
[<ffffffff838637d2>] __cleanup_mnt+0x12/0x20
[<ffffffff836be88b>] task_work_run+0xbb/0xe0
[<ffffffff8362bc65>] do_notify_resume+0xa5/0xc0
[<ffffffff83d76134>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff

What to do now? In ceph.conf I only have these entries; I am not sure
whether I should still keep them.

# 100k+ files in 2 folders
mds bal fragment size max = 120000
mds_session_blacklist_on_timeout = false
mds_session_blacklist_on_evict = false
mds_cache_memory_limit = 8000000000

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com