Hi all,

We found that some processes writing to CephFS hang for a long time (> 120 s) while large files (50 GB ~ 300 GB in total) are being uploaded (scp/rsync) to the app node's CephFS mountpoint. The problem is not always reproducible, but when it occurs the web server (nginx) and some other services cannot be accessed, which is quite embarrassing. Please give us some tips.

We are using Ceph 13.2.1 and kernel 3.10.0-693.el7.x86_64.

When the problem occurs, memory pressure on the client is a bit high:

# free -h
              total        used        free      shared  buff/cache   available
Mem:            30G         11G        230M        3.3M         18G         18G
Swap:            0B          0B          0B

# mpstat -P ALL
Average:  CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest %gnice  %idle
Average:  all   3.80   0.04   1.46   55.12   0.00   0.08   0.00   0.00   0.00  39.49
Average:    0   1.52   0.00   0.51   97.98   0.00   0.00   0.00   0.00   0.00   0.00
Average:    1  13.07   0.00   1.51    0.00   0.00   0.00   0.00   0.00   0.00  85.43
Average:    2   3.00   0.00   3.00    0.50   0.00   0.00   0.00   0.00   0.00  93.50
Average:    3   2.50   0.00   2.00    0.00   0.00   0.00   0.00   0.00   0.00  95.50
Average:    4   1.50   0.50   1.50   96.00   0.00   0.50   0.00   0.00   0.00   0.00
Average:    5   0.50   0.00   1.01   32.16   0.00   0.00   0.00   0.00   0.00  66.33
Average:    6   2.49   0.00   2.49    0.00   0.00   0.50   0.00   0.00   0.00  94.53
Average:    7   1.52   0.00   0.51   97.98   0.00   0.00   0.00   0.00   0.00   0.00
Average:    8   1.50   0.00   1.50   97.00   0.00   0.00   0.00   0.00   0.00   0.00
Average:    9  14.07   0.00   2.01   46.23   0.00   0.00   0.00   0.00   0.00  37.69
Average:   10   2.00   0.00   0.50   97.50   0.00   0.00   0.00   0.00   0.00   0.00
Average:   11   2.01   0.00   1.01   96.98   0.00   0.00   0.00   0.00   0.00   0.00
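Since free/mpstat show the client spending most of its time in iowait with a large amount of page cache in use, here is what we plan to capture on the client the next time the hang happens (generic kernel counters and knobs, nothing Ceph-specific; we have not tuned any of the vm.dirty_* settings yet):

How much page cache is dirty or under writeback while the hang lasts:
# grep -E 'Dirty|Writeback' /proc/meminfo

The current writeback thresholds on the client:
# sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs vm.dirty_writeback_centisecs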
We can see some processes staying in state 'D' for a long time. Here are two fragments from dmesg:

[Fri Nov 30 15:10:43 2018] INFO: task java:11662 blocked for more than 120 seconds.
[Fri Nov 30 15:10:43 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Nov 30 15:10:43 2018] java            D ffffffff816a4180     0 11662  10409 0x00000180
[Fri Nov 30 15:10:43 2018]  ffff8805f4febbd0 0000000000000082 ffff8807e2edcf10 ffff8805f4febfd8
[Fri Nov 30 15:10:43 2018]  ffff8805f4febfd8 ffff8805f4febfd8 ffff8807e2edcf10 ffff88086df56cc0
[Fri Nov 30 15:10:43 2018]  0000000000000000 7fffffffffffffff ffff88087ffc3be8 ffffffff816a4180
[Fri Nov 30 15:10:43 2018] Call Trace:
[Fri Nov 30 15:10:43 2018]  [<ffffffff816a4180>] ? bit_wait+0x50/0x50
[Fri Nov 30 15:10:43 2018]  [<ffffffff816a6059>] schedule+0x29/0x70
[Fri Nov 30 15:10:43 2018]  [<ffffffff816a3b69>] schedule_timeout+0x239/0x2c0
[Fri Nov 30 15:10:43 2018]  [<ffffffff81190541>] ? pagevec_lookup_tag+0x21/0x30
[Fri Nov 30 15:10:43 2018]  [<ffffffffc082b598>] ? ceph_writepages_start+0x1298/0x1520 [ceph]
[Fri Nov 30 15:10:43 2018]  [<ffffffff810e939c>] ? ktime_get_ts64+0x4c/0xf0
[Fri Nov 30 15:10:43 2018]  [<ffffffff816a4180>] ? bit_wait+0x50/0x50
[Fri Nov 30 15:10:43 2018]  [<ffffffff816a56dd>] io_schedule_timeout+0xad/0x130
[Fri Nov 30 15:10:43 2018]  [<ffffffff816a5778>] io_schedule+0x18/0x20
[Fri Nov 30 15:10:43 2018]  [<ffffffff816a4191>] bit_wait_io+0x11/0x50
[Fri Nov 30 15:10:43 2018]  [<ffffffff816a3cb5>] __wait_on_bit+0x65/0x90
[Fri Nov 30 15:10:43 2018]  [<ffffffff81181cb1>] wait_on_page_bit+0x81/0xa0
[Fri Nov 30 15:10:43 2018]  [<ffffffff810b19d0>] ? wake_bit_function+0x40/0x40
[Fri Nov 30 15:10:43 2018]  [<ffffffff81181de1>] __filemap_fdatawait_range+0x111/0x190
[Fri Nov 30 15:10:43 2018]  [<ffffffff811b0d79>] ? do_numa_page+0x159/0x1e0
[Fri Nov 30 15:10:43 2018]  [<ffffffff8118f42e>] ? do_writepages+0x1e/0x40
[Fri Nov 30 15:10:43 2018]  [<ffffffff81181e74>] filemap_fdatawait_range+0x14/0x30
[Fri Nov 30 15:10:43 2018]  [<ffffffff81183db6>] filemap_write_and_wait_range+0x56/0x90
[Fri Nov 30 15:10:43 2018]  [<ffffffffc083326c>] ceph_fsync+0x6c/0x510 [ceph]
[Fri Nov 30 15:10:43 2018]  [<ffffffff81137273>] ? __secure_computing+0x73/0x240
[Fri Nov 30 15:10:43 2018]  [<ffffffff8111f5c6>] ? __audit_syscall_exit+0x1e6/0x280
[Fri Nov 30 15:10:43 2018]  [<ffffffff81231d85>] do_fsync+0x65/0xa0
[Fri Nov 30 15:10:43 2018]  [<ffffffff81232083>] SyS_fdatasync+0x13/0x20
[Fri Nov 30 15:10:43 2018]  [<ffffffff816b1d4e>] tracesys+0xd9/0xde

[Fri Nov 30 15:12:43 2018] INFO: task prometheus:9919 blocked for more than 120 seconds.
[Fri Nov 30 15:12:43 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Nov 30 15:12:43 2018] prometheus      D ffff8807e87dbad8     0  9919   9879 0x00000080
[Fri Nov 30 15:12:43 2018]  ffff8807e87db970 0000000000000082 ffff8803e273cf10 ffff8807e87dbfd8
[Fri Nov 30 15:12:43 2018]  ffff8807e87dbfd8 ffff8807e87dbfd8 ffff8803e273cf10 ffff88046e716cc0
[Fri Nov 30 15:12:43 2018]  0000000000000000 7fffffffffffffff ffffffff816a4180 ffff8807e87dbad8
[Fri Nov 30 15:12:43 2018] Call Trace:
[Fri Nov 30 15:12:43 2018]  [<ffffffff816a4180>] ? bit_wait+0x50/0x50
[Fri Nov 30 15:12:43 2018]  [<ffffffff816a6059>] schedule+0x29/0x70
[Fri Nov 30 15:12:43 2018]  [<ffffffff816a3b69>] schedule_timeout+0x239/0x2c0
[Fri Nov 30 15:12:43 2018]  [<ffffffff81090007>] ? local_bh_enable+0x17/0x20
[Fri Nov 30 15:12:43 2018]  [<ffffffff816a4180>] ? bit_wait+0x50/0x50
[Fri Nov 30 15:12:43 2018]  [<ffffffff816a56dd>] io_schedule_timeout+0xad/0x130
[Fri Nov 30 15:12:43 2018]  [<ffffffff816a5778>] io_schedule+0x18/0x20
[Fri Nov 30 15:12:43 2018]  [<ffffffff816a4191>] bit_wait_io+0x11/0x50
[Fri Nov 30 15:12:43 2018]  [<ffffffff816a3eaf>] __wait_on_bit_lock+0x5f/0xc0
[Fri Nov 30 15:12:43 2018]  [<ffffffff81182094>] __lock_page+0x74/0x90
[Fri Nov 30 15:12:43 2018]  [<ffffffff810b19d0>] ? wake_bit_function+0x40/0x40
[Fri Nov 30 15:12:43 2018]  [<ffffffff81182a34>] __find_lock_page+0x54/0x70
[Fri Nov 30 15:12:43 2018]  [<ffffffff811836e2>] grab_cache_page_write_begin+0x62/0xd0
[Fri Nov 30 15:12:43 2018]  [<ffffffffc082a203>] ceph_write_begin+0x43/0xe0 [ceph]
[Fri Nov 30 15:12:43 2018]  [<ffffffff811824ce>] generic_file_buffered_write+0x11e/0x2a0
[Fri Nov 30 15:12:43 2018]  [<ffffffffc08259c4>] ceph_aio_write+0xac4/0xc60 [ceph]
[Fri Nov 30 15:12:43 2018]  [<ffffffff8160417a>] ? inet_recvmsg+0x7a/0xa0
[Fri Nov 30 15:12:43 2018]  [<ffffffff811feb3d>] do_sync_write+0x8d/0xd0
[Fri Nov 30 15:12:43 2018]  [<ffffffff811ff5fd>] vfs_write+0xbd/0x1e0
[Fri Nov 30 15:12:43 2018]  [<ffffffff8120040f>] SyS_write+0x7f/0xe0
[Fri Nov 30 15:12:43 2018]  [<ffffffff816b1b49>] system_call_fastpath+0x16/0x1b
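The two traces above are only the tasks that hit the 120 s hung-task timeout (one blocked in fdatasync waiting for writeback, one blocked in a buffered write while locking a page). If stacks of every blocked task would be useful, we can capture them next time like this (generic kernel facilities, run as root):

List tasks currently in uninterruptible (D) sleep and what they are waiting on:
# ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'

Dump kernel stacks of all blocked tasks into dmesg:
# echo w > /proc/sysrq-trigger

Or read the stack of a single stuck pid directly (<pid> is a placeholder):
# cat /proc/<pid>/stack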
However, the CephFS cluster status shows healthy:

# ceph -s
  cluster:
    id:     b27bc9a3-2e32-4782-b60f-b7901b0b26e5
    health: HEALTH_OK

  services:
    mon: 2 daemons, quorum skyaxe-storage-0,skyaxe-storage-1
    mgr: skyaxe-storage-0(active), standbys: skyaxe-storage-1
    mds: cephfs-1/1/1 up {0=skyaxe-storage-1=up:active}, 1 up:standby
    osd: 8 osds: 8 up, 8 in

  data:
    pools:   2 pools, 512 pgs
    objects: 599.7 k objects, 876 GiB
    usage:   2.6 TiB used, 13 TiB / 15 TiB avail
    pgs:     512 active+clean

  io:
    client:   3.7 KiB/s wr, 0 op/s rd, 1 op/s wr

We can also see some unfinished requests in osdc:

# cat /sys/kernel/debug/ceph/xxx/osdc
REQUESTS 98 homeless 0
8529  osd1  1.1bfdee5   [1,7]/1  [1,7]/1  100000011be.00000000  0x400024  1  0'0  write
8532  osd1  1.2d81c338  [1,6]/1  [1,6]/1  100000011c0.00000000  0x400024  1  0'0  write
8536  osd1  1.49324f26  [1,4]/1  [1,4]/1  100000011c4.00000000  0x400024  1  0'0  write
8542  osd1  1.1403106f  [1,7]/1  [1,7]/1  100000011c8.00000000  0x400024  1  0'0  write
8549  osd1  1.ea273113  [1,4]/1  [1,4]/1  100000011ce.00000000  0x400024  1  0'0  write
8562  osd1  1.f2d68e9a  [1,5]/1  [1,5]/1  100000011d8.00000000  0x400024  1  0'0  write
8566  osd1  1.1f212283  [1,4]/1  [1,4]/1  100000011db.00000000  0x400024  1  0'0  write
8568  osd1  1.2e1984e7  [1,5]/1  [1,5]/1  100000011de.00000000  0x400024  1  0'0  write
8576  osd1  1.da7e63d0  [1,5]/1  [1,5]/1  100000011e1.00000000  0x400024  1  0'0  write
8581  osd1  1.3b0d280   [1,4]/1  [1,4]/1  100000011e8.00000000  0x400024  1  0'0  write
8582  osd1  1.80227f0b  [1,5]/1  [1,5]/1  100000011ea.00000000  0x400024  1  0'0  write
8593  osd1  1.1efd9391  [1,6]/1  [1,6]/1  100000011ef.00000000  0x400024  1  0'0  write
8594  osd1  1.3db6e880  [1,4]/1  [1,4]/1  100000011f0.00000000  0x400024  1  0'0  write
8611  osd1  1.6d20c41d  [1,4]/1  [1,4]/1  1000000028c.00000000  0x400024  1  0'0  write
8612  osd1  1.6d20c41d  [1,4]/1  [1,4]/1  1000000028c.00000000  0x400024  1  0'0  write
8616  osd1  1.f4d90a4d  [1,7]/1  [1,7]/1  10000001206.00000000  0x400024  1  0'0  write
8619  osd1  1.740514fe  [1,5]/1  [1,5]/1  10000001208.00000000  0x400024  1  0'0  write
8630  osd1  1.d5fbabe0  [1,6]/1  [1,6]/1  10000001216.00000000  0x400024  1  0'0  write
8631  osd1  1.daf28f83  [1,4]/1  [1,4]/1  10000001217.00000000  0x400024  1  0'0  write
8635  osd1  1.d29e01a6  [1,5]/1  [1,5]/1  1000000121a.00000000  0x400024  1  0'0  write
8640  osd1  1.2de0ff63  [1,7]/1  [1,7]/1  10000001220.00000000  0x400024  1  0'0  write
8658  osd1  1.2f91206f  [1,7]/1  [1,7]/1  1000000122c.00000000  0x400064  1  0'0  write,startsync
...

Here is our Ceph config:

# cat /etc/ceph/ceph.conf
[global]
fsid = b27bc9a3-2e32-4782-b60f-b7901b0b26e5
mon_initial_members = skyaxe-storage-0, skyaxe-storage-1
mon_host = 10.0.30.111,10.0.30.112
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_pool_default_size = 2
public_network = 10.0.30.0/24
cluster_network = 10.0.40.0/24
osd_pool_default_min_size = 1
max_open_files = 131072

[osd]
osd_journal_size = 56320
journal_max_write_bytes = 1073741824
journal_max_write_entries = 10000
journal_queue_max_ops = 50000
journal_queue_max_bytes = 10737418240
osd_mkfs_type = xfs
osd_mkfs_options_xfs = -f
osd_mount_options_xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"
osd_max_write_size = 512
osd_client_message_size_cap = 2147483648
osd_deep_scrub_stride = 131072
osd_op_threads = 8
osd_disk_threads = 4
osd_map_cache_size = 1024
osd_map_cache_bl_size = 128
osd_recovery_op_priority = 4
osd_recovery_max_active = 10
osd_max_backfills = 4
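One more observation from the osdc output above: all of the stuck requests appear to be targeting osd1, even though ceph -s reports HEALTH_OK. If it would help, we can also dump the in-flight and recent ops from that OSD while the hang is happening (standard admin-socket commands, run on the storage node hosting osd.1):

Commit/apply latency overview for all OSDs:
# ceph osd perf

Ops currently queued or in flight on osd.1:
# ceph daemon osd.1 dump_ops_in_flight

Recently completed ops with their per-stage timestamps:
# ceph daemon osd.1 dump_historic_ops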
—
Best Regards
Li, Ning