Hi all, I tried the mds_cache_memory_limit and it doesn’t work. But unexpectedly, we found drop_cache can work. We added a "echo 3 > /proc/sys/vm/drop_caches” cron job. Let it drop cache every 10 minutes and the problem didn’t occur for a whole night. : ) It seems to work but we can’t understand why. :( — Best Regards Li, Ning > On Dec 6, 2018, at 18:46, Li,Ning <lining916740672@xxxxxxxxxx> wrote: > > I tested the dirty_ratio and dirty_background_ratio setting on ceph storage server side. > It didn’t work. > > > The problem reproduced just now. > > > Some process stay in D state for over 10 minitues. > These osd requests is about to take several minutes. > > Here are some logs: > > [Thu Dec 6 17:59:18 2018] INFO: task java:11477 blocked for more than 120 seconds. > [Thu Dec 6 17:59:18 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [Thu Dec 6 17:59:18 2018] java D ffffffff816a4180 0 11477 8391 0x00000180 > [Thu Dec 6 17:59:18 2018] ffff8807942bbbd0 0000000000000086 ffff8801f7e79fa0 ffff8807942bbfd8 > [Thu Dec 6 17:59:18 2018] ffff8807942bbfd8 ffff8807942bbfd8 ffff8801f7e79fa0 ffff88086de56cc0 > [Thu Dec 6 17:59:18 2018] 0000000000000000 7fffffffffffffff ffff88047ffa04e8 ffffffff816a4180 > [Thu Dec 6 17:59:18 2018] Call Trace: > [Thu Dec 6 17:59:18 2018] [<ffffffff816a4180>] ? bit_wait+0x50/0x50 > [Thu Dec 6 17:59:18 2018] [<ffffffff816a6059>] schedule+0x29/0x70 > [Thu Dec 6 17:59:18 2018] [<ffffffff816a3b69>] schedule_timeout+0x239/0x2c0 > [Thu Dec 6 17:59:18 2018] [<ffffffff81190541>] ? pagevec_lookup_tag+0x21/0x30 > [Thu Dec 6 17:59:18 2018] [<ffffffffc081f598>] ? ceph_writepages_start+0x1298/0x1520 [ceph] > [Thu Dec 6 17:59:18 2018] [<ffffffff810e939c>] ? ktime_get_ts64+0x4c/0xf0 > [Thu Dec 6 17:59:18 2018] [<ffffffff816a4180>] ? bit_wait+0x50/0x50 > [Thu Dec 6 17:59:18 2018] [<ffffffff816a56dd>] io_schedule_timeout+0xad/0x130 > [Thu Dec 6 17:59:18 2018] [<ffffffff816a5778>] io_schedule+0x18/0x20 > [Thu Dec 6 17:59:18 2018] [<ffffffff816a4191>] bit_wait_io+0x11/0x50 > [Thu Dec 6 17:59:18 2018] [<ffffffff816a3cb5>] __wait_on_bit+0x65/0x90 > [Thu Dec 6 17:59:18 2018] [<ffffffff81181cb1>] wait_on_page_bit+0x81/0xa0 > [Thu Dec 6 17:59:18 2018] [<ffffffff810b19d0>] ? wake_bit_function+0x40/0x40 > [Thu Dec 6 17:59:18 2018] [<ffffffff81181de1>] __filemap_fdatawait_range+0x111/0x190 > [Thu Dec 6 17:59:18 2018] [<ffffffff8118f42e>] ? do_writepages+0x1e/0x40 > [Thu Dec 6 17:59:18 2018] [<ffffffff81181e74>] filemap_fdatawait_range+0x14/0x30 > [Thu Dec 6 17:59:18 2018] [<ffffffff81183db6>] filemap_write_and_wait_range+0x56/0x90 > [Thu Dec 6 17:59:18 2018] [<ffffffffc082726c>] ceph_fsync+0x6c/0x510 [ceph] > [Thu Dec 6 17:59:18 2018] [<ffffffff81137273>] ? __secure_computing+0x73/0x240 > [Thu Dec 6 17:59:18 2018] [<ffffffff8111f5c6>] ? __audit_syscall_exit+0x1e6/0x280 > [Thu Dec 6 17:59:18 2018] [<ffffffff81231d85>] do_fsync+0x65/0xa0 > [Thu Dec 6 17:59:18 2018] [<ffffffff81232083>] SyS_fdatasync+0x13/0x20 > [Thu Dec 6 17:59:18 2018] [<ffffffff816b1d4e>] tracesys+0xd9/0xde > [Thu Dec 6 18:03:38 2018] libceph: osd3 10.0.30.111:6808 socket closed (con state OPEN) > [Thu Dec 6 18:03:39 2018] libceph: osd7 10.0.30.112:6807 socket closed (con state OPEN) > [Thu Dec 6 18:03:39 2018] libceph: osd6 10.0.30.112:6805 socket closed (con state OPEN) > > > cephfs kernel client side osdc: > > REQUESTS 1548 homeless 0 > 124188 osd5 1.7e6c3097 [5,2]/5 [5,2]/5 100004474e2.000006bf 0x400024 1 0'0 write > 124191 osd5 1.37d4eeac [5,2]/5 [5,2]/5 100004474e2.000006c2 0x400024 1 0'0 write > 124192 osd5 1.2506d82d [5,1]/5 [5,1]/5 100004474e2.000006c3 0x400024 1 0'0 write > 124198 osd5 1.5326c49c [5,3]/5 [5,3]/5 100004474e1.00000888 0x400024 1 0'0 write > 124219 osd5 1.37203f43 [5,2]/5 [5,2]/5 100004474e2.000006cb 0x400024 1 0'0 write > 124236 osd5 1.1318aeb5 [5,0]/5 [5,0]/5 100002adc43.00000000 0x400024 1 0'0 write > 124268 osd5 1.b4368bad [5,0]/5 [5,0]/5 100004474e1.000008a2 0x400024 1 0'0 write > 124272 osd5 1.a31398ad [5,0]/5 [5,0]/5 100004474e2.000006dc 0x400024 1 0'0 write > 124279 osd5 1.f6363aff [5,0]/5 [5,0]/5 100004474e2.000006e3 0x400024 1 0'0 write > 124281 osd5 1.e3386fca [5,2]/5 [5,2]/5 100004474e2.000006e5 0x400024 1 0'0 write > 124288 osd5 1.a95ff0ac [5,2]/5 [5,2]/5 100004474e1.000008ab 0x400024 1 0'0 write > 124290 osd5 1.4e9c48ff [5,0]/5 [5,0]/5 100004474e1.000008ad 0x400024 1 0'0 write > 124293 osd5 1.e4721518 [5,2]/5 [5,2]/5 100004474e1.000008b0 0x400024 1 0'0 write > 124299 osd5 1.fae2c501 [5,0]/5 [5,0]/5 100004474e2.000006ec 0x400024 1 0'0 write > 124303 osd5 1.708289ad [5,0]/5 [5,0]/5 100004474e2.000006f0 0x400024 1 0'0 write > 124312 osd5 1.85e93586 [5,0]/5 [5,0]/5 100004474e1.000008b8 0x400024 1 0'0 write > 124314 osd5 1.85146f18 [5,2]/5 [5,2]/5 100004474e1.000008ba 0x400024 1 0'0 write > 124316 osd5 1.b4c5875a [5,0]/5 [5,0]/5 100004474e2.000006f2 0x400024 1 0'0 write > 124336 osd5 1.cc35e38e [5,2]/5 [5,2]/5 100004474e1.000008c5 0x400024 1 0'0 write > 124339 osd5 1.c99de197 [5,2]/5 [5,2]/5 100004474e2.000006fe 0x400024 1 0'0 write > 124364 osd5 1.21ea6cb7 [5,2]/5 [5,2]/5 100004474e2.0000070c 0x400024 1 0'0 write > 124368 osd5 1.411148d8 [5,1]/5 [5,1]/5 100004474e2.00000710 0x400024 1 0'0 write > 124370 osd5 1.c56129da [5,1]/5 [5,1]/5 100004474e2.00000712 0x400024 1 0'0 write > 124380 osd5 1.dd7b1c12 [5,1]/5 [5,1]/5 100004474e1.000008db 0x400024 1 0'0 write > 124384 osd5 1.1878e328 [5,2]/5 [5,2]/5 100004474e2.00000715 0x400024 1 0'0 write > 124390 osd5 1.f992feda [5,1]/5 [5,1]/5 100004474e2.0000071b 0x400024 1 0'0 write > 124396 osd5 1.84f2e0ff [5,0]/5 [5,0]/5 100004474e1.000008e0 0x400024 1 0'0 write > 124398 osd5 1.76e9cbb5 [5,0]/5 [5,0]/5 100004474e1.000008e2 0x400024 1 0'0 write > 124405 osd5 1.3a4c5001 [5,0]/5 [5,0]/5 100004474e2.0000071f 0x400024 1 0'0 write > 124413 osd5 1.91f47ada [5,1]/5 [5,1]/5 100004474e2.00000727 0x400024 1 0'0 write > 124420 osd5 1.da08fa82 [5,1]/5 [5,1]/5 100004474e1.000008ed 0x400024 1 0'0 write > 124422 osd5 1.f0dc5efd [5,3]/5 [5,3]/5 100004474e1.000008ef 0x400024 1 0'0 write > 124435 osd5 1.28ba590a [5,2]/5 [5,2]/5 100004474e2.00000732 0x400024 1 0'0 write > 124442 osd5 1.21439cf [5,1]/5 [5,1]/5 100004474e1.000008f8 0x400024 1 0'0 write > > > ceph server side dumps: > > # ceph daemon osd.5 dump_blocked_ops > { > "ops": [], > "complaint_time": 30.000000, > "num_blocked_ops": 0 > } > # ceph daemon osd.5 dump_ops_in_flight > { > "ops": [], > "num_ops": 0 > } > > # ceph daemon osd.5 dump_historic_slow_ops > { > "num to keep": 20, > "threshold to keep": 10, > "Ops": [ > { > "description": "osd_op(client.447197.1:71890 1.b7884c19 1:983211ed:::100004342f9.000015e6:head [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e75)", > "initiated_at": "2018-12-06 18:04:10.573307", > "age": 649.893033, > "duration": 13.827240, > "type_data": { > "flag_point": "commit sent; apply or cleanup", > "client_info": { > "client": "client.447197", > "client_addr": "10.0.30.110:0/3181790720", > "tid": 71890 > }, > "events": [ > { > "time": "2018-12-06 18:04:10.573307", > "event": "initiated" > }, > { > "time": "2018-12-06 18:04:10.573307", > "event": "header_read" > }, > { > "time": "2018-12-06 18:04:10.573309", > "event": "throttled" > }, > { > "time": "2018-12-06 18:04:10.595225", > "event": "all_read" > }, > { > "time": "2018-12-06 18:04:10.595278", > "event": "dispatched" > }, > { > "time": "2018-12-06 18:04:10.595291", > "event": "queued_for_pg" > }, > { > "time": "2018-12-06 18:04:22.970164", > "event": "reached_pg" > }, > { > "time": "2018-12-06 18:04:22.970227", > "event": "started" > }, > { > "time": "2018-12-06 18:04:22.970273", > "event": "waiting for subops from 6" > }, > { > "time": "2018-12-06 18:04:23.323234", > "event": "op_commit" > }, > { > "time": "2018-12-06 18:04:24.400524", > "event": "sub_op_commit_rec from 6" > }, > { > "time": "2018-12-06 18:04:24.400541", > "event": "commit_sent" > }, > { > "time": "2018-12-06 18:04:24.400547", > "event": "done" > } > ] > } > }, > { > "description": "osd_op(client.447197.1:71905 1.b5503364 1:26cc0aad:::100004342fb.00000c45:head [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e75)", > "initiated_at": "2018-12-06 18:04:10.620624", > "age": 649.845717, > "duration": 12.702676, > "type_data": { > "flag_point": "commit sent; apply or cleanup", > "client_info": { > "client": "client.447197", > "client_addr": "10.0.30.110:0/3181790720", > "tid": 71905 > }, > "events": [ > { > "time": "2018-12-06 18:04:10.620624", > "event": "initiated" > }, > { > "time": "2018-12-06 18:04:10.620624", > "event": "header_read" > }, > { > "time": "2018-12-06 18:04:10.620626", > "event": "throttled" > }, > { > "time": "2018-12-06 18:04:10.635423", > "event": "all_read" > }, > { > "time": "2018-12-06 18:04:10.635476", > "event": "dispatched" > }, > { > "time": "2018-12-06 18:04:10.635485", > "event": "queued_for_pg" > }, > { > "time": "2018-12-06 18:04:23.057810", > "event": "reached_pg" > }, > { > "time": "2018-12-06 18:04:23.057896", > "event": "started" > }, > { > "time": "2018-12-06 18:04:23.057945", > "event": "waiting for subops from 7" > }, > { > "time": "2018-12-06 18:04:23.237036", > "event": "sub_op_commit_rec from 7" > }, > { > "time": "2018-12-06 18:04:23.323250", > "event": "op_commit" > }, > { > "time": "2018-12-06 18:04:23.323285", > "event": "commit_sent" > }, > { > "time": "2018-12-06 18:04:23.323300", > "event": "done" > } > ] > } > }, > { > "description": "osd_op(client.447197.1:71913 1.9b59c75f 1:fae39ad9:::100004342fb.00000c4d:head [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e75)", > "initiated_at": "2018-12-06 18:04:10.649445", > "age": 649.816895, > "duration": 12.762881, > "type_data": { > "flag_point": "commit sent; apply or cleanup", > "client_info": { > "client": "client.447197", > "client_addr": "10.0.30.110:0/3181790720", > "tid": 71913 > }, > "events": [ > { > "time": "2018-12-06 18:04:10.649445", > "event": "initiated" > }, > { > "time": "2018-12-06 18:04:10.649445", > "event": "header_read" > }, > { > "time": "2018-12-06 18:04:10.649447", > "event": "throttled" > }, > { > "time": "2018-12-06 18:04:10.667713", > "event": "all_read" > }, > { > "time": "2018-12-06 18:04:10.667783", > "event": "dispatched" > }, > { > "time": "2018-12-06 18:04:10.667803", > "event": "queued_for_pg" > }, > { > "time": "2018-12-06 18:04:23.059423", > "event": "reached_pg" > }, > { > "time": "2018-12-06 18:04:23.059494", > "event": "started" > }, > { > "time": "2018-12-06 18:04:23.059540", > "event": "waiting for subops from 5" > }, > { > "time": "2018-12-06 18:04:23.147873", > "event": "sub_op_commit_rec from 5" > }, > { > "time": "2018-12-06 18:04:23.412282", > "event": "op_commit" > }, > { > "time": "2018-12-06 18:04:23.412312", > "event": "commit_sent" > }, > { > "time": "2018-12-06 18:04:23.412326", > "event": "done" > } > ] > } > }, > { > "description": "osd_op(client.447197.1:71924 1.43f2a896 1:69154fc2:::100004342f9.000015f9:head [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e75)", > "initiated_at": "2018-12-06 18:04:10.680689", > "age": 649.785651, > "duration": 12.731666, > "type_data": { > "flag_point": "commit sent; apply or cleanup", > "client_info": { > "client": "client.447197", > "client_addr": "10.0.30.110:0/3181790720", > "tid": 71924 > }, > "events": [ > { > "time": "2018-12-06 18:04:10.680689", > "event": "initiated" > }, > { > "time": "2018-12-06 18:04:10.680689", > "event": "header_read" > }, > { > "time": "2018-12-06 18:04:10.680692", > "event": "throttled" > }, > { > "time": "2018-12-06 18:04:10.702583", > "event": "all_read" > }, > { > "time": "2018-12-06 18:04:10.702627", > "event": "dispatched" > }, > { > "time": "2018-12-06 18:04:10.702634", > "event": "queued_for_pg" > }, > { > "time": "2018-12-06 18:04:23.147879", > "event": "reached_pg" > }, > { > "time": "2018-12-06 18:04:23.147979", > "event": "started" > }, > { > "time": "2018-12-06 18:04:23.148065", > "event": "waiting for subops from 4" > }, > { > "time": "2018-12-06 18:04:23.237049", > "event": "sub_op_commit_rec from 4" > }, > { > "time": "2018-12-06 18:04:23.412333", > "event": "op_commit" > }, > { > "time": "2018-12-06 18:04:23.412349", > "event": "commit_sent" > }, > { > "time": "2018-12-06 18:04:23.412356", > "event": "done" > } > ] > } > }, > { > "description": "osd_op(client.447197.1:71935 1.df579f5f 1:faf9eafb:::100004342fb.00000c54:head [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e75)", > "initiated_at": "2018-12-06 18:04:10.751790", > "age": 649.714550, > "duration": 12.747687, > "type_data": { > "flag_point": "commit sent; apply or cleanup", > "client_info": { > "client": "client.447197", > "client_addr": "10.0.30.110:0/3181790720", > "tid": 71935 > }, > "events": [ > { > "time": "2018-12-06 18:04:10.751790", > "event": "initiated" > }, > { > "time": "2018-12-06 18:04:10.751790", > "event": "header_read" > }, > { > "time": "2018-12-06 18:04:10.751792", > "event": "throttled" > }, > { > "time": "2018-12-06 18:04:10.764998", > "event": "all_read" > }, > { > "time": "2018-12-06 18:04:10.765049", > "event": "dispatched" > }, > { > "time": "2018-12-06 18:04:10.765059", > "event": "queued_for_pg" > }, > { > "time": "2018-12-06 18:04:23.149915", > "event": "reached_pg" > }, > { > "time": "2018-12-06 18:04:23.149990", > "event": "started" > }, > { > "time": "2018-12-06 18:04:23.150037", > "event": "waiting for subops from 5" > }, > { > "time": "2018-12-06 18:04:23.237059", > "event": "sub_op_commit_rec from 5" > }, > { > "time": "2018-12-06 18:04:23.499434", > "event": "op_commit" > }, > { > "time": "2018-12-06 18:04:23.499466", > "event": "commit_sent" > }, > { > "time": "2018-12-06 18:04:23.499477", > "event": "done" > } > ] > } > }, > { > "description": "osd_op(client.447197.1:71948 1.30315755 1:aaea8c0c:::100004342f9.00001602:head [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e75)", > "initiated_at": "2018-12-06 18:04:10.765065", > "age": 649.701276, > "duration": 12.734734, > "type_data": { > "flag_point": "commit sent; apply or cleanup", > "client_info": { > "client": "client.447197", > "client_addr": "10.0.30.110:0/3181790720", > "tid": 71948 > }, > "events": [ > { > "time": "2018-12-06 18:04:10.765065", > "event": "initiated" > }, > { > "time": "2018-12-06 18:04:10.765065", > "event": "header_read" > }, > { > "time": "2018-12-06 18:04:10.765067", > "event": "throttled" > }, > { > "time": "2018-12-06 18:04:10.789073", > "event": "all_read" > }, > { > "time": "2018-12-06 18:04:10.789125", > "event": "dispatched" > }, > { > "time": "2018-12-06 18:04:10.789136", > "event": "queued_for_pg" > }, > { > "time": "2018-12-06 18:04:23.237064", > "event": "reached_pg" > }, > { > "time": "2018-12-06 18:04:23.237142", > "event": "started" > }, > { > "time": "2018-12-06 18:04:23.237187", > "event": "waiting for subops from 5" > }, > { > "time": "2018-12-06 18:04:23.323299", > "event": "sub_op_commit_rec from 5" > }, > { > "time": "2018-12-06 18:04:23.499499", > "event": "op_commit" > }, > { > "time": "2018-12-06 18:04:23.499513", > "event": "commit_sent" > }, > { > "time": "2018-12-06 18:04:23.499799", > "event": "done" > } > ] > } > }, > { > "description": "osd_op(client.447197.1:71952 1.e0451baf 1:f5d8a207:::100004342f9.00001606:head [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e75)", > "initiated_at": "2018-12-06 18:04:10.789142", > "age": 649.677198, > "duration": 12.809745, > "type_data": { > "flag_point": "commit sent; apply or cleanup", > "client_info": { > "client": "client.447197", > "client_addr": "10.0.30.110:0/3181790720", > "tid": 71952 > }, > "events": [ > { > "time": "2018-12-06 18:04:10.789142", > "event": "initiated" > }, > { > "time": "2018-12-06 18:04:10.789142", > "event": "header_read" > }, > { > "time": "2018-12-06 18:04:10.789145", > "event": "throttled" > }, > { > "time": "2018-12-06 18:04:10.802180", > "event": "all_read" > }, > { > "time": "2018-12-06 18:04:10.802240", > "event": "dispatched" > }, > { > "time": "2018-12-06 18:04:10.802249", > "event": "queued_for_pg" > }, > { > "time": "2018-12-06 18:04:23.238710", > "event": "reached_pg" > }, > { > "time": "2018-12-06 18:04:23.238776", > "event": "started" > }, > { > "time": "2018-12-06 18:04:23.238890", > "event": "waiting for subops from 7" > }, > { > "time": "2018-12-06 18:04:23.323284", > "event": "sub_op_commit_rec from 7" > }, > { > "time": "2018-12-06 18:04:23.598857", > "event": "op_commit" > }, > { > "time": "2018-12-06 18:04:23.598878", > "event": "commit_sent" > }, > { > "time": "2018-12-06 18:04:23.598888", > "event": "done" > } > ] > } > }, > { > "description": "osd_op(client.447197.1:71987 1.87226b4b 1:d2d644e1:::100004342f9.0000161a:head [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e75)", > "initiated_at": "2018-12-06 18:04:10.875449", > "age": 649.590892, > "duration": 12.723466, > "type_data": { > "flag_point": "commit sent; apply or cleanup", > "client_info": { > "client": "client.447197", > "client_addr": "10.0.30.110:0/3181790720", > "tid": 71987 > }, > "events": [ > { > "time": "2018-12-06 18:04:10.875449", > "event": "initiated" > }, > { > "time": "2018-12-06 18:04:10.875449", > "event": "header_read" > }, > { > "time": "2018-12-06 18:04:10.875450", > "event": "throttled" > }, > { > "time": "2018-12-06 18:04:10.889074", > "event": "all_read" > }, > { > "time": "2018-12-06 18:04:10.889109", > "event": "dispatched" > }, > { > "time": "2018-12-06 18:04:10.889113", > "event": "queued_for_pg" > }, > { > "time": "2018-12-06 18:04:23.323306", > "event": "reached_pg" > }, > { > "time": "2018-12-06 18:04:23.323402", > "event": "started" > }, > { > "time": "2018-12-06 18:04:23.323465", > "event": "waiting for subops from 4" > }, > { > "time": "2018-12-06 18:04:23.412387", > "event": "sub_op_commit_rec from 4" > }, > { > "time": "2018-12-06 18:04:23.598895", > "event": "op_commit" > }, > { > "time": "2018-12-06 18:04:23.598908", > "event": "commit_sent" > }, > { > "time": "2018-12-06 18:04:23.598915", > "event": "done" > } > ] > } > }, > { > "description": "osd_op(client.447197.1:72009 1.6e26a0cd 1:b3056476:::100004342f9.00001621:head [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e75)", > "initiated_at": "2018-12-06 18:04:10.951671", > "age": 649.514669, > "duration": 13.559266, > "type_data": { > "flag_point": "commit sent; apply or cleanup", > "client_info": { > "client": "client.447197", > "client_addr": "10.0.30.110:0/3181790720", > "tid": 72009 > }, > "events": [ > { > "time": "2018-12-06 18:04:10.951671", > "event": "initiated" > }, > { > "time": "2018-12-06 18:04:10.951671", > "event": "header_read" > }, > { > "time": "2018-12-06 18:04:10.951673", > "event": "throttled" > }, > { > "time": "2018-12-06 18:04:10.971212", > "event": "all_read" > }, > { > "time": "2018-12-06 18:04:10.971265", > "event": "dispatched" > }, > { > "time": "2018-12-06 18:04:10.971274", > "event": "queued_for_pg" > }, > { > "time": "2018-12-06 18:04:23.325006", > "event": "reached_pg" > }, > { > "time": "2018-12-06 18:04:23.325073", > "event": "started" > }, > { > "time": "2018-12-06 18:04:23.325120", > "event": "waiting for subops from 6" > }, > { > "time": "2018-12-06 18:04:23.689393", > "event": "op_commit" > }, > { > "time": "2018-12-06 18:04:24.510912", > "event": "sub_op_commit_rec from 6" > }, > { > "time": "2018-12-06 18:04:24.510930", > "event": "commit_sent" > }, > { > "time": "2018-12-06 18:04:24.510937", > "event": "done" > } > ] > } > }, > { > "description": "osd_op(client.447197.1:72027 1.2675011e 1:7880ae64:::100004342fb.00000c83:head [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e75)", > "initiated_at": "2018-12-06 18:04:10.983790", > "age": 649.482551, > "duration": 13.631767, > "type_data": { > "flag_point": "commit sent; apply or cleanup", > "client_info": { > "client": "client.447197", > "client_addr": "10.0.30.110:0/3181790720", > "tid": 72027 > }, > "events": [ > { > "time": "2018-12-06 18:04:10.983790", > "event": "initiated" > }, > { > "time": "2018-12-06 18:04:10.983790", > "event": "header_read" > }, > { > "time": "2018-12-06 18:04:10.983791", > "event": "throttled" > }, > { > "time": "2018-12-06 18:04:10.997441", > "event": "all_read" > }, > { > "time": "2018-12-06 18:04:10.997486", > "event": "dispatched" > }, > { > "time": "2018-12-06 18:04:10.997493", > "event": "queued_for_pg" > }, > { > "time": "2018-12-06 18:04:23.412425", > "event": "reached_pg" > }, > { > "time": "2018-12-06 18:04:23.412492", > "event": "started" > }, > { > "time": "2018-12-06 18:04:23.412539", > "event": "waiting for subops from 6" > }, > { > "time": "2018-12-06 18:04:23.782075", > "event": "op_commit" > }, > { > "time": "2018-12-06 18:04:24.615518", > "event": "sub_op_commit_rec from 6" > }, > { > "time": "2018-12-06 18:04:24.615545", > "event": "commit_sent" > }, > { > "time": "2018-12-06 18:04:24.615556", > "event": "done" > } > ] > } > }, > { > "description": "osd_op(client.447197.1:72047 1.20d53eaf 1:f57cab04:::100004342f9.00001638:head [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e75)", > "initiated_at": "2018-12-06 18:04:10.997499", > "age": 649.468842, > "duration": 12.894479, > "type_data": { > "flag_point": "commit sent; apply or cleanup", > "client_info": { > "client": "client.447197", > "client_addr": "10.0.30.110:0/3181790720", > "tid": 72047 > }, > "events": [ > { > "time": "2018-12-06 18:04:10.997499", > "event": "initiated" > }, > { > "time": "2018-12-06 18:04:10.997499", > "event": "header_read" > }, > { > "time": "2018-12-06 18:04:10.997500", > "event": "throttled" > }, > { > "time": "2018-12-06 18:04:11.010879", > "event": "all_read" > }, > { > "time": "2018-12-06 18:04:11.010920", > "event": "dispatched" > }, > { > "time": "2018-12-06 18:04:11.010925", > "event": "queued_for_pg" > }, > { > "time": "2018-12-06 18:04:23.499523", > "event": "reached_pg" > }, > { > "time": "2018-12-06 18:04:23.499599", > "event": "started" > }, > { > "time": "2018-12-06 18:04:23.499652", > "event": "waiting for subops from 7" > }, > { > "time": "2018-12-06 18:04:23.598911", > "event": "sub_op_commit_rec from 7" > }, > { > "time": "2018-12-06 18:04:23.891906", > "event": "op_commit" > }, > { > "time": "2018-12-06 18:04:23.891960", > "event": "commit_sent" > }, > { > "time": "2018-12-06 18:04:23.891977", > "event": "done" > } > ] > } > }, > { > "description": "osd_op(client.447197.1:72114 1.5c7de55 1:aa7be3a0:::100004342fb.00000cad:head [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e75)", > "initiated_at": "2018-12-06 18:04:11.111490", > "age": 649.354851, > "duration": 12.780517, > "type_data": { > "flag_point": "commit sent; apply or cleanup", > "client_info": { > "client": "client.447197", > "client_addr": "10.0.30.110:0/3181790720", > "tid": 72114 > }, > "events": [ > { > "time": "2018-12-06 18:04:11.111490", > "event": "initiated" > }, > { > "time": "2018-12-06 18:04:11.111490", > "event": "header_read" > }, > { > "time": "2018-12-06 18:04:11.111492", > "event": "throttled" > }, > { > "time": "2018-12-06 18:04:11.127195", > "event": "all_read" > }, > { > "time": "2018-12-06 18:04:11.127236", > "event": "dispatched" > }, > { > "time": "2018-12-06 18:04:11.127242", > "event": "queued_for_pg" > }, > { > "time": "2018-12-06 18:04:23.598920", > "event": "reached_pg" > }, > { > "time": "2018-12-06 18:04:23.598999", > "event": "started" > }, > { > "time": "2018-12-06 18:04:23.599048", > "event": "waiting for subops from 5" > }, > { > "time": "2018-12-06 18:04:23.688752", > "event": "sub_op_commit_rec from 5" > }, > { > "time": "2018-12-06 18:04:23.891986", > "event": "op_commit" > }, > { > "time": "2018-12-06 18:04:23.892001", > "event": "commit_sent" > }, > { > "time": "2018-12-06 18:04:23.892007", > "event": "done" > } > ] > } > }, > { > "description": "osd_op(client.447197.1:72124 1.59941c4b 1:d238299a:::100004342fb.00000cb7:head [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e75)", > "initiated_at": "2018-12-06 18:04:11.127248", > "age": 649.339093, > "duration": 12.962263, > "type_data": { > "flag_point": "commit sent; apply or cleanup", > "client_info": { > "client": "client.447197", > "client_addr": "10.0.30.110:0/3181790720", > "tid": 72124 > }, > "events": [ > { > "time": "2018-12-06 18:04:11.127248", > "event": "initiated" > }, > { > "time": "2018-12-06 18:04:11.127248", > "event": "header_read" > }, > { > "time": "2018-12-06 18:04:11.127250", > "event": "throttled" > }, > { > "time": "2018-12-06 18:04:11.137801", > "event": "all_read" > }, > { > "time": "2018-12-06 18:04:11.137837", > "event": "dispatched" > }, > { > "time": "2018-12-06 18:04:11.137842", > "event": "queued_for_pg" > }, > { > "time": "2018-12-06 18:04:23.600686", > "event": "reached_pg" > }, > { > "time": "2018-12-06 18:04:23.600750", > "event": "started" > }, > { > "time": "2018-12-06 18:04:23.600795", > "event": "waiting for subops from 4" > }, > { > "time": "2018-12-06 18:04:23.688740", > "event": "sub_op_commit_rec from 4" > }, > { > "time": "2018-12-06 18:04:24.089450", > "event": "op_commit" > }, > { > "time": "2018-12-06 18:04:24.089493", > "event": "commit_sent" > }, > { > "time": "2018-12-06 18:04:24.089511", > "event": "done" > } > ] > } > }, > { > "description": "osd_op(client.447197.1:72136 1.4c3df65f 1:fa6fbc32:::100004342f9.00001664:head [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e75)", > "initiated_at": "2018-12-06 18:04:11.161272", > "age": 649.305069, > "duration": 12.928269, > "type_data": { > "flag_point": "commit sent; apply or cleanup", > "client_info": { > "client": "client.447197", > "client_addr": "10.0.30.110:0/3181790720", > "tid": 72136 > }, > "events": [ > { > "time": "2018-12-06 18:04:11.161272", > "event": "initiated" > }, > { > "time": "2018-12-06 18:04:11.161272", > "event": "header_read" > }, > { > "time": "2018-12-06 18:04:11.161273", > "event": "throttled" > }, > { > "time": "2018-12-06 18:04:11.176621", > "event": "all_read" > }, > { > "time": "2018-12-06 18:04:11.176665", > "event": "dispatched" > }, > { > "time": "2018-12-06 18:04:11.176672", > "event": "queued_for_pg" > }, > { > "time": "2018-12-06 18:04:23.688809", > "event": "reached_pg" > }, > { > "time": "2018-12-06 18:04:23.688878", > "event": "started" > }, > { > "time": "2018-12-06 18:04:23.688924", > "event": "waiting for subops from 5" > }, > { > "time": "2018-12-06 18:04:23.781472", > "event": "sub_op_commit_rec from 5" > }, > { > "time": "2018-12-06 18:04:24.089520", > "event": "op_commit" > }, > { > "time": "2018-12-06 18:04:24.089534", > "event": "commit_sent" > }, > { > "time": "2018-12-06 18:04:24.089541", > "event": "done" > } > ] > } > }, > { > "description": "osd_op(client.447197.1:72202 1.de800dcd 1:b3b0017b:::100004342fb.00000cd8:head [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e75)", > "initiated_at": "2018-12-06 18:04:11.235424", > "age": 649.230917, > "duration": 13.494299, > "type_data": { > "flag_point": "commit sent; apply or cleanup", > "client_info": { > "client": "client.447197", > "client_addr": "10.0.30.110:0/3181790720", > "tid": 72202 > }, > "events": [ > { > "time": "2018-12-06 18:04:11.235424", > "event": "initiated" > }, > { > "time": "2018-12-06 18:04:11.235424", > "event": "header_read" > }, > { > "time": "2018-12-06 18:04:11.235425", > "event": "throttled" > }, > { > "time": "2018-12-06 18:04:11.244364", > "event": "all_read" > }, > { > "time": "2018-12-06 18:04:11.244400", > "event": "dispatched" > }, > { > "time": "2018-12-06 18:04:11.244406", > "event": "queued_for_pg" > }, > { > "time": "2018-12-06 18:04:23.690409", > "event": "reached_pg" > }, > { > "time": "2018-12-06 18:04:23.690473", > "event": "started" > }, > { > "time": "2018-12-06 18:04:23.690519", > "event": "waiting for subops from 6" > }, > { > "time": "2018-12-06 18:04:24.175400", > "event": "op_commit" > }, > { > "time": "2018-12-06 18:04:24.729691", > "event": "sub_op_commit_rec from 6" > }, > { > "time": "2018-12-06 18:04:24.729714", > "event": "commit_sent" > }, > { > "time": "2018-12-06 18:04:24.729723", > "event": "done" > } > ] > } > }, > { > "description": "osd_op(client.447197.1:72222 1.bc0cdac3 1:c35b303d:::100004342f9.00001694:head [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e75)", > "initiated_at": "2018-12-06 18:04:11.283182", > "age": 649.183159, > "duration": 12.980151, > "type_data": { > "flag_point": "commit sent; apply or cleanup", > "client_info": { > "client": "client.447197", > "client_addr": "10.0.30.110:0/3181790720", > "tid": 72222 > }, > "events": [ > { > "time": "2018-12-06 18:04:11.283182", > "event": "initiated" > }, > { > "time": "2018-12-06 18:04:11.283182", > "event": "header_read" > }, > { > "time": "2018-12-06 18:04:11.283183", > "event": "throttled" > }, > { > "time": "2018-12-06 18:04:11.290043", > "event": "all_read" > }, > { > "time": "2018-12-06 18:04:11.290087", > "event": "dispatched" > }, > { > "time": "2018-12-06 18:04:11.290095", > "event": "queued_for_pg" > }, > { > "time": "2018-12-06 18:04:23.781478", > "event": "reached_pg" > }, > { > "time": "2018-12-06 18:04:23.781545", > "event": "started" > }, > { > "time": "2018-12-06 18:04:23.781591", > "event": "waiting for subops from 7" > }, > { > "time": "2018-12-06 18:04:23.892055", > "event": "sub_op_commit_rec from 7" > }, > { > "time": "2018-12-06 18:04:24.263304", > "event": "op_commit" > }, > { > "time": "2018-12-06 18:04:24.263325", > "event": "commit_sent" > }, > { > "time": "2018-12-06 18:04:24.263333", > "event": "done" > } > ] > } > }, > { > "description": "osd_op(client.447197.1:72237 1.567a0855 1:aa105e6a:::100004342f9.000016a3:head [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e75)", > "initiated_at": "2018-12-06 18:04:11.305424", > "age": 649.160916, > "duration": 13.052176, > "type_data": { > "flag_point": "commit sent; apply or cleanup", > "client_info": { > "client": "client.447197", > "client_addr": "10.0.30.110:0/3181790720", > "tid": 72237 > }, > "events": [ > { > "time": "2018-12-06 18:04:11.305424", > "event": "initiated" > }, > { > "time": "2018-12-06 18:04:11.305424", > "event": "header_read" > }, > { > "time": "2018-12-06 18:04:11.305426", > "event": "throttled" > }, > { > "time": "2018-12-06 18:04:11.312298", > "event": "all_read" > }, > { > "time": "2018-12-06 18:04:11.312335", > "event": "dispatched" > }, > { > "time": "2018-12-06 18:04:11.312340", > "event": "queued_for_pg" > }, > { > "time": "2018-12-06 18:04:23.892064", > "event": "reached_pg" > }, > { > "time": "2018-12-06 18:04:23.892156", > "event": "started" > }, > { > "time": "2018-12-06 18:04:23.892216", > "event": "waiting for subops from 5" > }, > { > "time": "2018-12-06 18:04:24.089563", > "event": "sub_op_commit_rec from 5" > }, > { > "time": "2018-12-06 18:04:24.357583", > "event": "op_commit" > }, > { > "time": "2018-12-06 18:04:24.357594", > "event": "commit_sent" > }, > { > "time": "2018-12-06 18:04:24.357600", > "event": "done" > } > ] > } > }, > { > "description": "osd_op(client.447197.1:72249 1.960d9955 1:aa99b069:::100004342f9.000016af:head [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e75)", > "initiated_at": "2018-12-06 18:04:11.319524", > "age": 649.146817, > "duration": 13.038095, > "type_data": { > "flag_point": "commit sent; apply or cleanup", > "client_info": { > "client": "client.447197", > "client_addr": "10.0.30.110:0/3181790720", > "tid": 72249 > }, > "events": [ > { > "time": "2018-12-06 18:04:11.319524", > "event": "initiated" > }, > { > "time": "2018-12-06 18:04:11.319524", > "event": "header_read" > }, > { > "time": "2018-12-06 18:04:11.319525", > "event": "throttled" > }, > { > "time": "2018-12-06 18:04:11.327928", > "event": "all_read" > }, > { > "time": "2018-12-06 18:04:11.327967", > "event": "dispatched" > }, > { > "time": "2018-12-06 18:04:11.327973", > "event": "queued_for_pg" > }, > { > "time": "2018-12-06 18:04:24.089569", > "event": "reached_pg" > }, > { > "time": "2018-12-06 18:04:24.089666", > "event": "started" > }, > { > "time": "2018-12-06 18:04:24.089730", > "event": "waiting for subops from 5" > }, > { > "time": "2018-12-06 18:04:24.175469", > "event": "sub_op_commit_rec from 5" > }, > { > "time": "2018-12-06 18:04:24.357605", > "event": "op_commit" > }, > { > "time": "2018-12-06 18:04:24.357614", > "event": "commit_sent" > }, > { > "time": "2018-12-06 18:04:24.357618", > "event": "done" > } > ] > } > }, > { > "description": "osd_op(client.447197.1:72256 1.38b4facd 1:b35f2d1c:::100004342f9.000016b6:head [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e75)", > "initiated_at": "2018-12-06 18:04:11.334694", > "age": 649.131646, > "duration": 13.615766, > "type_data": { > "flag_point": "commit sent; apply or cleanup", > "client_info": { > "client": "client.447197", > "client_addr": "10.0.30.110:0/3181790720", > "tid": 72256 > }, > "events": [ > { > "time": "2018-12-06 18:04:11.334694", > "event": "initiated" > }, > { > "time": "2018-12-06 18:04:11.334694", > "event": "header_read" > }, > { > "time": "2018-12-06 18:04:11.334696", > "event": "throttled" > }, > { > "time": "2018-12-06 18:04:11.340039", > "event": "all_read" > }, > { > "time": "2018-12-06 18:04:11.340076", > "event": "dispatched" > }, > { > "time": "2018-12-06 18:04:11.340082", > "event": "queued_for_pg" > }, > { > "time": "2018-12-06 18:04:24.091372", > "event": "reached_pg" > }, > { > "time": "2018-12-06 18:04:24.091440", > "event": "started" > }, > { > "time": "2018-12-06 18:04:24.091490", > "event": "waiting for subops from 6" > }, > { > "time": "2018-12-06 18:04:24.359630", > "event": "op_commit" > }, > { > "time": "2018-12-06 18:04:24.950426", > "event": "sub_op_commit_rec from 6" > }, > { > "time": "2018-12-06 18:04:24.950451", > "event": "commit_sent" > }, > { > "time": "2018-12-06 18:04:24.950460", > "event": "done" > } > ] > } > }, > { > "description": "osd_op(client.447197.1:75635 1.751883c0 1:03c118ae:::100004342f9.00001a19:head [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e75)", > "initiated_at": "2018-12-06 18:05:16.411107", > "age": 584.055234, > "duration": 10.290467, > "type_data": { > "flag_point": "commit sent; apply or cleanup", > "client_info": { > "client": "client.447197", > "client_addr": "10.0.30.110:0/3181790720", > "tid": 75635 > }, > "events": [ > { > "time": "2018-12-06 18:05:16.411107", > "event": "initiated" > }, > { > "time": "2018-12-06 18:05:16.411107", > "event": "header_read" > }, > { > "time": "2018-12-06 18:05:16.411108", > "event": "throttled" > }, > { > "time": "2018-12-06 18:05:16.416864", > "event": "all_read" > }, > { > "time": "2018-12-06 18:05:16.416903", > "event": "dispatched" > }, > { > "time": "2018-12-06 18:05:16.416909", > "event": "queued_for_pg" > }, > { > "time": "2018-12-06 18:05:19.383296", > "event": "reached_pg" > }, > { > "time": "2018-12-06 18:05:19.383394", > "event": "started" > }, > { > "time": "2018-12-06 18:05:19.383443", > "event": "waiting for subops from 5" > }, > { > "time": "2018-12-06 18:05:19.957547", > "event": "op_commit" > }, > { > "time": "2018-12-06 18:05:26.701524", > "event": "sub_op_commit_rec from 5" > }, > { > "time": "2018-12-06 18:05:26.701557", > "event": "commit_sent" > }, > { > "time": "2018-12-06 18:05:26.701574", > "event": "done" > } > ] > } > } > ] > } > > > — > Best Regards > Li, Ning > > > >> On Dec 6, 2018, at 14:31, Yan, Zheng <ukernel@xxxxxxxxx> wrote: >> >> On Thu, Dec 6, 2018 at 2:24 PM Li,Ning <lining916740672@xxxxxxxxxx> wrote: >>> >>> Hi, >>> >>> Overload on the storage nodes that run OSDs is normal during the whole copy. >>> >>> The memory pressure is on the client side. I think it may not work to turn the dirty_ratio and dirty_background_ratio to small on the ceph storage server side. >>> >>> >>> Anyway,I will have a try. >>> >>> — >>> Best Regards >>> Li, Ning >>> >>> >>> >>>> On Dec 6, 2018, at 11:41, Yan, Zheng <ukernel@xxxxxxxxx> wrote: >>>> >>>> On Wed, Dec 5, 2018 at 2:33 PM NingLi <lining916740672@xxxxxxxxxx> wrote: >>>>> >>>>> Hi all, >>>>> >>>>> We found that some process writing cephfs will hang for a long time (> 120s) when uploading(scp/rsync) large files(totally 50G ~ 300G)to the app node's cephfs mountpoint. >>>>> >>>>> This problem is not always reproduciable. But when this problem occurs, the web(nginx) or some other services can’t be access, it’s embarrassing. >>>>> >>>>> >>>>> Please give some tips. >>>>> >>>>> >>>>> We are using ceph version 13.2.1 and 3.10.0-693.el7.x86_64 kernel. >>>>> >>>>> >>>>> When this problem occurs, the memory pressure is a bit high. >>>>> # free -h >>>>> total used free shared buff/cache available >>>>> Mem: 30G 11G 230M 3.3M 18G 18G >>>>> Swap: 0B 0B 0B >>>>> >>>>> # mpstat -P ALL >>>>> Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle >>>>> Average: all 3.80 0.04 1.46 55.12 0.00 0.08 0.00 0.00 0.00 39.49 >>>>> Average: 0 1.52 0.00 0.51 97.98 0.00 0.00 0.00 0.00 0.00 0.00 >>>>> Average: 1 13.07 0.00 1.51 0.00 0.00 0.00 0.00 0.00 0.00 85.43 >>>>> Average: 2 3.00 0.00 3.00 0.50 0.00 0.00 0.00 0.00 0.00 93.50 >>>>> Average: 3 2.50 0.00 2.00 0.00 0.00 0.00 0.00 0.00 0.00 95.50 >>>>> Average: 4 1.50 0.50 1.50 96.00 0.00 0.50 0.00 0.00 0.00 0.00 >>>>> Average: 5 0.50 0.00 1.01 32.16 0.00 0.00 0.00 0.00 0.00 66.33 >>>>> Average: 6 2.49 0.00 2.49 0.00 0.00 0.50 0.00 0.00 0.00 94.53 >>>>> Average: 7 1.52 0.00 0.51 97.98 0.00 0.00 0.00 0.00 0.00 0.00 >>>>> Average: 8 1.50 0.00 1.50 97.00 0.00 0.00 0.00 0.00 0.00 0.00 >>>>> Average: 9 14.07 0.00 2.01 46.23 0.00 0.00 0.00 0.00 0.00 37.69 >>>>> Average: 10 2.00 0.00 0.50 97.50 0.00 0.00 0.00 0.00 0.00 0.00 >>>>> Average: 11 2.01 0.00 1.01 96.98 0.00 0.00 0.00 0.00 0.00 0.00 >>>>> >>>>> We can see some process in state ‘D’ for a long time. here are 2 fragments in dmesg. >>>>> >>>> >>>> try setting vm.dirty_ration and vm.dirty_background_ration of machines >>>> that run OSDs to smaller value. >>>> >>>>> [Fri Nov 30 15:10:43 2018] INFO: task java:11662 blocked for more than 120 seconds. >>>>> [Fri Nov 30 15:10:43 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>>>> [Fri Nov 30 15:10:43 2018] java D ffffffff816a4180 0 11662 10409 0x00000180 >>>>> [Fri Nov 30 15:10:43 2018] ffff8805f4febbd0 0000000000000082 ffff8807e2edcf10 ffff8805f4febfd8 >>>>> [Fri Nov 30 15:10:43 2018] ffff8805f4febfd8 ffff8805f4febfd8 ffff8807e2edcf10 ffff88086df56cc0 >>>>> [Fri Nov 30 15:10:43 2018] 0000000000000000 7fffffffffffffff ffff88087ffc3be8 ffffffff816a4180 >>>>> [Fri Nov 30 15:10:43 2018] Call Trace: >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff816a4180>] ? bit_wait+0x50/0x50 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff816a6059>] schedule+0x29/0x70 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff816a3b69>] schedule_timeout+0x239/0x2c0 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff81190541>] ? pagevec_lookup_tag+0x21/0x30 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffffc082b598>] ? ceph_writepages_start+0x1298/0x1520 [ceph] >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff810e939c>] ? ktime_get_ts64+0x4c/0xf0 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff816a4180>] ? bit_wait+0x50/0x50 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff816a56dd>] io_schedule_timeout+0xad/0x130 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff816a5778>] io_schedule+0x18/0x20 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff816a4191>] bit_wait_io+0x11/0x50 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff816a3cb5>] __wait_on_bit+0x65/0x90 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff81181cb1>] wait_on_page_bit+0x81/0xa0 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff810b19d0>] ? wake_bit_function+0x40/0x40 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff81181de1>] __filemap_fdatawait_range+0x111/0x190 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff811b0d79>] ? do_numa_page+0x159/0x1e0 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff8118f42e>] ? do_writepages+0x1e/0x40 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff81181e74>] filemap_fdatawait_range+0x14/0x30 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff81183db6>] filemap_write_and_wait_range+0x56/0x90 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffffc083326c>] ceph_fsync+0x6c/0x510 [ceph] >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff81137273>] ? __secure_computing+0x73/0x240 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff8111f5c6>] ? __audit_syscall_exit+0x1e6/0x280 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff81231d85>] do_fsync+0x65/0xa0 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff81232083>] SyS_fdatasync+0x13/0x20 >>>>> [Fri Nov 30 15:10:43 2018] [<ffffffff816b1d4e>] tracesys+0xd9/0xde >>>>> >>>>> >>>>> [Fri Nov 30 15:12:43 2018] INFO: task prometheus:9919 blocked for more than 120 seconds. >>>>> [Fri Nov 30 15:12:43 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>>>> [Fri Nov 30 15:12:43 2018] prometheus D ffff8807e87dbad8 0 9919 9879 0x00000080 >>>>> [Fri Nov 30 15:12:43 2018] ffff8807e87db970 0000000000000082 ffff8803e273cf10 ffff8807e87dbfd8 >>>>> [Fri Nov 30 15:12:43 2018] ffff8807e87dbfd8 ffff8807e87dbfd8 ffff8803e273cf10 ffff88046e716cc0 >>>>> [Fri Nov 30 15:12:43 2018] 0000000000000000 7fffffffffffffff ffffffff816a4180 ffff8807e87dbad8 >>>>> [Fri Nov 30 15:12:43 2018] Call Trace: >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffff816a4180>] ? bit_wait+0x50/0x50 >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffff816a6059>] schedule+0x29/0x70 >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffff816a3b69>] schedule_timeout+0x239/0x2c0 >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffff81090007>] ? local_bh_enable+0x17/0x20 >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffff816a4180>] ? bit_wait+0x50/0x50 >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffff816a56dd>] io_schedule_timeout+0xad/0x130 >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffff816a5778>] io_schedule+0x18/0x20 >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffff816a4191>] bit_wait_io+0x11/0x50 >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffff816a3eaf>] __wait_on_bit_lock+0x5f/0xc0 >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffff81182094>] __lock_page+0x74/0x90 >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffff810b19d0>] ? wake_bit_function+0x40/0x40 >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffff81182a34>] __find_lock_page+0x54/0x70 >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffff811836e2>] grab_cache_page_write_begin+0x62/0xd0 >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffffc082a203>] ceph_write_begin+0x43/0xe0 [ceph] >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffff811824ce>] generic_file_buffered_write+0x11e/0x2a0 >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffffc08259c4>] ceph_aio_write+0xac4/0xc60 [ceph] >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffff8160417a>] ? inet_recvmsg+0x7a/0xa0 >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffff811feb3d>] do_sync_write+0x8d/0xd0 >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffff811ff5fd>] vfs_write+0xbd/0x1e0 >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffff8120040f>] SyS_write+0x7f/0xe0 >>>>> [Fri Nov 30 15:12:43 2018] [<ffffffff816b1b49>] system_call_fastpath+0x16/0x1b >>>>> >>>>> >>>>> But the cephfs status shows healthy. >>>>> # ceph -s >>>>> cluster: >>>>> id: b27bc9a3-2e32-4782-b60f-b7901b0b26e5 >>>>> health: HEALTH_OK >>>>> >>>>> services: >>>>> mon: 2 daemons, quorum skyaxe-storage-0,skyaxe-storage-1 >>>>> mgr: skyaxe-storage-0(active), standbys: skyaxe-storage-1 >>>>> mds: cephfs-1/1/1 up {0=skyaxe-storage-1=up:active}, 1 up:standby >>>>> osd: 8 osds: 8 up, 8 in >>>>> >>>>> data: >>>>> pools: 2 pools, 512 pgs >>>>> objects: 599.7 k objects, 876 GiB >>>>> usage: 2.6 TiB used, 13 TiB / 15 TiB avail >>>>> pgs: 512 active+clean >>>>> >>>>> io: >>>>> client: 3.7 KiB/s wr, 0 op/s rd, 1 op/s wr >>>>> >>>>> >>>>> We can see some unfinished requests in osdc. >>>>> >>>>> # cat /sys/kernel/debug/ceph/xxx/osdc >>>>> REQUESTS 98 homeless 0 >>>>> 8529 osd1 1.1bfdee5 [1,7]/1 [1,7]/1 100000011be.00000000 0x400024 1 0'0 write >>>>> 8532 osd1 1.2d81c338 [1,6]/1 [1,6]/1 100000011c0.00000000 0x400024 1 0'0 write >>>>> 8536 osd1 1.49324f26 [1,4]/1 [1,4]/1 100000011c4.00000000 0x400024 1 0'0 write >>>>> 8542 osd1 1.1403106f [1,7]/1 [1,7]/1 100000011c8.00000000 0x400024 1 0'0 write >>>>> 8549 osd1 1.ea273113 [1,4]/1 [1,4]/1 100000011ce.00000000 0x400024 1 0'0 write >>>>> 8562 osd1 1.f2d68e9a [1,5]/1 [1,5]/1 100000011d8.00000000 0x400024 1 0'0 write >>>>> 8566 osd1 1.1f212283 [1,4]/1 [1,4]/1 100000011db.00000000 0x400024 1 0'0 write >>>>> 8568 osd1 1.2e1984e7 [1,5]/1 [1,5]/1 100000011de.00000000 0x400024 1 0'0 write >>>>> 8576 osd1 1.da7e63d0 [1,5]/1 [1,5]/1 100000011e1.00000000 0x400024 1 0'0 write >>>>> 8581 osd1 1.3b0d280 [1,4]/1 [1,4]/1 100000011e8.00000000 0x400024 1 0'0 write >>>>> 8582 osd1 1.80227f0b [1,5]/1 [1,5]/1 100000011ea.00000000 0x400024 1 0'0 write >>>>> 8593 osd1 1.1efd9391 [1,6]/1 [1,6]/1 100000011ef.00000000 0x400024 1 0'0 write >>>>> 8594 osd1 1.3db6e880 [1,4]/1 [1,4]/1 100000011f0.00000000 0x400024 1 0'0 write >>>>> 8611 osd1 1.6d20c41d [1,4]/1 [1,4]/1 1000000028c.00000000 0x400024 1 0'0 write >>>>> 8612 osd1 1.6d20c41d [1,4]/1 [1,4]/1 1000000028c.00000000 0x400024 1 0'0 write >>>>> 8616 osd1 1.f4d90a4d [1,7]/1 [1,7]/1 10000001206.00000000 0x400024 1 0'0 write >>>>> 8619 osd1 1.740514fe [1,5]/1 [1,5]/1 10000001208.00000000 0x400024 1 0'0 write >>>>> 8630 osd1 1.d5fbabe0 [1,6]/1 [1,6]/1 10000001216.00000000 0x400024 1 0'0 write >>>>> 8631 osd1 1.daf28f83 [1,4]/1 [1,4]/1 10000001217.00000000 0x400024 1 0'0 write >>>>> 8635 osd1 1.d29e01a6 [1,5]/1 [1,5]/1 1000000121a.00000000 0x400024 1 0'0 write >>>>> 8640 osd1 1.2de0ff63 [1,7]/1 [1,7]/1 10000001220.00000000 0x400024 1 0'0 write >>>>> 8658 osd1 1.2f91206f [1,7]/1 [1,7]/1 1000000122c.00000000 0x400064 1 0'0 write,startsync >>>>> ... >>>>> >> >> how long did these osd requests take? If some requests take too long, >> use 'ceph daemon osd.x dump dump_blocked_ops/dump_ops_in_flight' to >> check >> >> >>>> >>>>> Here is our ceph config: >>>>> >>>>> # cat /etc/ceph/ceph.conf >>>>> [global] >>>>> fsid = b27bc9a3-2e32-4782-b60f-b7901b0b26e5 >>>>> mon_initial_members = skyaxe-storage-0, skyaxe-storage-1 >>>>> mon_host = 10.0.30.111,10.0.30.112 >>>>> auth_cluster_required = cephx >>>>> auth_service_required = cephx >>>>> auth_client_required = cephx >>>>> >>>>> osd_pool_default_size = 2 >>>>> public_network = 10.0.30.0/24 >>>>> cluster_network = 10.0.40.0/24 >>>>> osd_pool_default_min_size = 1 >>>>> >>>>> max_open_files = 131072 >>>>> >>>>> [osd] >>>>> osd_journal_size = 56320 >>>>> >>>>> journal_max_write_bytes = 1073741824 >>>>> journal_max_write_entries = 10000 >>>>> journal_queue_max_ops = 50000 >>>>> journal_queue_max_bytes = 10737418240 >>>>> >>>>> osd_mkfs_type = xfs >>>>> osd_mkfs_options_xfs = -f >>>>> osd_mount_options_xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier" >>>>> >>>>> osd_max_write_size = 512 >>>>> osd_client_message_size_cap = 2147483648 >>>>> osd_deep_scrub_stride = 131072 >>>>> osd_op_threads = 8 >>>>> osd_disk_threads = 4 >>>>> osd_map_cache_size = 1024 >>>>> osd_map_cache_bl_size = 128 >>>>> osd_recovery_op_priority = 4 >>>>> osd_recovery_max_active = 10 >>>>> osd_max_backfills = 4 >>>>> >>>>> — >>>>> Best Regards >>>>> Li, Ning >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> ceph-users mailing list >>>>> ceph-users@xxxxxxxxxxxxxx >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > _______________________________________________ > ceph-users mailing list > ceph-users@xxxxxxxxxxxxxx > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com