OK, looping in ceph-devel to see if I can get some more eyes on this. I've
extracted what I think are the important entries from the logs for the first
blocked request. NTP is running on all the servers, so the timestamps should
be closely aligned. Logs for 12:50 to 13:00 are available at
http://162.144.87.113/files/ceph_block_io.logs.tar.xz

2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0

In the logs I can see that osd.17 dispatches the I/O to osd.13 and osd.16
almost simultaneously. osd.16 gets the I/O right away, but for some reason
osd.13 doesn't get the message until 53 seconds later. osd.17 seems happy to
just wait and doesn't resend the data (though I'm not 100% sure how to tell
which entries are the actual data transfer). It looks like osd.17 is receiving
responses that start the communication with osd.13, but the op is not
acknowledged until almost a minute later. To me it seems the message is being
received but not handed off to another thread right away, or something along
those lines.

This test was done on an idle cluster, with a single fio client (rbd engine)
running a single thread. The OSD servers are almost 100% idle during these
blocked I/O requests. I think I'm at the end of my troubleshooting, so I could
use some help.
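Just to put a number on the gap, a quick Python check over the two timestamps
from the timeline above (nothing fancy, straight datetime arithmetic):

from datetime import datetime

# Timestamps copied from the timeline above (all hosts are NTP-synced).
fmt = "%Y-%m-%d %H:%M:%S.%f"
submitted = datetime.strptime("2015-09-22 12:55:06.557160", fmt)  # osd.17 submits I/O to osd.13
received = datetime.strptime("2015-09-22 12:55:59.790591", fmt)   # osd.13 finally gets the I/O

# Prints 53.233431 -- the time the op sits somewhere before osd.13 processes it.
print((received - submitted).total_seconds())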
Single test started about 2015-09-22 12:52:36

2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.439150 secs
2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 : cluster [WRN] slow request 30.439150 seconds old, received at 2015-09-22 12:55:06.487451: osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785) currently waiting for subops from 13,16
2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster [WRN] 2 slow requests, 2 included below; oldest blocked for > 30.379680 secs
2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster [WRN] slow request 30.291520 seconds old, received at 2015-09-22 12:55:06.406303: osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785) currently waiting for subops from 13,17
2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster [WRN] slow request 30.379680 seconds old, received at 2015-09-22 12:55:06.318144: osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785) currently waiting for subops from 13,14
2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.954212 secs
2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 : cluster [WRN] slow request 30.954212 seconds old, received at 2015-09-22 12:57:33.044003: osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785) currently waiting for subops from 16,17
2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.704367 secs
2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 : cluster [WRN] slow request 30.704367 seconds old, received at 2015-09-22 12:57:33.055404: osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785) currently waiting for subops from 13,17

Server   IP addr         OSD
nodev  - 192.168.55.11 - 12
nodew  - 192.168.55.12 - 13
nodex  - 192.168.55.13 - 16
nodey  - 192.168.55.14 - 17
nodez  - 192.168.55.15 - 14
nodezz - 192.168.55.16 - 15

fio job:

[rbd-test]
readwrite=write
blocksize=4M
#runtime=60
name=rbd-test
#readwrite=randwrite
#bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
#rwmixread=72
#norandommap
#size=1T
#blocksize=4k
ioengine=rbd
rbdname=test2
pool=rbd
clientname=admin
iodepth=8
#numjobs=4
#thread
#group_reporting
#time_based
#direct=1
#ramp_time=60
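For anyone digging through the linked logs, a rough Python sketch for pulling
out the slow-request warnings and which subops they're waiting on (it assumes
the cluster log lines keep the exact format shown above):

import re
import sys

# Matches the "[WRN] slow request ..." cluster log lines shown above.
pat = re.compile(
    r"(?P<ts>\S+ \S+) (?P<osd>osd\.\d+) \S+ \d+ : cluster \[WRN\] "
    r"slow request (?P<age>[\d.]+) seconds old, received at (?P<recv>\S+ \S+): "
    r".*currently waiting for subops from (?P<subops>[\d,]+)"
)

with open(sys.argv[1]) as f:  # e.g. the cluster log from the tarball linked above
    for line in f:
        m = pat.search(line)
        if m:
            print("%s blocked %ss on subops %s (op received %s)"
                  % (m.group("osd"), m.group("age"), m.group("subops"), m.group("recv")))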
Thanks,

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1


On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>> Is there some way to tell in the logs that this is happening?
>
> You can search for the (mangled) name _split_collection
>
>> I'm not seeing much I/O, CPU usage during these times. Is there some way
>> to prevent the splitting? Is there a negative side effect to doing so?
>
> Bump up the split and merge thresholds. You can search the list for
> this, it was discussed not too long ago.
>
>> We've had I/O block for over 900 seconds and as soon as the sessions
>> are aborted, they are reestablished and complete immediately.
>>
>> The fio test is just a seq write, starting it over (rewriting from the
>> beginning) is still causing the issue. I was suspect that it is not
>> having to create new file and therefore split collections. This is on
>> my test cluster with no other load.
>
> Hmm, that does make it seem less likely if you're really not creating
> new objects, if you're actually running fio in such a way that it's
> not allocating new FS blocks (this is probably hard to set up?).
>
>> I'll be doing a lot of testing today. Which log options and depths
>> would be the most helpful for tracking this issue down?
>
> If you want to go log diving "debug osd = 20", "debug filestore = 20",
> "debug ms = 1" are what the OSD guys like to see. That should spit out
> everything you need to track exactly what each Op is doing.
> -Greg
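A minimal sketch of the search Greg suggests (the OSD log path below is just
an example; point it at whichever OSD you're investigating):

# Look for collection splits in an OSD log, per Greg's suggestion.
# The path is an example only -- substitute the OSD you're checking.
log_path = "/var/log/ceph/ceph-osd.13.log"

with open(log_path) as f:
    for line in f:
        if "_split_collection" in line:
            # Print the full line so the split timestamps can be lined up
            # against the blocked requests above.
            print(line.rstrip())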