-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256 We were able to only get ~17Gb out of the XL710 (heavily tweaked) until we went to the 4.x kernel where we got ~36Gb (no tweaking). It seems that there were some major reworks in the network handling in the kernel to efficiently handle that network rate. If I remember right we also saw a drop in CPU utilization. I'm starting to think that we did see packet loss while congesting our ISLs in our initial testing, but we could not tell where the dropping was happening. We saw some on the switches, but it didn't seem to be bad if we weren't trying to congest things. We probably already saw this issue, just didn't know it. - ---------------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson wrote: > FWIW, we've got some 40GbE Intel cards in the community performance cluster > on a Mellanox 40GbE switch that appear (knock on wood) to be running fine > with 3.10.0-229.7.2.el7.x86_64. We did get feedback from Intel that older > drivers might cause problems though. > > Here's ifconfig from one of the nodes: > > ens513f1: flags=4163 mtu 1500 > inet 10.0.10.101 netmask 255.255.255.0 broadcast 10.0.10.255 > inet6 fe80::6a05:caff:fe2b:7ea1 prefixlen 64 scopeid 0x20 > ether 68:05:ca:2b:7e:a1 txqueuelen 1000 (Ethernet) > RX packets 169232242875 bytes 229346261232279 (208.5 TiB) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 153491686361 bytes 203976410836881 (185.5 TiB) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > Mark > > > On 09/23/2015 01:48 PM, Robert LeBlanc wrote: >> >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA256 >> >> OK, here is the update on the saga... >> >> I traced some more of blocked I/Os and it seems that communication >> between two hosts seemed worse than others. I did a two way ping flood >> between the two hosts using max packet sizes (1500). After 1.5M >> packets, no lost pings. Then then had the ping flood running while I >> put Ceph load on the cluster and the dropped pings started increasing >> after stopping the Ceph workload the pings stopped dropping. >> >> I then ran iperf between all the nodes with the same results, so that >> ruled out Ceph to a large degree. I then booted in the the >> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there >> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really >> need the network enhancements in the 4.x series to work well. >> >> Does this sound familiar to anyone? I'll probably start bisecting the >> kernel to see where this issue in introduced. Both of the clusters >> with this issue are running 4.x, other than that, they are pretty >> differing hardware and network configs. >> >> Thanks, >> -----BEGIN PGP SIGNATURE----- >> Version: Mailvelope v1.1.0 >> Comment: https://www.mailvelope.com >> >> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr >> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l >> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V >> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j >> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv >> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW >> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE >> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe >> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok >> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98 >> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX >> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X >> 4OEo >> =P33I >> -----END PGP SIGNATURE----- >> ---------------- >> Robert LeBlanc >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 >> >> >> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc >> wrote: >>> >>> -----BEGIN PGP SIGNED MESSAGE----- >>> Hash: SHA256 >>> >>> This is IPoIB and we have the MTU set to 64K. There was some issues >>> pinging hosts with "No buffer space available" (hosts are currently >>> configured for 4GB to test SSD caching rather than page cache). I >>> found that MTU under 32K worked reliable for ping, but still had the >>> blocked I/O. >>> >>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing >>> the blocked I/O. >>> - ---------------- >>> Robert LeBlanc >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 >>> >>> >>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil wrote: >>>> >>>> On Tue, 22 Sep 2015, Samuel Just wrote: >>>>> >>>>> I looked at the logs, it looks like there was a 53 second delay >>>>> between when osd.17 started sending the osd_repop message and when >>>>> osd.13 started reading it, which is pretty weird. Sage, didn't we >>>>> once see a kernel issue which caused some messages to be mysteriously >>>>> delayed for many 10s of seconds? >>>> >>>> >>>> Every time we have seen this behavior and diagnosed it in the wild it >>>> has >>>> been a network misconfiguration. Usually related to jumbo frames. >>>> >>>> sage >>>> >>>> >>>>> >>>>> What kernel are you running? >>>>> -Sam >>>>> >>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc wrote: >>>>>> >>>>>> -----BEGIN PGP SIGNED MESSAGE----- >>>>>> Hash: SHA256 >>>>>> >>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've >>>>>> extracted what I think are important entries from the logs for the >>>>>> first blocked request. NTP is running all the servers so the logs >>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are >>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz >>>>>> >>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client >>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13 >>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16 >>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17 >>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16 >>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0 >>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec >>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17 >>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13 >>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0 >>>>>> >>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and >>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away, >>>>>> but for some reason osd.13 doesn't get the message until 53 seconds >>>>>> later. osd.17 seems happy to just wait and doesn't resend the data >>>>>> (well, I'm not 100% sure how to tell which entries are the actual data >>>>>> transfer). >>>>>> >>>>>> It looks like osd.17 is receiving responses to start the communication >>>>>> with osd.13, but the op is not acknowledged until almost a minute >>>>>> later. To me it seems that the message is getting received but not >>>>>> passed to another thread right away or something. This test was done >>>>>> with an idle cluster, a single fio client (rbd engine) with a single >>>>>> thread. >>>>>> >>>>>> The OSD servers are almost 100% idle during these blocked I/O >>>>>> requests. I think I'm at the end of my troubleshooting, so I can use >>>>>> some help. >>>>>> >>>>>> Single Test started about >>>>>> 2015-09-22 12:52:36 >>>>>> >>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 : >>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > >>>>>> 30.439150 secs >>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 : >>>>>> cluster [WRN] slow request 30.439150 seconds old, received at >>>>>> 2015-09-22 12:55:06.487451: >>>>>> osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545 >>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write >>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785) >>>>>> currently waiting for subops from 13,16 >>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster >>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for > >>>>>> 30.379680 secs >>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster >>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22 >>>>>> 12:55:06.406303: >>>>>> osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541 >>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write >>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785) >>>>>> currently waiting for subops from 13,17 >>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster >>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22 >>>>>> 12:55:06.318144: >>>>>> osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f >>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write >>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785) >>>>>> currently waiting for subops from 13,14 >>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 : >>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > >>>>>> 30.954212 secs >>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 : >>>>>> cluster [WRN] slow request 30.954212 seconds old, received at >>>>>> 2015-09-22 12:57:33.044003: >>>>>> osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d >>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write >>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785) >>>>>> currently waiting for subops from 16,17 >>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 : >>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > >>>>>> 30.704367 secs >>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 : >>>>>> cluster [WRN] slow request 30.704367 seconds old, received at >>>>>> 2015-09-22 12:57:33.055404: >>>>>> osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e >>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write >>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785) >>>>>> currently waiting for subops from 13,17 >>>>>> >>>>>> Server IP addr OSD >>>>>> nodev - 192.168.55.11 - 12 >>>>>> nodew - 192.168.55.12 - 13 >>>>>> nodex - 192.168.55.13 - 16 >>>>>> nodey - 192.168.55.14 - 17 >>>>>> nodez - 192.168.55.15 - 14 >>>>>> nodezz - 192.168.55.16 - 15 >>>>>> >>>>>> fio job: >>>>>> [rbd-test] >>>>>> readwrite=write >>>>>> blocksize=4M >>>>>> #runtime=60 >>>>>> name=rbd-test >>>>>> #readwrite=randwrite >>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1 >>>>>> #rwmixread=72 >>>>>> #norandommap >>>>>> #size=1T >>>>>> #blocksize=4k >>>>>> ioengine=rbd >>>>>> rbdname=test2 >>>>>> pool=rbd >>>>>> clientname=admin >>>>>> iodepth=8 >>>>>> #numjobs=4 >>>>>> #thread >>>>>> #group_reporting >>>>>> #time_based >>>>>> #direct=1 >>>>>> #ramp_time=60 >>>>>> >>>>>> >>>>>> Thanks, >>>>>> -----BEGIN PGP SIGNATURE----- >>>>>> Version: Mailvelope v1.1.0 >>>>>> Comment: https://www.mailvelope.com >>>>>> >>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z >>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu >>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl >>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB >>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7 >>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF >>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI >>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn >>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq >>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL >>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO >>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB >>>>>> J3hS >>>>>> =0J7F >>>>>> -----END PGP SIGNATURE----- >>>>>> ---------------- >>>>>> Robert LeBlanc >>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 >>>>>> >>>>>> >>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum wrote: >>>>>>> >>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc wrote: >>>>>>>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE----- >>>>>>>> Hash: SHA256 >>>>>>>> >>>>>>>> Is there some way to tell in the logs that this is happening? >>>>>>> >>>>>>> >>>>>>> You can search for the (mangled) name _split_collection >>>>>>>> >>>>>>>> I'm not >>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to >>>>>>>> prevent the splitting? Is there a negative side effect to doing so? >>>>>>> >>>>>>> >>>>>>> Bump up the split and merge thresholds. You can search the list for >>>>>>> this, it was discussed not too long ago. >>>>>>> >>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions >>>>>>>> are aborted, they are reestablished and complete immediately. >>>>>>>> >>>>>>>> The fio test is just a seq write, starting it over (rewriting from >>>>>>>> the >>>>>>>> beginning) is still causing the issue. I was suspect that it is not >>>>>>>> having to create new file and therefore split collections. This is >>>>>>>> on >>>>>>>> my test cluster with no other load. >>>>>>> >>>>>>> >>>>>>> Hmm, that does make it seem less likely if you're really not creating >>>>>>> new objects, if you're actually running fio in such a way that it's >>>>>>> not allocating new FS blocks (this is probably hard to set up?). >>>>>>> >>>>>>>> >>>>>>>> I'll be doing a lot of testing today. Which log options and depths >>>>>>>> would be the most helpful for tracking this issue down? >>>>>>> >>>>>>> >>>>>>> If you want to go log diving "debug osd = 20", "debug filestore = >>>>>>> 20", >>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit >>>>>>> out >>>>>>> everything you need to track exactly what each Op is doing. >>>>>>> -Greg >>>>>> >>>>>> -- >>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" >>>>>> in >>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx >>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>> >>>>> >>>>> >>> >>> -----BEGIN PGP SIGNATURE----- >>> Version: Mailvelope v1.1.0 >>> Comment: https://www.mailvelope.com >>> >>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG >>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf >>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV >>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc >>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B >>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz >>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj >>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx >>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7 >>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H >>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg >>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp >>> gcZm >>> =CjwB >>> -----END PGP SIGNATURE----- >> >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majordomo@xxxxxxxxxxxxxxx >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > -----BEGIN PGP SIGNATURE----- Version: Mailvelope v1.1.0 Comment: https://www.mailvelope.com wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7 S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2 lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0 dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo ae22 =AX+L -----END PGP SIGNATURE----- _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com