Sagi,

Here is an example of the different types of tests. This was only on one
kernel. The first two runs set a baseline. The lines starting with "buffer"
use fio with direct=0, the lines starting with "direct" use fio with
direct=1, and the lines starting with "block" use fio against a raw block
device (technically 40 partitions on a single drive) with direct=0. I also
reduced the tests to only one path per port instead of four like before.
(A rough sketch of the three fio variants is at the bottom of this mail.)

# /root/run_path_tests.sh check-paths
#### Test all iSER paths individually ####
4.5.0-rc5-5adabdd1-00023-g5adabdd
buffer;sdc;10.218.128.17;3815778;953944;21984
buffer;sdd;10.219.128.17;3743744;935936;22407
buffer;sde;10.220.128.17;4915392;1228848;17066
direct;sdc;10.218.128.17;876644;219161;95690
direct;sdd;10.219.128.17;881684;220421;95143
direct;sde;10.220.128.17;892215;223053;94020
block;sdc;10.218.128.17;3890459;972614;21562
block;sdd;10.219.128.17;4127642;1031910;20323
block;sde;10.220.128.17;4939705;1234926;16982

# /root/run_path_tests.sh check-paths
#### Test all iSER paths individually ####
4.5.0-rc5-5adabdd1-00023-g5adabdd
buffer;sdc;10.218.128.17;3983572;995893;21058
buffer;sdd;10.219.128.17;3774231;943557;22226
buffer;sde;10.220.128.17;4856204;1214051;17274
direct;sdc;10.218.128.17;875820;218955;95780
direct;sdd;10.219.128.17;884072;221018;94886
direct;sde;10.220.128.17;902486;225621;92950
block;sdc;10.218.128.17;3790433;947608;22131
block;sdd;10.219.128.17;3860025;965006;21732
block;sde;10.220.128.17;4946404;1236601;16959

For the following test, I set the IRQs on the initiator using
mlx_tune -p HIGH_THROUGHPUT with irqbalance disabled.

# /root/run_path_tests.sh check-paths
#### Test all iSER paths individually ####
4.5.0-rc5-5adabdd1-00023-g5adabdd
buffer;sdc;10.218.128.17;3742742;935685;22413
buffer;sdd;10.219.128.17;3786327;946581;22155
buffer;sde;10.220.128.17;5009619;1252404;16745
direct;sdc;10.218.128.17;871942;217985;96206
direct;sdd;10.219.128.17;883467;220866;94951
direct;sde;10.220.128.17;901138;225284;93089
block;sdc;10.218.128.17;3911319;977829;21447
block;sdd;10.219.128.17;3758168;939542;22321
block;sde;10.220.128.17;4968377;1242094;16884

For the following test, I also set the IRQs on the target using
mlx_tune -p HIGH_THROUGHPUT and disabled irqbalance.

# /root/run_path_tests.sh check-paths
#### Test all iSER paths individually ####
4.5.0-rc5-5adabdd1-00023-g5adabdd
buffer;sdc;10.218.128.17;3804357;951089;22050
buffer;sdd;10.219.128.17;3767113;941778;22268
buffer;sde;10.220.128.17;4966612;1241653;16890
direct;sdc;10.218.128.17;879742;219935;95353
direct;sdd;10.219.128.17;886641;221660;94611
direct;sde;10.220.128.17;886857;221714;94588
block;sdc;10.218.128.17;3760864;940216;22305
block;sdd;10.219.128.17;3763564;940891;22289
block;sde;10.220.128.17;4965436;1241359;16894

It seems that mlx_tune helps marginally, but it isn't providing anything
groundbreaking. (A sketch of the by-hand IRQ pinning and nomerges setting
discussed in this thread is also at the bottom of this mail.)
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1


On Wed, Jun 22, 2016 at 11:46 AM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> Sagi,
>
> Yes, you are understanding the data correctly and what I'm seeing. I
> think you are also seeing the confusion that I've been running into
> trying to figure this out as well. As far as your questions about SRP,
> the performance data is from the initiator and the CPU info is from
> the target (all fio threads on the initiator were low CPU
> utilization).
>
> I spent a good day tweaking the IRQ assignments (spreading IRQs to all
> cores, spreading to all cores on the NUMA node the card is attached to,
> and spreading to all non-hyperthreaded cores on the NUMA node). None of
> these provided any substantial gains or detriments (irqbalance was not
> running). I don't know if there is IRQ steering going on, but in some
> cases, with irqbalance not running, the IRQs would get pinned back to
> the previous core(s) and I'd have to set them again. I did not use the
> Mellanox scripts, I just did it by hand based on the documents/scripts.
> I also offlined all cores on the second NUMA node, which didn't help
> either. I got more performance gains with nomerges (1 or 2 provided
> about the same gain, 2 slightly more) and the queue settings. It seems
> that something in 1aaa57f5 was going right, as both cards performed
> very well without needing any IRQ fudging.
>
> I understand that there are many moving parts to try to figure out
> here; it could be anywhere in the IB drivers, LIO, or even the SCSI
> subsystems, the RAM disk implementation, or the file system. However,
> since the performance is bouncing between cards, it seems unlikely to
> be something very common (except when both cards show a loss/gain),
> but as you mentioned, there doesn't seem to be any rhyme or reason to
> the shifts.
>
> I haven't been using the raw block device in these tests. Before, when
> I did, once one thread had read the data, any other thread reading the
> same block got it from cache, invalidating the test. I could only
> saturate the path/port with highly threaded jobs, so I may have to
> partition out the disk for block testing. When I ran the tests using
> direct I/O, the performance was far lower and it was harder for me to
> know when I was reaching the theoretical max of the card/links/PCIe. I
> may just have my scripts run the three tests in succession.
>
> Thanks for looking at this. Please let me know what you think would be
> most helpful so that I'm making the best use of your and my time.
>
> Thanks,
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
>
> On Wed, Jun 22, 2016 at 10:21 AM, Sagi Grimberg <sagi@xxxxxxxxxxxx> wrote:
>> Let me see if I get this correct:
>>
>>> 4.5.0_rc3_1aaa57f5_00399
>>>
>>> sdc;10.218.128.17;4627942;1156985;18126
>>> sdf;10.218.202.17;4590963;1147740;18272
>>> sdk;10.218.203.17;4564980;1141245;18376
>>> sdn;10.218.204.17;4571946;1142986;18348
>>> sdd;10.219.128.17;4591717;1147929;18269
>>> sdi;10.219.202.17;4505644;1126411;18618
>>> sdg;10.219.203.17;4562001;1140500;18388
>>> sdl;10.219.204.17;4583187;1145796;18303
>>> sde;10.220.128.17;5511568;1377892;15220
>>> sdh;10.220.202.17;5515555;1378888;15209
>>> sdj;10.220.203.17;5609983;1402495;14953
>>> sdm;10.220.204.17;5509035;1377258;15227
>>
>> In 1aaa57f5 you get on CIB ~115K IOPs per sd device
>> and on CX3 you get around 140K IOPs per sd device.
>>
>>> mlx5_0;sde;3593013;898253;23347 100% CPU kworker/u69:2
>>> mlx5_0;sdd;3588555;897138;23376 100% CPU kworker/u69:2
>>> mlx4_0;sdc;3525662;881415;23793 100% CPU kworker/u68:0
>>
>> Is this on the host or the target?
>>
>>> 4.5.0_rc5_7861728d_00001
>>> sdc;10.218.128.17;3747591;936897;22384
>>> sdf;10.218.202.17;3750607;937651;22366
>>> sdh;10.218.203.17;3750439;937609;22367
>>> sdn;10.218.204.17;3771008;942752;22245
>>> sde;10.219.128.17;3867678;966919;21689
>>> sdg;10.219.202.17;3781889;945472;22181
>>> sdk;10.219.203.17;3791804;947951;22123
>>> sdl;10.219.204.17;3795406;948851;22102
>>> sdd;10.220.128.17;5039110;1259777;16647
>>> sdi;10.220.202.17;4992921;1248230;16801
>>> sdj;10.220.203.17;5015610;1253902;16725
>>> sdm;10.220.204.17;5087087;1271771;16490
>>
>> In 7861728d you get on CIB ~95K IOPs per sd device
>> and on CX3 you get around 125K IOPs per sd device.
>>
>> I don't see any difference in the code around iser/isert;
>> in fact, I don't see any commit in drivers/infiniband.
>>
>>> mlx5_0;sde;2930722;732680;28623 ~98% CPU kworker/u69:0
>>> mlx5_0;sdd;2910891;727722;28818 ~98% CPU kworker/u69:0
>>> mlx4_0;sdc;3263668;815917;25703 ~98% CPU kworker/u68:0
>>
>> Again, host or target?
>>
>>> 4.5.0_rc5_f81bf458_00018
>>> sdb;10.218.128.17;5023720;1255930;16698
>>> sde;10.218.202.17;5016809;1254202;16721
>>> sdj;10.218.203.17;5021915;1255478;16704
>>> sdk;10.218.204.17;5021314;1255328;16706
>>> sdc;10.219.128.17;4984318;1246079;16830
>>> sdf;10.219.202.17;4986096;1246524;16824
>>> sdh;10.219.203.17;5043958;1260989;16631
>>> sdm;10.219.204.17;5032460;1258115;16669
>>> sdd;10.220.128.17;3736740;934185;22449
>>> sdg;10.220.202.17;3728767;932191;22497
>>> sdi;10.220.203.17;3752117;938029;22357
>>> sdl;10.220.204.17;3763901;940975;22287
>>
>> In f81bf458 you get on CIB ~125K IOPs per sd device
>> and on CX3 you get around 93K IOPs per sd device, which
>> is the other way around? CIB is better than CX3?
>>
>> The commits in this gap are:
>> f81bf458208e iser-target: Separate flows for np listeners and connections cma events
>> aea92980601f iser-target: Add new state ISER_CONN_BOUND to isert_conn
>> b89a7c25462b iser-target: Fix identification of login rx descriptor type
>>
>> None of those should affect the data-path.
>>
>>> Srpt keeps crashing, couldn't test.
>>>
>>> 4.5.0_rc5_5adabdd1_00023
>>> sdc;10.218.128.17;3726448;931612;22511 ~97% CPU kworker/u69:4
>>> sdf;10.218.202.17;3750271;937567;22368
>>> sdi;10.218.203.17;3749266;937316;22374
>>> sdj;10.218.204.17;3798844;949711;22082
>>> sde;10.219.128.17;3759852;939963;22311 ~97% CPU kworker/u69:4
>>> sdg;10.219.202.17;3772534;943133;22236
>>> sdl;10.219.203.17;3769483;942370;22254
>>> sdn;10.219.204.17;3790604;947651;22130
>>> sdd;10.220.128.17;5171130;1292782;16222 ~96% CPU kworker/u68:3
>>> sdh;10.220.202.17;5105354;1276338;16431
>>> sdk;10.220.203.17;4995300;1248825;16793
>>> sdm;10.220.204.17;4959564;1239891;16914
>>
>> In 5adabdd1 you get on CIB ~94K IOPs per sd device
>> and on CX3 you get around 130K IOPs per sd device,
>> which means you flipped again (very strange).
>>
>> The commits in this gap are:
>> 5adabdd122e4 iser-target: Split and properly type the login buffer
>> ed1083b251f0 iser-target: Remove ISER_RECV_DATA_SEG_LEN
>> 26c7b673db57 iser-target: Remove impossible condition from isert_wait_conn
>> 69c48846f1c7 iser-target: Remove redundant wait in release_conn
>> 6d1fba0c2cc7 iser-target: Rework connection termination
>>
>> Again, none are suspected to implicate the data-plane.
>>
>>> Srpt crashes.
>>>
>>> 4.5.0_rc5_07b63196_00027
>>> sdb;10.218.128.17;3606142;901535;23262
>>> sdg;10.218.202.17;3570988;892747;23491
>>> sdf;10.218.203.17;3576011;894002;23458
>>> sdk;10.218.204.17;3558113;889528;23576
>>> sdc;10.219.128.17;3577384;894346;23449
>>> sde;10.219.202.17;3575401;893850;23462
>>> sdj;10.219.203.17;3567798;891949;23512
>>> sdl;10.219.204.17;3584262;896065;23404
>>> sdd;10.220.128.17;4430680;1107670;18933
>>> sdh;10.220.202.17;4488286;1122071;18690
>>> sdi;10.220.203.17;4487326;1121831;18694
>>> sdm;10.220.204.17;4441236;1110309;18888
>>
>> In 07b63196 you get on CIB ~89K IOPs per sd device
>> and on CX3 you get around 112K IOPs per sd device.
>>
>> The commits in this gap are:
>> e3416ab2d156 iser-target: Kill the ->isert_cmd back pointer in struct iser_tx_desc
>> d1ca2ed7dcf8 iser-target: Kill struct isert_rdma_wr
>> 9679cc51eb13 iser-target: Convert to new CQ API
>>
>> These do affect the data-path, but nothing that can explain
>> a specific CIB issue. Moreover, the perf drop happened before that.
>>
>>> Srpt crashes.
>>>
>>> 4.5.0_rc5_5e47f198_00036
>>> sdb;10.218.128.17;3519597;879899;23834
>>> sdi;10.218.202.17;3512229;878057;23884
>>> sdh;10.218.203.17;3518563;879640;23841
>>> sdk;10.218.204.17;3582119;895529;23418
>>> sdd;10.219.128.17;3550883;887720;23624
>>> sdj;10.219.202.17;3558415;889603;23574
>>> sde;10.219.203.17;3552086;888021;23616
>>> sdl;10.219.204.17;3579521;894880;23435
>>> sdc;10.220.128.17;4532912;1133228;18506
>>> sdf;10.220.202.17;4558035;1139508;18404
>>> sdg;10.220.203.17;4601035;1150258;18232
>>> sdm;10.220.204.17;4548150;1137037;18444
>>
>> Same results, and no commits were added, so that makes sense.
>>
>>> srpt crashes.
>>>
>>> 4.6.2 vanilla default config
>>> sde;10.218.128.17;3431063;857765;24449
>>> sdf;10.218.202.17;3360685;840171;24961
>>> sdi;10.218.203.17;3355174;838793;25002
>>> sdm;10.218.204.17;3360955;840238;24959
>>> sdd;10.219.128.17;3337288;834322;25136
>>> sdh;10.219.202.17;3327492;831873;25210
>>> sdj;10.219.203.17;3380867;845216;24812
>>> sdk;10.219.204.17;3418340;854585;24540
>>> sdc;10.220.128.17;4668377;1167094;17969
>>> sdg;10.220.202.17;4716675;1179168;17785
>>> sdl;10.220.203.17;4675663;1168915;17941
>>> sdn;10.220.204.17;4631519;1157879;18112
>>>
>>> mlx5_0;sde;3390021;847505;24745 ~98% CPU kworker/u69:3
>>> mlx5_0;sdd;3207512;801878;26153 ~98% CPU kworker/u69:3
>>> mlx4_0;sdc;2998072;749518;27980 ~98% CPU kworker/u68:0
>>>
>>> 4.7.0_rc3_5edb5649
>>> sdc;10.218.128.17;3260244;815061;25730
>>> sdg;10.218.202.17;3405988;851497;24629
>>> sdh;10.218.203.17;3307419;826854;25363
>>> sdm;10.218.204.17;3430502;857625;24453
>>> sdi;10.219.128.17;3544282;886070;23668
>>> sdj;10.219.202.17;3412083;853020;24585
>>> sdk;10.219.203.17;3422385;855596;24511
>>> sdl;10.219.204.17;3444164;861041;24356
>>> sdb;10.220.128.17;4803646;1200911;17463
>>> sdd;10.220.202.17;4832982;1208245;17357
>>> sde;10.220.203.17;4809430;1202357;17442
>>> sdf;10.220.204.17;4808878;1202219;17444
>>
>> Here there is a new rdma_rw API, which doesn't
>> make a difference in performance (but no improvement
>> either).
>>
>> ------------------
>> So all in all, I still don't know what the root cause
>> could be here.
>>
>> You mentioned that you are running fio over a filesystem. Is
>> it possible to run your tests directly over the block devices? And
>> can you run fio with direct I/O?
>>
>> Also, iSER, SRP, and other RDMA ULPs are usually sensitive to
>> the IRQ assignments of the HCA.
>> An incorrect IRQ affinity assignment
>> might bring all sorts of noise into performance tests. The normal
>> practice to get the most out of the HCA is usually to spread the
>> IRQ assignments linearly across all CPUs
>> (https://community.mellanox.com/docs/DOC-1483).
>> Did you perform any steps to spread the IRQs? Is the irqbalance
>> daemon on?
>>
>> It would be good to try to isolate the drop and make sure it
>> is real and not randomly generated by some noise in the form of
>> IRQ assignments.
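
P.S. Since the question of exactly how the tests are run keeps coming up,
here is roughly what the three variants in run_path_tests.sh boil down to.
The 4k block size matches the bandwidth/IOPS ratio in the results above,
but the rest of the option values (file paths, rw pattern, iodepth,
numjobs, runtime) are illustrative guesses, not the exact ones from my
script; only direct= and the target change between the variants.

  # "buffer": file on a filesystem, buffered (goes through the page cache)
  fio --name=buffer --filename=/mnt/sdc/testfile --rw=randread --bs=4k \
      --ioengine=libaio --iodepth=32 --numjobs=40 --runtime=60 \
      --time_based --group_reporting --direct=0

  # "direct": same file, but bypassing the page cache
  fio --name=direct --filename=/mnt/sdc/testfile --rw=randread --bs=4k \
      --ioengine=libaio --iodepth=32 --numjobs=40 --runtime=60 \
      --time_based --group_reporting --direct=1

  # "block": raw partitions, buffered (the real script spreads the jobs
  # over 40 partitions; a single partition is shown here for brevity)
  fio --name=block --filename=/dev/sdc1 --rw=randread --bs=4k \
      --ioengine=libaio --iodepth=32 --numjobs=40 --runtime=60 \
      --time_based --group_reporting --direct=0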
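
P.S. #2: For the manual IRQ spreading and the nomerges setting mentioned
earlier in the thread, the by-hand version was along these lines. This is
only a sketch: the grep pattern, the number of CPUs, and the affinity
masks (which need comma grouping above 32 CPUs) depend on the machine,
and the Mellanox scripts / mlx_tune do essentially the same kind of
pinning.

  # Stop irqbalance so it does not move the IRQs back (assumes systemd).
  systemctl stop irqbalance

  # Pin each of the HCA's IRQs (every /proc/interrupts line matching
  # "mlx5") to the next CPU in turn; run as root, and match "mlx4"
  # instead for the ConnectX-3 card.
  cpu=0
  for irq in $(awk -F: '/mlx5/ {print $1}' /proc/interrupts); do
      printf '%x\n' $((1 << cpu)) > /proc/irq/$irq/smp_affinity
      cpu=$((cpu + 1))
  done

  # Reduce/disable request merging on the iSER block devices
  # (1 and 2 gave about the same gain; sdc shown as an example).
  echo 2 > /sys/block/sdc/queue/nomerges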