Re: Connect-IB not performing as well as ConnectX-3 with iSER

Sagi,

Yes, you're reading the data correctly, and that matches what I'm
seeing. I think you're also running into the same confusion I've had
while trying to figure this out. As for your questions about SRP: the
performance data is from the initiator and the CPU info is from the
target (all the fio threads on the initiator showed low CPU
utilization).

I spent a good day tweaking the IRQ assignments (spreading the IRQs
across all cores, across all cores on the NUMA node the card is
attached to, and across only the non-hyperthreaded cores on that NUMA
node). None of these provided any substantial gain or loss (irqbalance
was not running). I don't know if there is some IRQ steering going on,
but in some cases, even with irqbalance stopped, the IRQs would get
pinned back to the previous core(s) and I'd have to set them again. I
did not use the Mellanox scripts; I set the affinities by hand based
on their documents/scripts (a rough sketch is below). I also offlined
all the cores on the second NUMA node, which didn't help either. I got
more of a gain from nomerges (1 and 2 provided about the same gain, 2
slightly more) and from the queue settings. Something in 1aaa57f5 was
clearly going right, as both cards performed very well there without
any IRQ fudging.
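
For reference, here is a rough sketch (in Python) of the kind of
by-hand spreading I did; the HCA name pattern, the CPU list, and the
device names are placeholders rather than the exact values from my
setup, so treat it as illustration only:

# Rough sketch (run as root): spread the HCA's IRQs round-robin over a
# chosen set of cores and turn off request merging on the test devices.
# Assumes irqbalance is stopped so the masks aren't rewritten later.
import re

def hca_irqs(pattern="mlx5"):
    """Return the IRQ numbers whose /proc/interrupts line mentions the HCA."""
    irqs = []
    with open("/proc/interrupts") as f:
        for line in f:
            if pattern in line:
                m = re.match(r"\s*(\d+):", line)
                if m:
                    irqs.append(int(m.group(1)))
    return irqs

def spread_irqs(irqs, cpus):
    """Pin each IRQ to a single CPU, round-robin across 'cpus'."""
    for i, irq in enumerate(irqs):
        mask = 1 << cpus[i % len(cpus)]
        with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
            f.write(format(mask, "x"))  # hex cpumask with one bit set

def set_nomerges(devices, value=2):
    """Set the request-merging policy (0, 1 or 2) on the block devices."""
    for dev in devices:
        with open(f"/sys/block/{dev}/queue/nomerges", "w") as f:
            f.write(str(value))

# Example: cores 0-13 on the NUMA node the card is attached to
# spread_irqs(hca_irqs("mlx5"), cpus=list(range(14)))
# set_nomerges(["sdc", "sdd", "sde"], value=2)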

I understand there are many moving parts to sort through here; the
problem could be anywhere in the IB drivers, LIO, or even the SCSI
subsystem, the RAM disk implementation, or the file system. However,
since the performance keeps bouncing back and forth between the two
cards, it seems unlikely to be something common to both (except where
both cards show a loss or gain), but as you mentioned, there doesn't
seem to be any rhyme or reason to the shifts.

I haven't been using the raw block device in these tests. When I did
before, once one thread had read the data, any other thread reading
the same blocks was served from the page cache, which invalidated the
test; I could only saturate the path/port with highly threaded jobs,
so I may have to partition the disk for block testing. When I ran the
tests with direct I/O the performance was far lower, and it was harder
for me to tell when I was approaching the theoretical max of the
card/links/PCIe. I may just have my scripts run the three tests in
succession (sketched below).
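
Something along these lines (hypothetical job parameters, mount point
and device names, just to illustrate the idea) is what I have in mind
for running the three variants back to back:

# Rough sketch: run the buffered-filesystem, buffered-block-device and
# direct-I/O-block-device fio cases in succession with one common job.
import subprocess

COMMON = ["fio", "--name=iser-test", "--rw=read", "--bs=4k",
          "--numjobs=16", "--iodepth=32", "--runtime=60",
          "--time_based", "--group_reporting", "--size=4G"]

def run(extra):
    subprocess.run(COMMON + extra, check=True)

# 1) buffered I/O over the file system (what the numbers above came from)
run(["--directory=/mnt/test"])
# 2) raw block device, buffered (the page cache can serve re-reads
#    across threads, which is what invalidated my earlier runs)
run(["--filename=/dev/sdc"])
# 3) raw block device with direct I/O, bypassing the page cache
run(["--filename=/dev/sdc", "--direct=1"])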

Thanks for looking at this. Please let me know what you think would be
most helpful so that I'm making the best use of your time and mine.

Thanks,
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Jun 22, 2016 at 10:21 AM, Sagi Grimberg <sagi@xxxxxxxxxxxx> wrote:
> Let me see if I get this correct:
>
>> 4.5.0_rc3_1aaa57f5_00399
>>
>> sdc;10.218.128.17;4627942;1156985;18126
>> sdf;10.218.202.17;4590963;1147740;18272
>> sdk;10.218.203.17;4564980;1141245;18376
>> sdn;10.218.204.17;4571946;1142986;18348
>> sdd;10.219.128.17;4591717;1147929;18269
>> sdi;10.219.202.17;4505644;1126411;18618
>> sdg;10.219.203.17;4562001;1140500;18388
>> sdl;10.219.204.17;4583187;1145796;18303
>> sde;10.220.128.17;5511568;1377892;15220
>> sdh;10.220.202.17;5515555;1378888;15209
>> sdj;10.220.203.17;5609983;1402495;14953
>> sdm;10.220.204.17;5509035;1377258;15227
>
>
> In 1aaa57f5 you get on CIB ~115K IOPs per sd device
> and on CX3 you get around 140K IOPs per sd device.
>
>>
>> mlx5_0;sde;3593013;898253;23347 100% CPU kworker/u69:2
>> mlx5_0;sdd;3588555;897138;23376 100% CPU kworker/u69:2
>> mlx4_0;sdc;3525662;881415;23793 100% CPU kworker/u68:0
>
>
> Is this on the host or the target?
>
>> 4.5.0_rc5_7861728d_00001
>> sdc;10.218.128.17;3747591;936897;22384
>> sdf;10.218.202.17;3750607;937651;22366
>> sdh;10.218.203.17;3750439;937609;22367
>> sdn;10.218.204.17;3771008;942752;22245
>> sde;10.219.128.17;3867678;966919;21689
>> sdg;10.219.202.17;3781889;945472;22181
>> sdk;10.219.203.17;3791804;947951;22123
>> sdl;10.219.204.17;3795406;948851;22102
>> sdd;10.220.128.17;5039110;1259777;16647
>> sdi;10.220.202.17;4992921;1248230;16801
>> sdj;10.220.203.17;5015610;1253902;16725
>> sdm;10.220.204.17;5087087;1271771;16490
>
>
> In 7861728d you get on CIB ~95K IOPs per sd device
> and on CX3 you get around 125K IOPs per sd device.
>
> I don't see any difference in the code around iser/isert;
> in fact, I don't see any commits in drivers/infiniband in this range.
>
>
>>
>> mlx5_0;sde;2930722;732680;28623 ~98% CPU kworker/u69:0
>> mlx5_0;sdd;2910891;727722;28818 ~98% CPU kworker/u69:0
>> mlx4_0;sdc;3263668;815917;25703 ~98% CPU kworker/u68:0
>
>
> Again, host or target?
>
>> 4.5.0_rc5_f81bf458_00018
>> sdb;10.218.128.17;5023720;1255930;16698
>> sde;10.218.202.17;5016809;1254202;16721
>> sdj;10.218.203.17;5021915;1255478;16704
>> sdk;10.218.204.17;5021314;1255328;16706
>> sdc;10.219.128.17;4984318;1246079;16830
>> sdf;10.219.202.17;4986096;1246524;16824
>> sdh;10.219.203.17;5043958;1260989;16631
>> sdm;10.219.204.17;5032460;1258115;16669
>> sdd;10.220.128.17;3736740;934185;22449
>> sdg;10.220.202.17;3728767;932191;22497
>> sdi;10.220.203.17;3752117;938029;22357
>> sdl;10.220.204.17;3763901;940975;22287
>
>
> In f81bf458 you get on CIB ~125K IOPs per sd device
> and on CX3 around 93K IOPs per sd device, which is
> the other way around: here CIB is better than CX3?
>
> The commits in this gap are:
> f81bf458208e iser-target: Separate flows for np listeners and connections
> cma events
> aea92980601f iser-target: Add new state ISER_CONN_BOUND to isert_conn
> b89a7c25462b iser-target: Fix identification of login rx descriptor type
>
> None of those should affect the data-path.
>
>>
>> Srpt keeps crashing; couldn't test
>>
>> 4.5.0_rc5_5adabdd1_00023
>> sdc;10.218.128.17;3726448;931612;22511 ~97% CPU kworker/u69:4
>> sdf;10.218.202.17;3750271;937567;22368
>> sdi;10.218.203.17;3749266;937316;22374
>> sdj;10.218.204.17;3798844;949711;22082
>> sde;10.219.128.17;3759852;939963;22311 ~97% CPU kworker/u69:4
>> sdg;10.219.202.17;3772534;943133;22236
>> sdl;10.219.203.17;3769483;942370;22254
>> sdn;10.219.204.17;3790604;947651;22130
>> sdd;10.220.128.17;5171130;1292782;16222 ~96% CPU kworker/u68:3
>> sdh;10.220.202.17;5105354;1276338;16431
>> sdk;10.220.203.17;4995300;1248825;16793
>> sdm;10.220.204.17;4959564;1239891;16914
>
>
> In 5adabdd1 you get on CIB ~94K IOPs per sd device
> and on CX3 around 130K IOPs per sd device, which
> means it flipped again (very strange).
>
> The commits in this gap are:
> 5adabdd122e4 iser-target: Split and properly type the login buffer
> ed1083b251f0 iser-target: Remove ISER_RECV_DATA_SEG_LEN
> 26c7b673db57 iser-target: Remove impossible condition from isert_wait_conn
> 69c48846f1c7 iser-target: Remove redundant wait in release_conn
> 6d1fba0c2cc7 iser-target: Rework connection termination
>
> Again, none are suspected to implicate the data-plane.
>
>> Srpt crashes
>>
>> 4.5.0_rc5_07b63196_00027
>> sdb;10.218.128.17;3606142;901535;23262
>> sdg;10.218.202.17;3570988;892747;23491
>> sdf;10.218.203.17;3576011;894002;23458
>> sdk;10.218.204.17;3558113;889528;23576
>> sdc;10.219.128.17;3577384;894346;23449
>> sde;10.219.202.17;3575401;893850;23462
>> sdj;10.219.203.17;3567798;891949;23512
>> sdl;10.219.204.17;3584262;896065;23404
>> sdd;10.220.128.17;4430680;1107670;18933
>> sdh;10.220.202.17;4488286;1122071;18690
>> sdi;10.220.203.17;4487326;1121831;18694
>> sdm;10.220.204.17;4441236;1110309;18888
>
>
> In 07b63196 you get on CIB ~89K IOPs per sd device
> and on CX3 around 112K IOPs per sd device.
>
> The commits in this gap are:
> e3416ab2d156 iser-target: Kill the ->isert_cmd back pointer in struct
> iser_tx_desc
> d1ca2ed7dcf8 iser-target: Kill struct isert_rdma_wr
> 9679cc51eb13 iser-target: Convert to new CQ API
>
> These do affect the data-path, but nothing that can explain
> a CIB-specific issue. Moreover, the perf drop happened before that.
>
>> Srpt crashes
>>
>> 4.5.0_rc5_5e47f198_00036
>> sdb;10.218.128.17;3519597;879899;23834
>> sdi;10.218.202.17;3512229;878057;23884
>> sdh;10.218.203.17;3518563;879640;23841
>> sdk;10.218.204.17;3582119;895529;23418
>> sdd;10.219.128.17;3550883;887720;23624
>> sdj;10.219.202.17;3558415;889603;23574
>> sde;10.219.203.17;3552086;888021;23616
>> sdl;10.219.204.17;3579521;894880;23435
>> sdc;10.220.128.17;4532912;1133228;18506
>> sdf;10.220.202.17;4558035;1139508;18404
>> sdg;10.220.203.17;4601035;1150258;18232
>> sdm;10.220.204.17;4548150;1137037;18444
>
>
> Same results, and no commits were added, so that makes sense.
>
>
>> srpt crashes
>>
>> 4.6.2 vanilla default config
>> sde;10.218.128.17;3431063;857765;24449
>> sdf;10.218.202.17;3360685;840171;24961
>> sdi;10.218.203.17;3355174;838793;25002
>> sdm;10.218.204.17;3360955;840238;24959
>> sdd;10.219.128.17;3337288;834322;25136
>> sdh;10.219.202.17;3327492;831873;25210
>> sdj;10.219.203.17;3380867;845216;24812
>> sdk;10.219.204.17;3418340;854585;24540
>> sdc;10.220.128.17;4668377;1167094;17969
>> sdg;10.220.202.17;4716675;1179168;17785
>> sdl;10.220.203.17;4675663;1168915;17941
>> sdn;10.220.204.17;4631519;1157879;18112
>>
>> mlx5_0;sde;3390021;847505;24745 ~98% CPU kworker/u69:3
>> mlx5_0;sdd;3207512;801878;26153 ~98% CPU kworker/u69:3
>> mlx4_0;sdc;2998072;749518;27980 ~98% CPU kworker/u68:0
>>
>> 4.7.0_rc3_5edb5649
>> sdc;10.218.128.17;3260244;815061;25730
>> sdg;10.218.202.17;3405988;851497;24629
>> sdh;10.218.203.17;3307419;826854;25363
>> sdm;10.218.204.17;3430502;857625;24453
>> sdi;10.219.128.17;3544282;886070;23668
>> sdj;10.219.202.17;3412083;853020;24585
>> sdk;10.219.203.17;3422385;855596;24511
>> sdl;10.219.204.17;3444164;861041;24356
>> sdb;10.220.128.17;4803646;1200911;17463
>> sdd;10.220.202.17;4832982;1208245;17357
>> sde;10.220.203.17;4809430;1202357;17442
>> sdf;10.220.204.17;4808878;1202219;17444
>
>
>
> Here there is the new rdma_rw API, which doesn't make
> a difference in performance (no regression, but no
> improvement either).
>
>
> ------------------
> So, all in all, I still don't know what the root cause
> might be here.
>
> You mentioned that you are running fio over a filesystem. Is
> it possible to run your tests directly on the block devices? And
> can you run fio with direct I/O?
>
> Also, iser, srp and other RDMA ULPs are usually sensitive to
> the IRQ assignments of the HCA. An incorrect IRQ affinity assignment
> can bring all sorts of noise into performance tests. The normal
> practice to get the most out of the HCA is to spread the IRQ
> assignments linearly across all CPUs
> (https://community.mellanox.com/docs/DOC-1483).
> Did you take any steps to spread the IRQs? Is the irqbalance
> daemon on?
>
> It would be good to try to isolate the drop and make sure it
> is real and not just noise introduced by the IRQ assignments.


