FUJITA Tomonori wrote:
> [...] this improves the performance. One target box exports four Intel SSD drives as four logical units to one initiator box with 10GbE. The initiator runs disktest against each logical unit. The initiator gets about 500 MB/s in total. Running four tgt processes (with the "-C" option) produces about 850 MB/s in total. This patch also gives about 850 MB/s. Seems that one process can handle 10GbE I/Os, however, it's not enough to handle 10GbE _and_ the signalfd load generated by fast disk I/Os.
Hi Tomo,
Reading your email, I wasn't sure whether, under the first test, each of
the LUNs was exported through a different iSCSI target (with all four
targets served by the same process) or all LUNs through the same iSCSI target.
With one session being established per target, and possible per-session
bottlenecks on both the initiator and the target side, I think the
performance case for the change should rest on at least three numbers,
e.g. <one process / one target / four LUNs> vs <one process / four
targets / one LUN each> vs, with the patch applied, <four tgt
pthreads / one target and LUN associated with each of them>. If you want to
dig further, you can also test with only two target pthreads.
From some runs I made with multiple tgt processes, I saw that the
performance difference is notable when going from one to two target
processes, while adding further target processes has a less notable
effect. More complexity comes into play when you hit a
bottleneck on the initiator side and then have to throw another
initiator in ...
One more aspect of adding more and more target processes or pthreads is
the CPU contention caused by the per-target main tgt thread, i.e. the one
that interacts with the network and reaps the completions from the
backing store. When SSDs are used, typically, or at least in many cases, I
believe the SSD vendor's software stack uses (kernel) threads which
maintain the look-aside tables for the flash, etc. As you add more tgt
processes, at some point your CPU consumption might be non-optimal for
the system.
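
To make the role of that per-target main thread a bit more concrete, here is a
rough sketch in C (illustrative only, not tgt's actual code) of one thread
multiplexing the network sockets and the backing-store completion
notifications in a single epoll loop, with completions assumed to be delivered
as a real-time signal read back through signalfd():

#include <sys/epoll.h>
#include <sys/signalfd.h>
#include <signal.h>
#include <unistd.h>

/* One "main" event loop per target/process; error checking omitted. */
static void target_main_loop(int listen_fd)
{
	sigset_t mask;
	struct epoll_event ev, events[64];
	int epfd, sfd, i, n;

	/* Block SIGRTMIN so completions are only seen via the signalfd. */
	sigemptyset(&mask);
	sigaddset(&mask, SIGRTMIN);
	sigprocmask(SIG_BLOCK, &mask, NULL);

	sfd = signalfd(-1, &mask, SFD_NONBLOCK);
	epfd = epoll_create1(0);

	ev.events = EPOLLIN;
	ev.data.fd = listen_fd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

	ev.events = EPOLLIN;
	ev.data.fd = sfd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, sfd, &ev);

	for (;;) {
		n = epoll_wait(epfd, events, 64, -1);
		for (i = 0; i < n; i++) {
			if (events[i].data.fd == sfd) {
				struct signalfd_siginfo si;

				/* Drain backing-store completion signals and
				 * finish the corresponding commands. */
				while (read(sfd, &si, sizeof(si)) == sizeof(si))
					;
			} else {
				/* Network side: read PDUs and queue new
				 * commands to the backing store.  This is the
				 * work that competes with completion reaping
				 * for the same CPU. */
			}
		}
	}
}

Everything above runs in one such thread per target (or per process), so the
network processing, the signalfd draining and whatever kernel threads the SSD
stack itself runs all end up competing for CPU as you scale the count up.
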
Another point to consider: when the number of target processes gets
larger, protocols whose per-process hardware endpoint, such as an
RDMA Completion Queue (CQ), is associated with a dedicated/logical
interrupt handler will give the NIC a harder time coalescing interrupts
caused by completions from different sessions. Under your patch, with the
model being thread based, it might be possible to have multiple
pthreads/targets use the same CQ, but then they would need to access it
through a lock, which may cause other issues.
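
For the shared-CQ variant, the shape would be roughly the following (a sketch
against the libibverbs API; everything except the verbs calls themselves is
illustrative):

#include <infiniband/verbs.h>
#include <pthread.h>

struct shared_cq {
	struct ibv_cq *cq;
	pthread_mutex_t lock;
};

/* Called from each target pthread's event loop. */
static int reap_completions(struct shared_cq *scq)
{
	struct ibv_wc wc[16];
	int i, n;

	/* Polling is serialized across the target pthreads; this lock is
	 * where the extra contention would come from. */
	pthread_mutex_lock(&scq->lock);
	n = ibv_poll_cq(scq->cq, 16, wc);
	pthread_mutex_unlock(&scq->lock);

	for (i = 0; i < n; i++) {
		if (wc[i].status != IBV_WC_SUCCESS)
			continue; /* error handling omitted */
		/* Dispatch wc[i] to the session owning wc[i].wr_id. */
	}
	return n;
}

The alternative of a CQ per pthread keeps the polling lock-free, but brings
back the interrupt-coalescing concern described above.
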
> I plan to clean up and merge this.

Maybe this can be made optional at this point?
Or.