On 01/11/2015 03:40 AM, Sagi Grimberg wrote:
> On 1/9/2015 10:19 PM, Mike Christie wrote:
>> On 01/09/2015 12:28 PM, Hannes Reinecke wrote:
>>> On 01/09/2015 07:00 PM, Michael Christie wrote:
>>>>
>>>> On Jan 8, 2015, at 11:03 PM, Nicholas A. Bellinger
>>>> <nab@xxxxxxxxxxxxxxx> wrote:
>>>>
>>>>> On Thu, 2015-01-08 at 15:22 -0800, James Bottomley wrote:
>>>>>> On Thu, 2015-01-08 at 14:57 -0800, Nicholas A. Bellinger wrote:
>>>>>>> On Thu, 2015-01-08 at 14:29 -0800, James Bottomley wrote:
>>>>>>>> On Thu, 2015-01-08 at 14:16 -0800, Nicholas A. Bellinger wrote:
>>>>>
>>>>> <SNIP>
>>>>>
>>>>>>> The point is that a simple session wide counter for command
>>>>>>> sequence number assignment is significantly less overhead than
>>>>>>> all of the overhead associated with running a full multipath
>>>>>>> stack atop multiple sessions.
>>>>>>
>>>>>> I don't see how that's relevant to issue speed, which was the
>>>>>> measure we were using: The layers above are just a hopper. As
>>>>>> long as they're loaded, the MQ lower layer can issue at full
>>>>>> speed. So as long as the multipath hopper is efficient enough to
>>>>>> keep the queues loaded there's no speed degradation.
>>>>>>
>>>>>> The problem with a sequence point inside the MQ issue layer is
>>>>>> that it can cause a stall that reduces the issue speed. so the
>>>>>> counter sequence point causes a degraded issue speed over the
>>>>>> multipath hopper approach above even if the multipath approach
>>>>>> has a higher CPU overhead.
>>>>>>
>>>>>> Now, if the system is close to 100% cpu already, *then* the
>>>>>> multipath overhead will try to take CPU power we don't have and
>>>>>> cause a stall, but it's only in the flat out CPU case.
>>>>>>
>>>>>>> Not to mention that our iSCSI/iSER initiator is already taking a
>>>>>>> session wide lock when sending outgoing PDUs, so adding a
>>>>>>> session wide counter isn't adding any additional synchronization
>>>>>>> overhead vs. what's already in place.
>>>>>>
>>>>>> I'll leave it up to the iSER people to decide whether they're
>>>>>> redoing this as part of the MQ work.
>>>>>>
>>>>>
>>>>> Session wide command sequence number synchronization isn't
>>>>> something to be removed as part of the MQ work. It's a iSCSI/iSER
>>>>> protocol requirement.
>>>>>
>>>>> That is, the expected + maximum sequence numbers are returned as
>>>>> part of every response PDU, which the initiator uses to determine
>>>>> when the command sequence number window is open so new
>>>>> non-immediate commands may be sent to the target.
>>>>>
>>>>> So, given some manner of session wide synchronization is required
>>>>> between different contexts for the existing single connection case
>>>>> to update the command sequence number and check when the window
>>>>> opens, it's a fallacy to claim MC/S adds some type of new
>>>>> initiator specific synchronization overhead vs. single connection
>>>>> code.
>>>>
>>>> I think you are assuming we are leaving the iscsi code as it is
>>>> today.
>>>>
>>>> For the non-MCS mq session per CPU design, we would be allocating
>>>> and binding the session and its resources to specific CPUs. They
>>>> would only be accessed by the threads on that one CPU, so we get
>>>> our serialization/synchronization from that. That is why we are
>>>> saying we do not need something like atomic_t/spin_locks for the
>>>> sequence number handling for this type of implementation.
>>>>
>>> Wouldn't that need to be coordinated with the networking layer?
>>
>> Yes.
>>
>>> Doesn't it do the same thing, matching TX/RX queues to CPUs?
>>
>> Yes.
>>
>
> Hey Hannes, Mike,
>
> I would say there is no need for specific coordination from iSCSI PoV.
> This is exactly what flow steering is designed for. As I see it, in
> order to get the TX/RX to match rings, the user can attach 5-tuple
> rules (using standard ethtool) to steer packets to the right rings.
>
> Sagi.
>
>>> If so, wouldn't we decrease bandwidth by restricting things to one
>>> CPU?
>>
>> We have a session or connection per CPU though, so we end up hitting
>> the same problem you talked about last year where one hctx (iscsi
>> session or connection's socket or nic hw queue) could get overloaded.
>> This is what I meant in my original mail where iscsi would rely on
>> whatever blk/mq load balancers we end up implementing at that layer
>> to balance requests across hctxs.
>>
>
> I'm not sure I understand,
>
> The submission flow is CPU bound. In the current single queue model
> both CPU X and CPU Y will end up using a single socket. In the
> multi-queue solution, CPU X will go to socket X and CPU Y will go to
> socket Y. This is equal to what we have today (if only CPU X is
> active) or better (if more CPUs are active).
>
> Am I missing something?

I did not take Hannes's comment as comparing what we have today vs the
proposal. I thought he was referring to the problem he was talking
about at LSF last year and saying there could be cases where we want to
spread IO across CPUs/queues and some cases where we would want to
execute on the CPU we were originally submitted on.

I was just saying the iscsi layer would not control that and would rely
on the blk/mq layer to handle this or tell us what to do similar to
what we do for the rq_affinity setting.
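
For readers following the CmdSN discussion quoted above, here is a rough
userspace sketch of the window check Nicholas describes: the initiator
tracks the ExpCmdSN/MaxCmdSN values carried in response PDUs and only
assigns a new CmdSN while the window is open. This is not the libiscsi
code. The struct, field and function names are made up for illustration,
and the update rules are my reading of the serial number arithmetic in
RFC 3720, so treat it as an assumption-laden sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-session window state; names are illustrative only. */
struct sess_window {
	uint32_t cmdsn;     /* next CmdSN to assign to a non-immediate cmd */
	uint32_t exp_cmdsn; /* newest ExpCmdSN seen in a response PDU */
	uint32_t max_cmdsn; /* newest MaxCmdSN seen in a response PDU */
};

/*
 * Serial number comparisons modulo 2^32, so the checks keep working
 * when the counters wrap. Assumes the usual two's complement
 * conversion from uint32_t to int32_t.
 */
static bool sna_lt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

static bool sna_lte(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) <= 0;
}

/* The window is open while the next CmdSN does not exceed MaxCmdSN. */
static bool window_open(const struct sess_window *w)
{
	return sna_lte(w->cmdsn, w->max_cmdsn);
}

/* Fold the ExpCmdSN/MaxCmdSN pair from one response PDU into the state. */
static void update_window(struct sess_window *w, uint32_t exp, uint32_t max)
{
	/* A pair with MaxCmdSN < ExpCmdSN - 1 is ignored as invalid. */
	if (sna_lt(max, exp - 1))
		return;

	/* Only move forward; stale values from older PDUs are dropped. */
	if (sna_lt(w->max_cmdsn, max))
		w->max_cmdsn = max;
	if (sna_lt(w->exp_cmdsn, exp))
		w->exp_cmdsn = exp;
}

/*
 * Try to assign a CmdSN. Returns false when the window is closed and
 * the command has to wait for a later response PDU to open it.
 */
static bool queue_cmd(struct sess_window *w, uint32_t *assigned)
{
	if (!window_open(w))
		return false;

	*assigned = w->cmdsn++;
	return true;
}
```

Nothing in the sketch is protected by a lock or atomic. If several
contexts share one struct sess_window, as in the session wide case,
queue_cmd() and update_window() need some form of serialization. If
each session and its window are only ever touched from one CPU, as in
the session per CPU proposal, the plain increment is enough.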