Re: [Lsf-pc] [LSF/MM TOPIC] iSCSI MQ adoption via MCS discussion

On 1/9/2015 10:19 PM, Mike Christie wrote:
On 01/09/2015 12:28 PM, Hannes Reinecke wrote:
On 01/09/2015 07:00 PM, Michael Christie wrote:

On Jan 8, 2015, at 11:03 PM, Nicholas A. Bellinger <nab@xxxxxxxxxxxxxxx> wrote:

On Thu, 2015-01-08 at 15:22 -0800, James Bottomley wrote:
On Thu, 2015-01-08 at 14:57 -0800, Nicholas A. Bellinger wrote:
On Thu, 2015-01-08 at 14:29 -0800, James Bottomley wrote:
On Thu, 2015-01-08 at 14:16 -0800, Nicholas A. Bellinger wrote:

<SNIP>

The point is that a simple session wide counter for command sequence
number assignment is significantly less overhead than all of the
overhead associated with running a full multipath stack atop multiple
sessions.

I don't see how that's relevant to issue speed, which was the measure we
were using: the layers above are just a hopper.  As long as they're
loaded, the MQ lower layer can issue at full speed.  So as long as the
multipath hopper is efficient enough to keep the queues loaded, there's
no speed degradation.

The problem with a sequence point inside the MQ issue layer is that it
can cause a stall that reduces the issue speed, so the counter sequence
point degrades the issue speed relative to the multipath hopper approach
above, even if the multipath approach has a higher CPU overhead.

Now, if the system is close to 100% CPU already, *then* the multipath
overhead will try to take CPU power we don't have and cause a stall, but
that's only in the flat-out CPU case.

Not to mention that our iSCSI/iSER initiator is already taking a session
wide lock when sending outgoing PDUs, so adding a session wide counter
isn't adding any additional synchronization overhead vs. what's already
in place.
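
To make that concrete, here is a minimal, untested sketch of the point:
the CmdSN counter simply rides under the session-wide TX lock that every
outgoing PDU already takes, so no new synchronization is introduced. The
structure and names below are made up for illustration, not the actual
open-iscsi code.

#include <linux/spinlock.h>
#include <linux/types.h>

/*
 * Illustrative only: CmdSN assignment piggybacking on the session-wide
 * lock the initiator already holds while queueing an outgoing PDU.
 * Names are hypothetical, not the real open-iscsi structures.
 */
struct example_session {
	spinlock_t	tx_lock;	/* already taken for every TX PDU */
	u32		cmdsn;		/* next command sequence number   */
};

static u32 example_assign_cmdsn(struct example_session *sess)
{
	u32 sn;

	spin_lock(&sess->tx_lock);	/* no new lock: reuse the TX lock */
	sn = sess->cmdsn++;
	spin_unlock(&sess->tx_lock);

	return sn;
}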

I'll leave it up to the iSER people to decide whether they're redoing
this as part of the MQ work.


Session wide command sequence number synchronization isn't something to
be removed as part of the MQ work.  It's an iSCSI/iSER protocol
requirement.

That is, the expected + maximum sequence numbers are returned as part of
every response PDU, which the initiator uses to determine when the
command sequence number window is open so new non-immediate commands may
be sent to the target.
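
As a rough, hypothetical sketch of that window check (the helper name is
invented, not the libiscsi one): a new non-immediate command may be
issued while its CmdSN does not pass the MaxCmdSN advertised in the most
recent response PDU, using RFC 3720 serial number arithmetic.

#include <linux/types.h>

/*
 * Hypothetical check: is the command sequence number window open for a
 * new non-immediate command?  MaxCmdSN (alongside ExpCmdSN) comes from
 * the last response PDU; the comparison is serial number arithmetic
 * modulo 2^32.
 */
static bool example_cmdsn_window_open(u32 cmdsn, u32 max_cmdsn)
{
	/* true when cmdsn <= max_cmdsn, with wrap-around handled */
	return (s32)(max_cmdsn - cmdsn) >= 0;
}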

So, given some manner of session wide synchronization is required
between different contexts for the existing single connection case to
update the command sequence number and check when the window opens, it's
a fallacy to claim MC/S adds some type of new initiator specific
synchronization overhead vs. single connection code.

I think you are assuming we are leaving the iscsi code as it is today.

For the non-MC/S, MQ session-per-CPU design, we would be allocating and
binding the session and its resources to specific CPUs. They would only
be accessed by the threads on that one CPU, so we get our
serialization/synchronization from that. That is why we are saying we
do not need something like atomic_t/spinlocks for the sequence number
handling for this type of implementation.
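
Something like the following untested sketch is what that amounts to;
the names are invented for illustration and not from any posted patch.
Because each CPU owns its own session, the counter is a plain integer
with no atomic_t or spinlock.

#include <linux/percpu.h>
#include <linux/types.h>

/*
 * Hypothetical per-CPU session layout for the non-MC/S session-per-CPU
 * design: only threads bound to the owning CPU ever touch it, so the
 * sequence number needs no atomic_t or spinlock.
 */
struct example_cpu_session {
	u32	cmdsn;		/* touched only from the owning CPU */
	/* connection, socket, preallocated tasks, ... */
};

static DEFINE_PER_CPU(struct example_cpu_session, example_sessions);

static u32 example_next_cmdsn(void)
{
	/* caller is pinned to this CPU (or has preemption disabled) */
	return this_cpu_ptr(&example_sessions)->cmdsn++;
}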

Wouldn't that need to be coordinated with the networking layer?

Yes.

Doesn't it do the same thing, matching TX/RX queues to CPUs?

Yes.


Hey Hannes, Mike,

I would say there is no need for specific coordination from the iSCSI PoV.
This is exactly what flow steering is designed for. As I see it, in
order to get TX/RX onto matching rings, the user can attach 5-tuple rules
(using standard ethtool) to steer packets to the right rings.

Sagi.

If so, wouldn't we decrease bandwidth by restricting things to one CPU?

We have a session or connection per CPU though, so we end up hitting the
same problem you talked about last year, where one hctx (iSCSI session or
connection's socket, or NIC hw queue) could get overloaded. This is what
I meant in my original mail: iSCSI would rely on whatever blk-mq load
balancers we end up implementing at that layer to balance requests
across hctxs.


I'm not sure I understand,

The submission flow is CPU-bound. In the current single-queue model
both CPU X and CPU Y will end up using a single socket. In the
multi-queue solution, CPU X will go to socket X and CPU Y will go to
socket Y. This is equal to what we have today (if only CPU X is active)
or better (if more CPUs are active).
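
Concretely, I mean something like the hypothetical per-CPU connection
pick below (names made up): the submitting CPU always lands on its own
connection/socket.

#include <linux/smp.h>

struct example_conn;	/* placeholder for a per-CPU connection/socket */

/*
 * Hypothetical queue selection for the multi-queue model: CPU X always
 * submits on connection/socket X, CPU Y on connection/socket Y.
 */
struct example_mq_session {
	unsigned int		nr_conns;
	struct example_conn	**conns;	/* one per CPU / hw queue */
};

static struct example_conn *
example_pick_conn(struct example_mq_session *sess)
{
	return sess->conns[raw_smp_processor_id() % sess->nr_conns];
}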

Am I missing something?

Sagi.