RE: [PATCH 11/11] scsi: storvsc: Re-init stor_chns when a channel interrupt is re-assigned

Long Li <longli@xxxxxxxxxxxxx> · Wed, 8 Apr 2020 02:25:52 +0000

>Subject: Re: [PATCH 11/11] scsi: storvsc: Re-init stor_chns when a channel
>interrupt is re-assigned
>
>> >@@ -621,6 +621,63 @@ static inline struct storvsc_device *get_in_stor_device(
>> >
>> > }
>> >
>> >+void storvsc_change_target_cpu(struct vmbus_channel *channel, u32 old, u32 new) {
>> >+	struct storvsc_device *stor_device;
>> >+	struct vmbus_channel *cur_chn;
>> >+	bool old_is_alloced = false;
>> >+	struct hv_device *device;
>> >+	unsigned long flags;
>> >+	int cpu;
>> >+
>> >+	device = channel->primary_channel ?
>> >+			channel->primary_channel->device_obj
>> >+				: channel->device_obj;
>> >+	stor_device = get_out_stor_device(device);
>> >+	if (!stor_device)
>> >+		return;
>> >+
>> >+	/* See storvsc_do_io() -> get_og_chn(). */
>> >+	spin_lock_irqsave(&device->channel->lock, flags);
>> >+
>> >+	/*
>> >+	 * Determines if the storvsc device has other channels assigned to
>> >+	 * the "old" CPU to update the alloced_cpus mask and the stor_chns
>> >+	 * array.
>> >+	 */
>> >+	if (device->channel != channel && device->channel->target_cpu == old) {
>> >+		cur_chn = device->channel;
>> >+		old_is_alloced = true;
>> >+		goto old_is_alloced;
>> >+	}
>> >+	list_for_each_entry(cur_chn, &device->channel->sc_list, sc_list) {
>> >+		if (cur_chn == channel)
>> >+			continue;
>> >+		if (cur_chn->target_cpu == old) {
>> >+			old_is_alloced = true;
>> >+			goto old_is_alloced;
>> >+		}
>> >+	}
>> >+
>> >+old_is_alloced:
>> >+	if (old_is_alloced)
>> >+		WRITE_ONCE(stor_device->stor_chns[old], cur_chn);
>> >+	else
>> >+		cpumask_clear_cpu(old, &stor_device->alloced_cpus);
>>
>> If the old cpu is not allocated, is it still necessary to do a cpumask_clear_cpu?
>
>AFAICT, this really depends on how much we "believe" in the current heuristic
>(as implemented by get_og_chn()):  ;-)
>
>The cpumask_clear_cpu() (and the below, dependent "flush" as well) are
>intended to re-initialize alloced_cpus and stor_chns in order for get_og_chn()
>to re-process/update them.
>
>Also, notice that (both in the current code and after this series) the CPUs
>in alloced_cpus can't be offlined, and get_og_chn() does rely on this
>property (cf., e.g., the loop/check over alloced_cpus/node_mask).
>
>I suspect that giving up on this invariant/property would require a certain
>amount of re-design in the heuristic/code in question...
>
>
>> >@@ -1360,7 +1432,14 @@ static int storvsc_do_io(struct hv_device *device,
>> > 			}
>> > 		}
>> > 	} else {
>> >+		spin_lock_irqsave(&device->channel->lock, flags);
>> >+		outgoing_channel = stor_device->stor_chns[q_num];
>> >+		if (outgoing_channel != NULL) {
>> >+			spin_unlock_irqrestore(&device->channel->lock, flags);
>>
>> Checking outgoing_channel again seems unnecessary. Why not just call
>> get_og_chn()?
>
>target_cpu_store() might have changed stor_chns (and alloced_cpus) in the
>meantime (but before we've acquired the device's lock): the double check is
>to make sure we have a "consistent"/an up-to-date view of stor_chns and
>alloced_cpus.
>
>
>>
>> >+			goto found_channel;
>> >+		}
>> > 		outgoing_channel = get_og_chn(stor_device, q_num);
>> >+		spin_unlock_irqrestore(&device->channel->lock, flags);
>> > 	}
>>
>> With device->channel->lock, now we have one more lock on the I/O issuing
>> path. It doesn't seem optimal: you are trying to protect the code in
>> storvsc_change_target_cpu(), and this doesn't need to block concurrent I/O
>> issuers. Maybe moving to RCU is a better approach?
>
>I don't see this as a problem (and I've validated this conclusion in
>experiments, where the "patched kernel" was sometimes performing slightly
>better than the "unpatched kernel" and sometimes slightly worse...):
>
>On the one hand, the stor_chns array "stabilizes" quite early after system
>initialization in "normal" (i.e., common) situations (i.e., no channel
>reassignments, no device hotplugs...); IOW, get_og_chn() really represents
>the "rare and slow" path here (but not that slow, after all...).
>Furthermore, notice that even in those "rare cases" the number of
>"contending" channels is limited to at most 1 per 4 CPUs IIRC
>(alloced_cpus is "sparsely populated"...).

Yes, I realize it is on the slow path. There is no need to optimize the locking.

Reviewed-by: Long Li <longli@xxxxxxxxxxxxx>

>
>The latencies of the RCU grace period (on the order of milliseconds) would be a
>major concern for the adoption of RCU here (at least, if we continue to
>consider get_og_chn() as an "updater").  I'm afraid that this could be "too
>slow" even for our slow path...  ;-/
>
>What am I missing?  ;-)
>
>Thanks,
>  Andrea
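
For reference, the "double check" discussed above can be sketched in user
space roughly as follows. This is only an illustration under stated
assumptions, not the driver code: a pthread mutex stands in for
device->channel->lock, C11 atomics stand in for READ_ONCE()/WRITE_ONCE(),
and the names slots, lookup(), pick_and_cache() and invalidate() are all
hypothetical. The reader first tries a lockless lookup; on a miss it takes
the lock and looks again, because an updater may have changed the slot
between the first read and the lock acquisition.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_SLOTS 8	/* hypothetical; stands in for the per-CPU stor_chns[] */

struct chan { int id; };	/* hypothetical stand-in for a channel */

static struct chan chans[2] = { { 0 }, { 1 } };
static _Atomic(struct chan *) slots[NR_SLOTS];	/* zero-initialized: all NULL */
static pthread_mutex_t slots_lock = PTHREAD_MUTEX_INITIALIZER;
static int slow_path_calls;	/* counts slow-path runs, for illustration only */

/*
 * Hypothetical stand-in for get_og_chn(): pick a channel and cache it.
 * The caller must hold slots_lock.
 */
static struct chan *pick_and_cache(unsigned int q)
{
	struct chan *c = &chans[q % 2];

	slow_path_calls++;
	atomic_store(&slots[q], c);
	return c;
}

static struct chan *lookup(unsigned int q)
{
	struct chan *c;

	assert(q < NR_SLOTS);

	/* Fast path: lockless read of the cached entry. */
	c = atomic_load(&slots[q]);
	if (c)
		return c;

	/* Slow path: re-check under the lock before recomputing. */
	pthread_mutex_lock(&slots_lock);
	c = atomic_load(&slots[q]);
	if (!c)
		c = pick_and_cache(q);
	pthread_mutex_unlock(&slots_lock);
	return c;
}

/* Updater side: clear a slot so that the next lookup() recomputes it. */
static void invalidate(unsigned int q)
{
	assert(q < NR_SLOTS);

	pthread_mutex_lock(&slots_lock);
	atomic_store(&slots[q], (struct chan *)NULL);
	pthread_mutex_unlock(&slots_lock);
}
```

In this sketch the lock is only taken when a slot is empty, so once the
array has "stabilized" the readers never contend on it.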