On 2/24/25 22:15, John Meneghini wrote:
On 2/20/25 9:38 PM, Martin K. Petersen wrote:
John,
However, I agree it would be better to just fix the driver,
performance impact notwithstanding, and ship it. For my part I'd
rather have a correctly functioning driver, that's slower, but doesn't
panic.
I prefer to have a driver that doesn't panic when the user performs a
reasonably normal administrative action.
Agreed. The only clarification I want to make is that users will
not see a panic; they will see I/O timeouts and host bus resets.
It was my mistake to report earlier that the host would panic.
When aac_cpu_offline_feature is disabled, users will see higher performance,
but if they start off-lining CPUs they may see I/O timeouts. This is the
state of the current driver, and this is the problem that the original
patch:
commit 9dc704dcc09e ("scsi: aacraid: Reply queue mapping to CPUs based
on IRQ affinity")
was supposed to have fixed. The problem was that the original patch didn't
fix the issue correctly, and it resulted in the regression reported in
Bugzilla 217599[1].
This patch circles back and fixes the original problem correctly. The extra
'aac_cpu_offline_feature' modparam was added to disable the new code path
because of concerns raised during our testing at Red Hat about reduced
performance with this patch.
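For reference, a minimal sketch of how a boolean module parameter like this
is typically declared; the parameter name comes from the patch, but the
permissions, description text, and surrounding code here are illustrative,
not the actual aacraid implementation:

  #include <linux/module.h>

  /* Illustrative sketch only -- not the actual aacraid code. */
  static bool aac_cpu_offline_feature;	/* default: off (performance mode) */
  module_param(aac_cpu_offline_feature, bool, 0444);
  MODULE_PARM_DESC(aac_cpu_offline_feature,
		   "Use CPU-offline-safe reply queue handling (default: 0)");

At init time the driver then branches on aac_cpu_offline_feature to choose
between the hotplug-safe reply-queue path and the faster IRQ-affinity-based
mapping.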
If go-faster stripes are desired in specific configurations, then make
the performance mode an opt-in. Based on your benchmarks, however, I'm
not entirely convinced it's worth it...
I agree. So how about we just take out the
aac_cpu_offline_feature modparam...?
Alternatively, we can replace the modparam with a Kconfig option. The
default setting for the new Kconfig option would be offline_cpu_support_on
and performance_mode_off. That way we can ship a default kernel
configuration that provides a working aacraid driver which safely supports
off-lining CPUs. If people are really unhappy with the performance, and they
don't care about offline CPU support, they can reconfigure their kernel.
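As a rough sketch of what option 2 could look like on the C side, assuming
a hypothetical config symbol CONFIG_SCSI_AACRAID_CPU_OFFLINE (default y) --
the symbol name here is made up purely for illustration:

  #include <linux/kconfig.h>

  /* Hypothetical helper: gate the behaviour on a Kconfig symbol instead
   * of a module parameter.  IS_ENABLED() evaluates to 1 when the symbol
   * is set and to 0 when it is unset.
   */
  static inline bool aac_cpu_offline_enabled(void)
  {
	  return IS_ENABLED(CONFIG_SCSI_AACRAID_CPU_OFFLINE);
  }

Distributions would then get offline CPU support by default, and anyone who
wants the performance mode flips the symbol at build time.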
Personally I prefer option 1, but we need the thoughts of the upstream users.
I've added the original authors of Bugzilla 217599[1] to the cc list to
get their attention and review.
Do we have an idea what these 'specific use-cases' are?
And how much performance impact do we have?
I could imagine a single-threaded workload driving just one blk-mq queue
would benefit from spreading out onto several interrupts.
But then, this would be true for most of the multiqueue drivers; and
indeed quite a few drivers (e.g. megaraid_sas and mpt3sas with
'smp_affinity_enable') have the very same module option.
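For illustration, this is roughly what such a toggle ends up looking like
in a driver's queue-mapping path; apart from 'smp_affinity_enable' the
names below are made up, and this is a generic sketch assuming the
blk_mq_pci_map_queues() helper, not the actual megaraid_sas or mpt3sas
code:

  #include <linux/module.h>
  #include <linux/pci.h>
  #include <linux/blk-mq-pci.h>
  #include <scsi/scsi_host.h>

  /* Generic sketch of an 'smp_affinity_enable'-style toggle. */
  static bool smp_affinity_enable = true;
  module_param(smp_affinity_enable, bool, 0444);
  MODULE_PARM_DESC(smp_affinity_enable,
		   "Map hw queues to CPUs based on IRQ affinity (default: 1)");

  struct example_hba {			/* hypothetical per-adapter data */
	  struct pci_dev *pdev;
  };

  static void example_map_queues(struct Scsi_Host *shost)
  {
	  struct example_hba *hba = shost_priv(shost);
	  struct blk_mq_queue_map *qmap =
		  &shost->tag_set.map[HCTX_TYPE_DEFAULT];

	  if (smp_affinity_enable)
		  /* spread hw queues according to the PCI IRQ affinity masks */
		  blk_mq_pci_map_queues(qmap, hba->pdev, 0);
	  else
		  /* default spreading, independent of IRQ affinity */
		  blk_mq_map_queues(qmap);
  }

The toggle just selects which mapping helper is used when the tag set is
set up, which is why every driver ends up carrying nearly identical code.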
Wouldn't it be an idea to check if we can make this a generic blk-mq
queue option instead of having each driver implement the same
functionality on its own?
Topic for LSF?
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@xxxxxxx +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich