Re: [PATCH] [v2]aacraid: Reply queue mapping to CPUs based on IRQ affinity

Sorry, we didn't see a panic with the offline cpu test; we saw a Call Trace.

Here's a little more information from our testing.

These are notes from our QA group, who tested this patch.

---

With "aac_cpu_offline_feature=0", the system continued to have issues with the offline_cpu test:

Sep 03 14:49:45 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel: aacraid: Host adapter abort request.
                                                                 aacraid: Outstanding commands on (0,1,3,0):
Sep 03 14:50:12 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel: aacraid: Host adapter abort request.
                                                                 aacraid: Outstanding commands on (0,1,3,0):
Sep 03 14:50:15 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel: aacraid: Host adapter abort request.
                                                                 aacraid: Outstanding commands on (0,1,3,0):
Sep 03 14:50:22 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel: aacraid: Host adapter abort request.
                                                                 aacraid: Outstanding commands on (0,1,3,0):
Sep 03 14:50:22 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel: aacraid: Host bus reset request. SCSI hang ?
Sep 03 14:50:22 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel: aacraid 0000:84:00.0: outstanding cmd: midlevel-0
Sep 03 14:50:22 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel: aacraid 0000:84:00.0: outstanding cmd: lowlevel-0
Sep 03 14:50:22 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel: aacraid 0000:84:00.0: outstanding cmd: error handler-2
Sep 03 14:50:22 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel: aacraid 0000:84:00.0: outstanding cmd: firmware-0
Sep 03 14:50:22 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel: aacraid 0000:84:00.0: outstanding cmd: kernel-0
Sep 03 14:50:23 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel: aacraid 0000:84:00.0: Controller reset type is 3
Sep 03 14:50:23 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel: aacraid 0000:84:00.0: Issuing IOP reset
Sep 03 14:52:03 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel: INFO: task kworker/u513:2:478 blocked for more than 122 seconds.
Sep 03 14:52:03 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel:       Not tainted 5.14.0-503.5118_1431178045.el9.x86_64 #1
Sep 03 14:52:03 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 03 14:52:03 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel: task:kworker/u513:2  state:D stack:0     pid:478   tgid:478   ppid:2      flags:0x00004000
Sep 03 14:52:03 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel: Workqueue: xfs-cil/dm-0 xlog_cil_push_work [xfs]
Sep 03 14:52:03 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel: Call Trace:
Sep 03 14:52:03 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel:  <TASK>
Sep 03 14:52:03 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel:  __schedule+0x229/0x550
Sep 03 14:52:03 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel:  schedule+0x2e/0xd0
Sep 03 14:52:03 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel:  xlog_wait_on_iclog+0x16b/0x180 [xfs]
Sep 03 14:52:03 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel:  ? __pfx_default_wake_function+0x10/0x10
Sep 03 14:52:03 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel:  xlog_cil_push_work+0x6c6/0x700 [xfs]
Sep 03 14:52:03 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel:  ? srso_return_thunk+0x5/0x5f
Sep 03 14:52:03 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel:  process_one_work+0x197/0x380
Sep 03 14:52:03 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel:  worker_thread+0x2fe/0x410
Sep 03 14:52:03 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel:  ? __pfx_worker_thread+0x10/0x10
Sep 03 14:52:03 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel:  kthread+0xe0/0x100
Sep 03 14:52:03 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel:  ? __pfx_kthread+0x10/0x10
Sep 03 14:52:03 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel:  ret_from_fork+0x2c/0x50
Sep 03 14:52:03 storageqe-34.fast.eng.rdu2.dc.redhat.com kernel:  </TASK>

Using John's test kernel, I enabled the aac_cpu_offline_feature modparam.

I then rebooted and ran the offline cpu test - no crashes or hangs were observed.
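For anyone trying to reproduce this, the offline cpu test boils down to cycling each non-boot CPU through its sysfs "online" file. The loop below is a hypothetical sketch (the actual QA script was not posted); it only prints the writes it would perform, so it can be run without root - drop the printf wrapping to do the real thing:

```shell
# Hypothetical sketch of the offline/online CPU cycle; prints the sysfs
# writes instead of performing them (the real test needs root).
cycle_cpus() {
    for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
        # cpu0 usually has no 'online' file (boot CPU); skip anything
        # that cannot be offlined
        [ -e "$cpu/online" ] || continue
        printf 'echo 0 > %s/online\n' "$cpu"   # take the CPU offline
        printf 'echo 1 > %s/online\n' "$cpu"   # bring it back
    done
}
cycle_cpus
```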

I generated I/O with FIO and observed the following stats:

# fio -filename=/home/test1G.img -iodepth=64 -thread -rw=randwrite -ioengine=libaio -bs=4K -direct=1 -runtime=300 -time_based -size=1G -group_reporting -name=mytest -numjobs=4

  WRITE: bw=495MiB/s (519MB/s), 495MiB/s-495MiB/s (519MB/s-519MB/s), io=145GiB (156GB), run=300001-300001msec

I then ran FIO again with aacraid aac_cpu_offline_feature=0 - statistics below:

# fio -filename=/home/test1G.img -iodepth=64 -thread -rw=randwrite -ioengine=libaio -bs=4K -direct=1 -runtime=300 -time_based -size=1G -group_reporting -name=mytest -numjobs=4

  WRITE: bw=505MiB/s (529MB/s), 505MiB/s-505MiB/s (529MB/s-529MB/s), io=148GiB (159GB), run=300001-300001msec
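As a quick back-of-the-envelope check on the two runs above (assuming the nominal 300 s runtime fio reports), bandwidth times runtime reproduces the io= totals, and the gap between the two configurations works out to roughly 2%:

```python
# Cross-check the two fio summaries: total IO written = bandwidth * runtime.
RUNTIME_S = 300            # fio ran for 300 s (run=300001msec)

def total_gib(bw_mib_per_s: float) -> float:
    """Total data written in GiB for a run at the given MiB/s."""
    return bw_mib_per_s * RUNTIME_S / 1024

on_gib = total_gib(495)    # aac_cpu_offline_feature=1 run
off_gib = total_gib(505)   # aac_cpu_offline_feature=0 (default) run
delta_pct = (505 - 495) / 505 * 100

print(f"feature on : {on_gib:.0f} GiB")   # matches io=145GiB
print(f"feature off: {off_gib:.0f} GiB")  # matches io=148GiB
print(f"penalty    : {delta_pct:.1f}%")   # about a 2% throughput cost
```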

/John

On 2/13/25 5:03 PM, John Meneghini wrote:
From: Martin K. Petersen
Sent: Wednesday, February 12, 2025 6:56 PM

Appears to still be a problem. I'll work with Sagar and see if we can clean this up.

Add a new modparam "aac_cpu_offline_feature" to control CPU offlining. By default it is disabled (0), but it can be enabled at driver load time with:

        insmod ./aacraid.ko aac_cpu_offline_feature=1

We are very hesitant when it comes to adding new module parameters. And
why wouldn't you want offlining to just work? Is the performance penalty
really substantial enough that we have to introduce an explicit "don't
be broken" option?

Yes, this is something we debated internally before asking Sagar to send this patch.

I agree that it would be much better if we simply fixed the driver and made offline_cpu support work.

The modparam was added as a compromise, to allow current users and customers who do NOT care about
cpu_offline support to keep the increased performance they want. People generally complain any
time there is a performance regression.

The current upstream driver is more or less unchanged when the modparam is off, which is the default.
So upstream users will see no performance regression... but don't try to offline a cpu or you will see
a panic. That is the state of the current upstream driver.

Thank you for taking the time to review and for giving your valuable opinion.
There are two reasons why I chose the modparam approach:
1) As you rightly guessed, the performance penalty is high for a few RAID level configurations, which is not desirable.
2) Not many people use the CPU offlining feature as part of their regular usage; it is mostly for admin purposes.

These two reasons made me opt for the modparam.
We and our folks at Red Hat did venture into a few other options, but this seemed like the best fit.

Another option we considered was making this a Kconfig option; we have a patch that replaces the modparam with one.

However, I agree it would be better to just fix the driver, performance impact notwithstanding, and ship it. For
my part, I'd rather have a correctly functioning driver that's slower but doesn't panic.

It's really up to the upstream community.  We need to understand what they want.

/John






