On 22.05.2024 07:37, Zhaoyang Huang wrote:
On Tue, May 21, 2024 at 11:47 PM Marcin Wanat <private@xxxxxxxxxxxxxx> wrote:
On 21.05.2024 03:00, Zhaoyang Huang wrote:
On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang <huangzhaoyang@xxxxxxxxx> wrote:
On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <private@xxxxxxxxxxxxxx> wrote:
On 15.04.2024 03:50, Zhaoyang Huang wrote:
I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
8/9 kernels are NOT affected, and the long-term kernel 5.15.X is NOT affected.
However, with long-term kernels 6.1.XX and 6.6.XX
(at least 10 different versions tested), this lockup always appears
after 2-30 days, similar to the report in the original thread.
The more load (for example, copying a lot of local files while
serving 20Gbps traffic), the higher the chance that the bug will appear.
I haven't been able to reproduce this during synthetic tests,
but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
If anyone can provide a patch, I can test it on multiple machines
over the next few days.
Could you please try this one, which can be applied to 6.6 directly? Thank you!
URL: https://lore.kernel.org/linux-mm/20240412064353.133497-1-zhaoyang.huang@xxxxxxxxxx/
Unfortunately, I am unable to cleanly apply this patch against the
latest 6.6.31.
Please try the one below, which works on my v6.6-based Android. Thank you
in advance for testing :D
mm/huge_memory.c | 22 ++++++++++++++--------
1 file changed, 14 insertions(+), 8 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
I have compiled 6.6.31 with this patch and will test it on multiple
machines over the next 30 days. I will provide an update after 30 days
if everything is fine or sooner if any of the hosts experience the same
soft lockup again.
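
For anyone who wants to watch a fleet for the same symptom during a test
window like this, here is a minimal sketch of a log scan. It is not part
of the patch or of the test setup described above. It assumes the usual
kernel watchdog message format ("BUG: soft lockup - CPU#N stuck for Ns!
[comm:pid]"); the function name find_soft_lockups is made up for
illustration.

```python
import re

# Assumed format of the kernel watchdog soft-lockup line, for example:
#   watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kworker/3:1:456]
# The task name (comm) may itself contain ':' so the pid is taken as the
# digits after the last ':' inside the brackets.
SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]*):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text):
    """Return one dict per soft-lockup line found in the given log text."""
    hits = []
    for line in log_text.splitlines():
        m = SOFT_LOCKUP_RE.search(line)
        if m:
            hits.append(
                {
                    "cpu": int(m.group("cpu")),
                    "seconds": int(m.group("secs")),
                    "comm": m.group("comm"),
                    "pid": int(m.group("pid")),
                }
            )
    return hits
```

The text passed in could be the captured output of dmesg or journalctl -k
on each host. An empty result only means no matching line was logged, not
that the host is healthy.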