On 13 Jun 2022, at 19:47, Guo Ren wrote:
On Tue, Jun 14, 2022 at 3:49 AM Zi Yan <ziy@xxxxxxxxxx> wrote:
On 13 Jun 2022, at 12:32, Guo Ren wrote:
On Mon, Jun 13, 2022 at 11:23 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
Hi Xianting,
Thanks for your patch.
On 13 Jun 2022, at 9:10, Xianting Tian wrote:
Commit 787af64d05cd ("mm: page_alloc: validate buddy before check its migratetype.")
added buddy check code. But unfortunately, this fix wasn't backported to
linux-5.17.y and the earlier stable branches. The reason is that it added
the wrong fixes message:
Fixes: 1dd214b8f21c ("mm: page_alloc: avoid merging non-fallbackable
pageblocks with others")
No, the Fixes tag is right. The commit above does need to validate buddy.
I think Xianting is right. The "Fixes:" tag is not accurate and the
page_is_buddy() check is necessary here.
This patch could be applied to the early version of the stable tree
(eg: Linux-5.10.y, not the master tree)
This is quite misleading. That commit 787af64d05cd applies does not mean it is
intended to fix the preexisting bug. Also it does not apply cleanly
to commit d9dddbf55667, there is a clear indentation mismatch. At best,
you can say the way of 787af64d05cd fixing 1dd214b8f21c also fixes d9dddbf55667.
There is no way you can apply 787af64d05cd to earlier trees and call it a day.
You can mention that 787af64d05cd fixes a bug in 1dd214b8f21c and that there is
a similar bug in d9dddbf55667 that can be fixed in a similar way too. Saying
the fixes message is wrong just misleads people, making them think there is
no bug in 1dd214b8f21c. We need to be clear about this.
First, d9dddbf55667 is earlier than 1dd214b8f21c in Linus' tree. The
original fix could cover the Linux-5.0.y tree if it gave the
accurate commit number, and that is the point we want to make.
Yes, I got that d9dddbf55667 is earlier and commit 787af64d05cd fixes
the issue introduced by d9dddbf55667. But my point is that 787af64d05cd
is not intended to fix d9dddbf55667 and saying it has a wrong fixes
message is misleading. This is the point I want to make.
Second, if the patch is for d9dddbf55667 then it could cover any tree
in the stable repo. Actually, we only know Linux-5.10.y has the
problem.
But it is not and does not apply to d9dddbf55667 cleanly.
Maybe, Gregkh could help to direct us on how to deal with the issue:
(Fixup a bug which only belongs to the former stable branch.)
I think sending this patch without saying “commit
787af64d05cd fixes message is wrong” would be a good start. You also
need an extra fix to mm/page_isolation.c for kernels between 5.15 and 5.17
(inclusive). So there will need to be two patches:
1) your patch to stable tree prior to 5.15 and
2) your patch with an additional mm/page_isolation.c fix to stable tree
between 5.15 and 5.17.
Also, you will need to fix the mm/page_isolation.c code too to make this patch
complete, unless you can show that PFN=0x1000 is never going to be encountered
in the mm/page_isolation.c code I mentioned below.
No, we needn't fix mm/page_isolation.c in linux-5.10.y, because it had
pfn_valid_within(buddy_pfn) check after __find_buddy_pfn() to prevent
buddy_pfn=0.
The root cause comes from __find_buddy_pfn():
return page_pfn ^ (1 << order);
Right. But pfn_valid_within() was removed in 5.15. So your fix is
required for kernels between 5.15 and 5.17 (inclusive).
When page_pfn equals the order size (1 << order), it returns the
previous buddy, not the next. That is the only exception for this
algorithm, right?
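
The XOR in __find_buddy_pfn() quoted above can be tried in userspace. The following is a minimal sketch, not the kernel source: the function name find_buddy_pfn and the order value 12 used in the comments are assumptions, chosen only so that 1 << order equals the PFN 0x1000 mentioned in this thread.

```c
/*
 * Userspace model of the XOR quoted from __find_buddy_pfn().
 * find_buddy_pfn is a hypothetical name for this sketch; it is
 * not the kernel function itself.
 */
static unsigned long find_buddy_pfn(unsigned long page_pfn, unsigned int order)
{
	/* Flip bit `order` of the PFN: set -> cleared, cleared -> set. */
	return page_pfn ^ (1UL << order);
}

/*
 * Illustrative values (order 12 is an assumption for this sketch):
 *   find_buddy_pfn(0x1000, 12) == 0x0     buddy lies below the page
 *   find_buddy_pfn(0x0,    12) == 0x1000  buddy lies above the page
 *   find_buddy_pfn(0x2000, 12) == 0x3000  buddy lies above the page
 */
```

Whether the computed buddy lies below or above the page depends only on whether bit `order` of page_pfn is set, so a page whose PFN equals 1 << order gets buddy PFN 0, which may not be backed by memory at all.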
In fact, the bug takes a very long time to reproduce and is not easy to
debug, so we want to contribute it to the community to save others
from wasting time, although there is no new patch at all.
Thanks for reporting the bug and sending out the patch. I really
appreciate it. We definitely need your input. Throughout the email
thread, I am trying to help you clarify the bug and how to fix it
properly:
1. Commit 787af64d05cd does not apply cleanly to commit
d9dddbf55667, meaning you cannot just cherry-pick that commit to
fix the issue. That is why we need your patch to fix the issue.
And saying it has a wrong fixes message in this patch’s git log is
misleading.
2. For kernels between 5.15 and 5.17 (inclusive), an additional fix
to mm/page_isolation.c is also needed, since pfn_valid_within() was
removed in 5.15 and the issue can appear during page isolation.
3. For kernels before 5.15, this patch will apply.
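
The idea behind validating the buddy before using it can be modeled in userspace. This is only a sketch under stated assumptions: struct toy_zone, toy_pfn_valid(), toy_find_buddy_pfn() and toy_buddy_usable() are hypothetical names, and the range check stands in for the kernel's real validation (pfn_valid_within() in older trees, page_is_buddy() in the fix), which works differently.

```c
#include <stdbool.h>

/* Hypothetical model: the only valid PFNs are [start_pfn, end_pfn). */
struct toy_zone {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/* Stand-in for a PFN validity check; not the kernel's implementation. */
static bool toy_pfn_valid(const struct toy_zone *zone, unsigned long pfn)
{
	return pfn >= zone->start_pfn && pfn < zone->end_pfn;
}

/* Same XOR as quoted in the thread, under a hypothetical name. */
static unsigned long toy_find_buddy_pfn(unsigned long page_pfn,
					unsigned int order)
{
	return page_pfn ^ (1UL << order);
}

/*
 * Compute the buddy PFN and report it only if it is valid.
 * Returns false, leaving *buddy_pfn untouched, when the candidate
 * falls outside the modeled memory range.
 */
static bool toy_buddy_usable(const struct toy_zone *zone,
			     unsigned long page_pfn, unsigned int order,
			     unsigned long *buddy_pfn)
{
	unsigned long candidate = toy_find_buddy_pfn(page_pfn, order);

	if (!toy_pfn_valid(zone, candidate))
		return false;

	*buddy_pfn = candidate;
	return true;
}
```

For a modeled zone covering PFNs 0x1000 up to 0x4000, a page at PFN 0x1000 with order 12 has candidate buddy 0, which the check rejects, while a page at PFN 0x2000 with the same order gets the valid buddy 0x3000.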