On 9/28/21 6:24 AM, Song Liu wrote:
On Fri, Sep 24, 2021 at 12:12 AM Rongwei Wang
<rongwei.wang@xxxxxxxxxxxxxxxxx> wrote:
On 9/24/21 10:43 AM, Andrew Morton wrote:
On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@xxxxxxxxxxxxxxxxx> wrote:
On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
Transparent huge page has supported read-only non-shmem files. The file-
backed THP is collapsed by khugepaged and truncated when written (for
shared libraries).
However, there is race in two possible places.
1) multiple writers truncate the same page cache concurrently;
2) collapse_file rolls back when writer truncates the page cache;
As I've said before, the bug here is that somehow there is a writable fd
to a file with THPs. That's what we need to track down and fix.
Hi, Matthew
I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
...
https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@xxxxxxxxxxxxxxxxxxxx/
All in all, what you mean is that we should solve this race at the source?
Matthew is being pretty clear here: we shouldn't be permitting
userspace to get a writeable fd for a thp-backed file.
Why are we permitting the DSO to be opened writeably? If there's a
legitimate case for doing this then presumably "mm, thp: relax the
There is a use case to stress file-backed THP within attachment.
I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS:
$ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c
$ ulimit -s unlimited
$ ./stress_madvise_dso 10000 <libtest.so>
the meaning of above parameters:
10000: the max test time;
<libtest.so>: the DSO that will been mapped into file-backed THP by
madvise. It recommended that the text segment of DSO to be tested is
greater than 2M.
The crash will been triggered at once in the latest kernel. And this
case also can used to trigger the bug that mentioned in our another patch.
Hmm.. I am not able to use the repro program to crash the system. Not
sure what I did wrong.
Hi
I have tried to check my test case again. Can you make sure the DSO that
you test have THP mapping?
If you are willing to try again, I can send my libtest.c which is used
to test by myself (actually, it shouldn't be target DSO problem).
Thanks very much!
OTOH, does it make sense to block writes within khugepaged, like:
diff --git i/mm/khugepaged.c w/mm/khugepaged.c
index 045cc579f724e..ad7c41ec15027 100644
--- i/mm/khugepaged.c
+++ w/mm/khugepaged.c
@@ -51,6 +51,7 @@ enum scan_result {
SCAN_CGROUP_CHARGE_FAIL,
SCAN_TRUNCATED,
SCAN_PAGE_HAS_PRIVATE,
+ SCAN_BUSY_WRITE,
};
#define CREATE_TRACE_POINTS
@@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm,
/* Only allocate from the target node */
gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
+ if (deny_write_access(file)) {
+ result = SCAN_BUSY_WRITE;
+ return;
+ }
+
This can indeed avoid some possible races from source.
But, I am thinking about whether this will lead to DDoS attack?
I remember the reason of DSO has ignored MAP_DENYWRITE in kernel
is that DDoS attack. In addition, 'deny_write_access' will change
the behavior, such as user will get 'Text file busy' during
collapse_file. I am not sure whether the behavior changing is acceptable
in user space.
If it is acceptable, I am very willing to fix the races like your way.
Thanks!
new_page = khugepaged_alloc_page(hpage, gfp, node);
if (!new_page) {
result = SCAN_ALLOC_HUGE_PAGE_FAIL;
@@ -1863,19 +1869,6 @@ static void collapse_file(struct mm_struct *mm,
else {
__mod_lruvec_page_state(new_page, NR_FILE_THPS, nr);
filemap_nr_thps_inc(mapping);
- /*
- * Paired with smp_mb() in do_dentry_open() to ensure
- * i_writecount is up to date and the update to nr_thps is
- * visible. Ensures the page cache will be truncated if the
- * file is opened writable.
- */
- smp_mb();
- if (inode_is_open_for_write(mapping->host)) {
- result = SCAN_FAIL;
- __mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr);
- filemap_nr_thps_dec(mapping);
- goto xa_locked;
- }
}
if (nr_none) {
@@ -1976,6 +1969,7 @@ static void collapse_file(struct mm_struct *mm,
VM_BUG_ON(!list_empty(&pagelist));
if (!IS_ERR_OR_NULL(*hpage))
mem_cgroup_uncharge(*hpage);
+ allow_write_access(file);
/* TODO: tracepoints */
}
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
/* "volatile" to forbid compiler optimization */
volatile static int x;
#define DO_0 ++x;
#define DO_1 {DO_0; DO_0; DO_0; DO_0; DO_0; DO_0; DO_0; DO_0; DO_0}
#define DO_2 {DO_1; DO_1; DO_1; DO_1; DO_1; DO_1; DO_1; DO_1; DO_1}
#define DO_3 {DO_2; DO_2; DO_2; DO_2; DO_2; DO_2; DO_2; DO_2; DO_2}
#define DO_4 {DO_3; DO_3; DO_3; DO_3; DO_3; DO_3; DO_3; DO_3; DO_3}
#define DO_5 {DO_4; DO_4; DO_4; DO_4; DO_4; DO_4; DO_4; DO_4; DO_4}
#define DO_6 {DO_5; DO_5; DO_5; DO_5; DO_5; DO_5; DO_5; DO_5; DO_5}
#define DO_7 {DO_6; DO_6; DO_6; DO_6; DO_6; DO_6; DO_6; DO_6; DO_6}
void libtest_work1(void)
{
printf("work 1\n");
DO_0;
}
void libtest_work2(void)
{
printf("work 2\n");
DO_2;
}
void libtest_work3(void)
{
printf("work 3\n");
DO_4;
}
void libtest_work4(void)
{
printf("work 4\n");
DO_6;
}