BUG root-caused: careless processing of pagevec causes "Bad page states"

Valery Podrezov <pvadop@xxxxxxxxx> · Wed, 20 Feb 2013 17:16:43 +0300

SUMMARY: careless processing of pagevec causes "Bad page states"

I have the messages "BUG: Bad page state in process.." in SMP mode with two cpus (kernel 3.3).

I have root-caused the problem, see description below.

I have prepared the temporary workaround, it helps to eliminate the problem and demonstrates additionally the essence of the problem.

The following sections are provided below:

    DESCRIPTION

    ENVIRONEMENT

    OOPS-messages

    WORKAROUND

Is it a known issue and is there already the patch properly fixing it?

Feel free to ask me any questions.

Best Regards,

 Valery Podrezov

DESCRIPTION:

There is how the problem is generated

(PFN0 refers the problematical physical page,

(1) and (2) are successive points of execution):

1. cpu 0: ...

   cpu 1: is running the user process (PROC0)

          Gets the new page with the PFN0 from free list by alloc_page_vma()

          Runs page_add_new_anon_rmap(), thus the page PFN0 occurs in 
pagevec of this cpu (it is 5-th): pvec = 
&get_cpu_var(lru_add_pvecs)[lru];

          Runs fork (PROC1 - the generated child process)

          The page PFN0 is present in the page tables of the child process PROC1 (it is read-only, to be COWed)

2. cpu 0: is running PROC1

          writes to the virtual address (VA1) translated through its page tables to the PFN0

          do_page_fault (data) on VA1 (physical page is present in the page tables of the process, but no write permissions)

   cpu 1: is running PROC1

          do_page_fault (data) on some virtual address (no page in page tables)

          Gets the new page from free list by alloc_page_vma()

          Runs page_add_new_anon_rmap(), then __lru_cache_add()

          This new page is just 14-th in pagevec of this cpu, so runs __pagevec_lru_add(),

          then pagevec_lru_move_fn() and, finally, __pagevec_lru_add_fn()

There are no common locks at this point applied for both processes simultaneously,

these locks are applied: 

   core 0: PROC0->mm->mmap_sem

           PFN0->flags PG_locked (lock_page)

   core 1: PROC1->mm->mmap_sem (!= PROC0->mm->mmap_sem)

           PFN0->zone->lru_lock

The more detailed timing below of point (2) for both cpus

shows how the bit PG_locked is mistakenly generated for the PFN0.

   Both cpus are processing do_page_fault() (see above)

   Both cpus are in the same routine do_wp_page()

   a) cpu 0: locks the page by trylock_page(old_page) (it is just the page with PFN0)

   b) cpu 1: is processing __pagevec_lru_add_fn()

             Reads page->flags of its 5-th element of pagevec (it is PFN0 page, it contains PG_locked set to 1, see (a))

   c) cpu 0: unlocks the page by unlock_page(old_page) (reset the bit PG_locked of PFN0 page)

   d) cpu 1: executes SetPageLRU(page) in __pagevec_lru_add_fn() and thus sets not only PG_lru

             bit of PFN0 page but, mistakenly, the bit PG_locked too

This leads to "BUG: Bad page state" later while releasing PFN0 page because of PG_locked bit present in flags of PFN0 page.

ENVIRONMENT:

   Linux kernel-3.3

OOPS-messages:

BUG: Bad page state in process runt_cj.sh  pfn:7fcd9

page:c05f9b20 count:0 mapcount:0 mapping:  (null) index:0xbfffd

page flags: 0x80080009(locked|uptodate|swapbacked)

Modules linked in:

Call Trace:

 [<00000000c1098d78>] dump_page+0x10c/0x120

 [<00000000c1098f50>] bad_page+0x1c4/0x1f4

 [<00000000c1099060>] free_pages_prepare+0xe0/0x10c

 [<00000000c109afd0>] free_hot_cold_page+0x38/0x2c8

 [<00000000c109b538>] free_hot_cold_page_list+0x38/0x64

 [<00000000c10a12f8>] release_pages+0x1e0/0x2cc

 [<00000000c10cdffc>] free_pages_and_swap_cache+0xa4/0x154

 [<00000000c10b49a0>] tlb_flush_mmu+0x98/0xcc

 [<00000000c10b49e4>] tlb_finish_mmu+0x10/0x54

 [<00000000c10c08a0>] exit_mmap+0x11c/0x168

 [<00000000c101988c>] mmput+0x5c/0x164

 [<00000000c10e85c0>] flush_old_exec+0x7d4/0xacc

 [<00000000c114ac24>] load_elf_binary+0x534/0x2514

 [<00000000c11c7158>] __up_read+0x20/0x108

 [<00000000c11cde48>] __va_probe_existent_region+0x164/0x190

 [<00000000c11ce098>] generic_copy_from_user+0xb4/0xd0

 [<00000000c10e7c10>] copy_strings+0x4d8/0x66c

 [<00000000c10e68ec>] search_binary_handler+0x110/0x488

 [<00000000c10e97f0>] do_execve+0x584/0x6a8

 [<00000000c10017c4>] sys_execve+0x38/0x104

 [<00000000c1013aec>] stub_execve+0x14/0x18

 [<00000000c100f1b4>] go_scall+0x30/0x38

Disabling lock debugging due to kernel taint

WORKAROUND:

I don't consider it as a potential patch at least because it doesn't support properly

the "WARNING, pagevec_add: no space in pvec" conditions, as well, it can impact performance, etc..

It requires further investigations.

Nevertheless, it helped me temporary not to stick in the problem.

There are the changed things per-files below.

linux-3.3/include/linux/pagevec.h:

/* 14 pointers + two long's align the pagevec structure to a power of two */

// #define PAGEVEC_SIZE    14

#define PAGEVEC_SIZE    (14 + 5*16)

static inline unsigned pagevec_add(struct pagevec *pvec, struct page *page)

{

    if (pvec->nr >= PAGEVEC_SIZE) {

        early_printk("WARNING, pagevec_add: no space in pvec 0x%lx, the 
page=0x%lx ????????????????!!!!!!!!!!!!!!!!\n", pvec, page);

        return (0);

    }

    pvec->pages[pvec->nr++] = page;

    return pagevec_space(pvec);

}

linux-3.3/mm/swap.c:

static void pagevec_lru_move_fn(struct pagevec *pvec,

                int (*move_fn)(struct page *page, void *arg),

                void *arg)

{

    int i;

    struct zone *zone = NULL;

    unsigned long flags = 0;

int processed;

struct page *page;

int slots_available = -1;

int not_processed_index = 0;

struct page *not_processed_pages[PAGEVEC_SIZE];

int processed_index = 0;

struct page *processed_pages[PAGEVEC_SIZE];

    for (i = 0; i < pagevec_count(pvec); i++) {

        struct page *page = pvec->pages[i];

        struct zone *pagezone = page_zone(page);

        if (pagezone != zone) {

            if (zone)

                spin_unlock_irqrestore(&zone->lru_lock, flags);

            zone = pagezone;

            spin_lock_irqsave(&zone->lru_lock, flags);

        }

        // (*move_fn)(page, arg);

if (trylock_page(page)) {

    (*move_fn)(page, arg);

    unlock_page(page);

    processed = 1;

} else {

    processed = 0;

}

if (processed) {

    processed_pages[processed_index++] = page;

} else {

    not_processed_pages[not_processed_index++] = page;

}

    }

    if (zone)

        spin_unlock_irqrestore(&zone->lru_lock, flags);

    // release_pages(pvec->pages, pvec->nr, pvec->cold);

if (processed_index) {

    release_pages(processed_pages, processed_index, pvec->cold);

}

    pagevec_reinit(pvec);

if (not_processed_index) {

    for (i = 0; i < not_processed_index; i++) {

        page = not_processed_pages[i];

        slots_available = pagevec_add(pvec, page);

    }

}

}

----<end>