On Tue, Nov 10, 2009 at 7:30 AM, Shameem Ahamed <shameem.ahamed@xxxxxxxxx> wrote:
Hi Ed, Shailesh,
Thanks for the replies.
I have gone through the handle_pte_fault function in memory.c
It seems like it handles VM page faults. I am more concerned with the physical page faults. As I can see from the code, it allocates new pages for the VMA, but if the VMA is backed by a disk file, the contents of the file should also be read into RAM. VM_FAULT_MINOR and VM_FAULT_MAJOR are related to VM minor and major faults.
I want to get more information regarding the physical page faults. Once a process is created, VMAs for the process are created, VM pages are allocated on demand (when a fault occurs), and the data is read from disk to RAM if it is not already present.
E.g.: I am running an application called EG. When EG is started, VMAs for EG will be created, virtual pages will be allocated, and the text, data, and other required parts of EG will be loaded and mapped to the virtual pages.
I am looking for the function which copies pages from disk to RAM.
Can anyone please help me?
Sure, but there are many possible answers, and mine is not necessarily correct.
Looking at a dynamic stack trace (captured with stap):
11647 0xffffffff8124d466 : blk_fetch_request+0x0/0x46 [kernel]
11648 0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
11649 0xffffffff8124dffd : __blk_run_queue+0x12c/0x2cc [kernel]
11650 0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
11651 0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
11652 0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
11653 0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
11654 0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
11655 0xffffffff81150fb2 : get_super+0x39/0x112 [kernel] (inexact)
11656 0xffffffff81187fb7 : flush_disk+0x1d/0xc8 [kernel] (inexact)
11657 0xffffffff811880d8 : check_disk_change+0x76/0x87 [kernel] (inexact)
11658 0xffffffff8105c536 : finish_task_switch+0x4f/0x151 [kernel] (inexact)
11659 0xffffffff81503990 : thread_return+0x115/0x17e [kernel] (inexact)
11660 0xffffffff812644db : kobject_get+0x28/0x37 [kernel] (inexact)
11661 0xffffffff81189557 : __blkdev_get+0xf5/0x4e9 [kernel] (inexact)
From the above, blk_fetch_request() dequeues the request and hands it off via blk_start_request(), which is the start of the I/O processing:
/**
 * blk_start_request - start request processing on the driver
 * @req: request to dequeue
 *
 * Description:
 *     Dequeue @req and start timeout timer on it. This hands off the
 *     request to the driver.
 *
 *     Block internal functions which don't want to start timer should
 *     call blk_dequeue_request().
 *
 * Context:
 *     queue_lock must be held.
 */
void blk_start_request(struct request *req)
{
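To make that concrete, here is a minimal sketch of how a simple block driver's request function typically drives this path. This is only an illustration, not code from the kernel: my_dev_transfer() and the device handle are made-up names, error handling is stripped down, and blk_fetch_request() is essentially blk_peek_request() followed by blk_start_request().

#include <linux/blkdev.h>

/* hypothetical helper: moves 'bytes' bytes at 'sector' to/from 'buf' */
extern void my_dev_transfer(void *dev, sector_t sector, unsigned int bytes,
                            char *buf, int write);

static void my_request_fn(struct request_queue *q)
{
        struct request *req;

        /* dequeue the next request and start its timeout timer */
        while ((req = blk_fetch_request(q)) != NULL) {
                if (!blk_fs_request(req)) {     /* not a normal fs read/write */
                        __blk_end_request_all(req, -EIO);
                        continue;
                }

                /* transfer one segment at a time; req->buffer points at the
                   current segment's data, and __blk_end_request_cur()
                   advances to the next segment until the request is done */
                do {
                        my_dev_transfer(req->rq_disk->private_data,
                                        blk_rq_pos(req),        /* start sector */
                                        blk_rq_cur_bytes(req),  /* segment size */
                                        req->buffer,
                                        rq_data_dir(req));      /* READ/WRITE   */
                } while (__blk_end_request_cur(req, 0));
        }
}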
The blk_* APIs are declared in include/linux/blkdev.h - take a look and you can see that the APIs are based on sectors:
extern struct request_queue *blk_init_queue_node(request_fn_proc *rfn,
                                                 spinlock_t *lock, int node_id);
extern struct request_queue *blk_init_queue(request_fn_proc *, spinlock_t *);
extern void blk_cleanup_queue(struct request_queue *);
extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
extern void blk_queue_bounce_limit(struct request_queue *, u64);
extern void blk_queue_max_sectors(struct request_queue *, unsigned int);
extern void blk_queue_max_hw_sectors(struct request_queue *, unsigned int);
extern void blk_queue_max_phys_segments(struct request_queue *, unsigned short);
extern void blk_queue_max_hw_segments(struct request_queue *, unsigned short);
extern void blk_queue_max_segment_size(struct request_queue *, unsigned int);
extern void blk_queue_max_discard_sectors(struct request_queue *q,
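For instance, a driver would typically create its request queue and describe its hardware limits with these calls, roughly like the hypothetical snippet below (reusing my_request_fn from the earlier sketch; the lock and the particular limit values are made up):

#include <linux/blkdev.h>
#include <linux/spinlock.h>

static spinlock_t my_lock;              /* becomes the queue_lock */
static struct request_queue *my_queue;

static int my_setup_queue(void)
{
        spin_lock_init(&my_lock);

        /* my_request_fn() will be called to process queued requests */
        my_queue = blk_init_queue(my_request_fn, &my_lock);
        if (!my_queue)
                return -ENOMEM;

        blk_queue_max_sectors(my_queue, 256);       /* per-request limit, in 512-byte sectors */
        blk_queue_max_phys_segments(my_queue, 32);  /* scatter/gather segment counts */
        blk_queue_max_hw_segments(my_queue, 32);
        blk_queue_max_segment_size(my_queue, 64 * 1024);   /* per-segment limit, in bytes */
        blk_queue_bounce_limit(my_queue, BLK_BOUNCE_HIGH);  /* bounce highmem pages */

        return 0;
}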
Another example of a function that works in units of sectors is read_dev_sector():
unsigned char *read_dev_sector(struct block_device *bdev, sector_t n, Sector *p)
{
        struct address_space *mapping = bdev->bd_inode->i_mapping;
        struct page *page;

        page = read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_CACHE_SHIFT-9)),
                                 NULL);
        if (!IS_ERR(page)) {
                if (PageError(page))
                        goto fail;
                p->v = page;
                return (unsigned char *)page_address(page) + ((n & ((1 << (PAGE_CACHE_SHIFT - 9)) - 1)) << 9);
fail:
                page_cache_release(page);
        }
        p->v = NULL;
        return NULL;
}
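As a quick usage sketch (loosely modeled on how the partition-scanning code consumes this interface; my_read_sector() is a made-up wrapper, and I am assuming the Sector/read_dev_sector()/put_dev_sector() declarations from linux/genhd.h of that era):

#include <linux/fs.h>           /* struct block_device */
#include <linux/genhd.h>        /* Sector, read_dev_sector(), put_dev_sector() */
#include <linux/string.h>
#include <linux/errno.h>

/* copy one 512-byte sector 'n' of 'bdev' into 'out' */
static int my_read_sector(struct block_device *bdev, sector_t n,
                          unsigned char *out)
{
        Sector sect;
        unsigned char *data;

        /* returns a pointer into the page-cache page backing sector n;
           read_mapping_page() underneath does the actual disk read */
        data = read_dev_sector(bdev, n, &sect);
        if (!data)
                return -EIO;

        memcpy(out, data, 512);
        put_dev_sector(sect);   /* drop the page reference taken above */
        return 0;
}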
From the above, read_mapping_page() reads the data through the address_space "mapping". These mappings (which also carry the byte/sector offset information to read) ultimately end up going through the hardware driver's API (depending, e.g., on whether the device is SATA or IDE).
In drivers/ata/libata-core.c there is one read function:
/**
 *      sata_scr_read - read SCR register of the specified port
 *      @link: ATA link to read SCR for
 *      @reg: SCR to read
 *      @val: Place to store read value
 *
 *      Read SCR register @reg of @link into *@val. This function is
 *      guaranteed to succeed if @link is ap->link, the cable type of
 *      the port is SATA and the port implements ->scr_read.
 *
 *      LOCKING:
 *      None if @link is ap->link. Kernel thread context otherwise.
 *
 *      RETURNS:
 *      0 on success, negative errno on failure.
 */
int sata_scr_read(struct ata_link *link, int reg, u32 *val)
{
        if (ata_is_host_link(link)) {
                if (sata_scr_valid(link))
                        return link->ap->ops->scr_read(link, reg, val);
                return -EOPNOTSUPP;
        }

        return sata_pmp_scr_read(link, reg, val);
}
This reads through the hardware-specific function pointer scr_read(). For example, for Marvell SATA controllers it is implemented as mv_scr_read() (in drivers/ata/sata_mv.c):
static int mv_scr_read(struct ata_link *link, unsigned int sc_reg_in, u32 *val)
{
        unsigned int ofs = mv_scr_offset(sc_reg_in);

        if (ofs != 0xffffffffU) {
                *val = readl(mv_ap_base(link->ap) + ofs);
                return 0;
        } else
                return -EINVAL;
}
which is the final 4-byte readl() from the controller's memory-mapped registers (the actual data is DMA'd from/to the hard disk, so effectively this path ends up reading from the hard disk).
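The scr_read indirection works because each low-level driver fills in an ata_port_operations table; the tables in sata_mv.c point their .scr_read/.scr_write entries at mv_scr_read()/mv_scr_write(). Very roughly, and only as an illustrative fragment (my_scr_read/my_scr_write are placeholder names, and a real table contains many more callbacks):

#include <linux/libata.h>

static int my_scr_read(struct ata_link *link, unsigned int sc_reg, u32 *val);
static int my_scr_write(struct ata_link *link, unsigned int sc_reg, u32 val);

static struct ata_port_operations my_sata_ops = {
        /* sata_scr_read()/sata_scr_write() in libata-core.c end up calling
           these through link->ap->ops->scr_read()/scr_write() */
        .scr_read       = my_scr_read,
        .scr_write      = my_scr_write,
};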
More info can be found in the libata developer's guide.
More stap traces show a lot more variation in how a read can be triggered:
7243 kblockd/0(132): <- blk_fetch_request
0 hald-addon-stor(2412): -> blk_fetch_request
0xffffffff8124d466 : blk_fetch_request+0x0/0x46 [kernel]
0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
0xffffffff8124dffd : __blk_run_queue+0x12c/0x2cc [kernel]
0xffffffff81241245 : elv_insert+0xea/0x2db [kernel]
0xffffffff81241516 : __elv_add_request+0xe0/0xef [kernel]
0xffffffff81253fb9 : blk_execute_rq_nowait+0xb1/0x109 [kernel]
0xffffffff812540b5 : blk_execute_rq+0xa4/0xdf [kernel]
0xffffffff8124d7d5 : blk_put_request+0x57/0x66 [kernel] (inexact)
0xffffffff81259dfd : scsi_cmd_ioctl+0x755/0x771 [kernel] (inexact)
0xffffffff812644db : kobject_get+0x28/0x37 [kernel] (inexact)
0xffffffff81257440 : get_disk+0x108/0x13b [kernel] (inexact)
0xffffffff81189847 : __blkdev_get+0x3e5/0x4e9 [kernel] (inexact)
0xffffffff81255c58 : __blkdev_driver_ioctl+0x80/0xb1 [kernel] (inexact)
0xffffffff81256abb : blkdev_ioctl+0xd8a/0xdb0 [kernel] (inexact)
6670 hald-addon-stor(2412): <- blk_fetch_request
0 hald-addon-stor(2412): -> blk_fetch_request
0xffffffff8124d466 : blk_fetch_request+0x0/0x46 [kernel]
0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
0xffffffff8124dffd : __blk_run_queue+0x12c/0x2cc [kernel]
0xffffffff81241245 : elv_insert+0xea/0x2db [kernel]
0xffffffff81241516 : __elv_add_request+0xe0/0xef [kernel]
0xffffffff81253fb9 : blk_execute_rq_nowait+0xb1/0x109 [kernel]
0xffffffff812540b5 : blk_execute_rq+0xa4/0xdf [kernel]
0xffffffff81189847 : __blkdev_get+0x3e5/0x4e9 [kernel] (inexact)
0xffffffff81256abb : blkdev_ioctl+0xd8a/0xdb0 [kernel] (inexact)
0xffffffff8118996b : blkdev_open+0x0/0x107 [kernel] (inexact)
0xffffffff8115f8d3 : do_filp_open+0x839/0xfd7 [kernel] (inexact)
0xffffffff813471bc : put_device+0x25/0x2e [kernel] (inexact)
0xffffffff81189284 : __blkdev_put+0xea/0x20f [kernel] (inexact)
0xffffffff811893c0 : blkdev_put+0x17/0x20 [kernel] (inexact)
0xffffffff8118941a : blkdev_close+0x51/0x5d [kernel] (inexact)
0xffffffff8114fed2 : __fput+0x1bb/0x308 [kernel] (inexact)
Regards,
Shameem
----- Original Message ----
> From: shailesh jain <coolworldofshail@xxxxxxxxx>
> To: Ed Cashin <ecashin@xxxxxxxxxx>
> Cc: kernelnewbies@xxxxxxxxxxxx
> Sent: Tue, November 10, 2009 3:23:59 AM
> Subject: Re: Difference between major page fault and minor page fault
>
> Minor faults can occur at many places:
>
> 1) Shared pages among processes / swap cache - A process can take a
> page fault when the page is already present in the swap cache. This
> will be a minor fault since you will not go to disk.
>
> 2) COW but no fork - Memory is not allocated initially for malloc. It
> will point to the global zero page; however, when the process attempts to
> write to it you will get a minor page fault.
>
> 3) Stack is expanding. Check if the fault occurred close to the
> bottom of the stack; if yes, then allocate a page under the assumption
> that the stack is expanding.
>
> 4) Vmalloc address space. A page fault can occur in the kernel address space
> for the vmalloc area. When this happens you sync up the process's page tables
> with the master page table (init_mm).
>
> 5) COW for fork.
>
>
> Shailesh Jain
>
> On Mon, Nov 9, 2009 at 7:08 AM, Ed Cashin wrote:
> > Shameem Ahamed writes:
> >
> >> Hi,
> >>
> >> Can anyone explain the difference between major and minor page faults.
> >>
> >> As far as I know, a major page fault involves a disk access to retrieve
> >> the data, while a minor page fault occurs mainly for COW pages. Are there
> >> any instances other than COW where there will be a minor page fault?
> >> Which kernel function handles the major page fault?
> >
> > Ignoring error cases, arch/x86/mm/fault.c:do_page_fault calls
> > mm/memory.c:handle_mm_fault and looks for the flags, VM_FAULT_MAJOR or
> > VM_FAULT_MINOR in the returned value, so the definitive answer is in
> > how that return value gets set. The handle_mm_fault value comes from
> > called function hugetlb_fault or handle_pte_fault (again, ignoring
> > error conditions). I'd suggest starting your inquiry by looking at
> > the logic in handle_pte_fault.
> >
> >> Also, can anyone please confirm that in 2.6 kernels, the page cache and
> >> buffer cache are unified? Now we have only one cache, which includes
> >> both the buffer cache and the page cache.
> >
> > They were last I heard. Things move so fast these days that I can't
> > keep up! :)
> >
--
Regards,
Peter Teoh