On Tue, Nov 10, 2009 at 7:30 AM, Shameem Ahamed <shameem.ahamed@xxxxxxxxx> wrote:
Hi Ed, Shailesh,
Thanks for the replies.
I have gone through the handle_pte_fault function in memory.c
It seems like it handles VM page faults. I am more concerned with the physical page faults. As I can see from the code, it allocates new pages for the VMA, but if the VMA is backed by a disk file, the contents of the file should also be read into RAM. VM_FAULT_MINOR and VM_FAULT_MAJOR are related to VM minor and major faults.
I want to get more information regarding the physical page faults. Once a process is created, VMAs for the process are created, VM pages are allocated on demand (when a fault occurs), and the data is read from disk to RAM if it is not already present.
E.g.: I am running an application called EG. When EG is started, VMAs for EG will be created, virtual pages will be allocated, and the text, data, and other required parts of EG will be loaded and mapped to the virtual pages.
I am looking for the function which copies pages from disk to RAM.
Can anyone please help me?
Sure, but there are many possible answers, and mine is not necessarily correct.
Looking at a dynamic stack trace (captured with stap):
11647 0xffffffff8124d466 : blk_fetch_request+0x0/0x46 [kernel]
11648 0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
11649 0xffffffff8124dffd : __blk_run_queue+0x12c/0x2cc [kernel]
11650 0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
11651 0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
11652 0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
11653 0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
11654 0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
11655 0xffffffff81150fb2 : get_super+0x39/0x112 [kernel] (inexact)
11656 0xffffffff81187fb7 : flush_disk+0x1d/0xc8 [kernel] (inexact)
11657 0xffffffff811880d8 : check_disk_change+0x76/0x87 [kernel] (inexact)
11658 0xffffffff8105c536 : finish_task_switch+0x4f/0x151 [kernel] (inexact)
11659 0xffffffff81503990 : thread_return+0x115/0x17e [kernel] (inexact)
11660 0xffffffff812644db : kobject_get+0x28/0x37 [kernel] (inexact)
11661 0xffffffff81189557 : __blkdev_get+0xf5/0x4e9 [kernel] (inexact)
From the above, blk_fetch_request() dequeues the request and hands it off via blk_start_request(), which is the start of the I/O processing:
/**
 * blk_start_request - start request processing on the driver
 * @req: request to dequeue
 *
 * Description:
 *     Dequeue @req and start timeout timer on it. This hands off the
 *     request to the driver.
 *
 *     Block internal functions which don't want to start timer should
 *     call blk_dequeue_request().
 *
 * Context:
 *     queue_lock must be held.
 */
void blk_start_request(struct request *req)
{
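To make that concrete, here is a minimal sketch of how a simple block driver's request function typically drives this path. This is only an illustration, not code from the kernel: my_dev_transfer() and the device handle are made-up names, error handling is stripped down, and blk_fetch_request() is essentially blk_peek_request() followed by blk_start_request().

#include <linux/blkdev.h>

/* hypothetical helper: moves 'bytes' bytes at 'sector' to/from 'buf' */
extern void my_dev_transfer(void *dev, sector_t sector, unsigned int bytes,
                            char *buf, int write);

static void my_request_fn(struct request_queue *q)
{
        struct request *req;

        /* dequeue the next request and start its timeout timer */
        while ((req = blk_fetch_request(q)) != NULL) {
                if (!blk_fs_request(req)) {     /* not a normal fs read/write */
                        __blk_end_request_all(req, -EIO);
                        continue;
                }

                /* transfer one segment at a time; req->buffer points at the
                   current segment's data, and __blk_end_request_cur()
                   advances to the next segment until the request is done */
                do {
                        my_dev_transfer(req->rq_disk->private_data,
                                        blk_rq_pos(req),        /* start sector */
                                        blk_rq_cur_bytes(req),  /* segment size */
                                        req->buffer,
                                        rq_data_dir(req));      /* READ/WRITE   */
                } while (__blk_end_request_cur(req, 0));
        }
}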
The blk_* APIs are declared in include/linux/blkdev.h - take a look and you can see that the APIs are based on sectors:
extern struct request_queue *blk_init_queue_node(request_fn_proc *rfn,
                                                 spinlock_t *lock, int node_id);
extern struct request_queue *blk_init_queue(request_fn_proc *, spinlock_t *);
extern void blk_cleanup_queue(struct request_queue *);
extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
extern void blk_queue_bounce_limit(struct request_queue *, u64);
extern void blk_queue_max_sectors(struct request_queue *, unsigned int);
extern void blk_queue_max_hw_sectors(struct request_queue *, unsigned int);
extern void blk_queue_max_phys_segments(struct request_queue *, unsigned short);
extern void blk_queue_max_hw_segments(struct request_queue *, unsigned short);
extern void blk_queue_max_segment_size(struct request_queue *, unsigned int);
extern void blk_queue_max_discard_sectors(struct request_queue *q,
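For instance, a driver would typically create its request queue and describe its hardware limits with these calls, roughly like the hypothetical snippet below (reusing my_request_fn from the earlier sketch; the lock and the particular limit values are made up):

#include <linux/blkdev.h>
#include <linux/spinlock.h>

static spinlock_t my_lock;              /* becomes the queue_lock */
static struct request_queue *my_queue;

static int my_setup_queue(void)
{
        spin_lock_init(&my_lock);

        /* my_request_fn() will be called to process queued requests */
        my_queue = blk_init_queue(my_request_fn, &my_lock);
        if (!my_queue)
                return -ENOMEM;

        blk_queue_max_sectors(my_queue, 256);       /* per-request limit, in 512-byte sectors */
        blk_queue_max_phys_segments(my_queue, 32);  /* scatter/gather segment counts */
        blk_queue_max_hw_segments(my_queue, 32);
        blk_queue_max_segment_size(my_queue, 64 * 1024);   /* per-segment limit, in bytes */
        blk_queue_bounce_limit(my_queue, BLK_BOUNCE_HIGH);  /* bounce highmem pages */

        return 0;
}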
Another example of a function that works in units of sectors is read_dev_sector():
unsigned char *read_dev_sector(struct block_device *bdev, sector_t n, Sector *p)
{
        struct address_space *mapping = bdev->bd_inode->i_mapping;
        struct page *page;

        page = read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_CACHE_SHIFT-9)),
                                 NULL);
        if (!IS_ERR(page)) {
                if (PageError(page))
                        goto fail;
                p->v = page;
                return (unsigned char *)page_address(page) + ((n & ((1 << (PAGE_CACHE_SHIFT - 9)) - 1)) << 9);
fail:
                page_cache_release(page);
        }
        p->v = NULL;
        return NULL;
}
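As a quick usage sketch (loosely modeled on how the partition-scanning code consumes this interface; my_read_sector() is a made-up wrapper, and I am assuming the Sector/read_dev_sector()/put_dev_sector() declarations from linux/genhd.h of that era):

#include <linux/fs.h>           /* struct block_device */
#include <linux/genhd.h>        /* Sector, read_dev_sector(), put_dev_sector() */
#include <linux/string.h>
#include <linux/errno.h>

/* copy one 512-byte sector 'n' of 'bdev' into 'out' */
static int my_read_sector(struct block_device *bdev, sector_t n,
                          unsigned char *out)
{
        Sector sect;
        unsigned char *data;

        /* returns a pointer into the page-cache page backing sector n;
           read_mapping_page() underneath does the actual disk read */
        data = read_dev_sector(bdev, n, &sect);
        if (!data)
                return -EIO;

        memcpy(out, data, 512);
        put_dev_sector(sect);   /* drop the page reference taken above */
        return 0;
}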
From the above, read_mapping_page() reads the data through the address_space "mapping". These mappings (which also carry the byte/sector offset information to read) ultimately end up going through the hardware driver's API (depending, e.g., on whether the device is SATA or IDE).
In drivers/ata/libata-core.c there is one read function:
/**
 *      sata_scr_read - read SCR register of the specified port
 *      @link: ATA link to read SCR for
 *      @reg: SCR to read
 *      @val: Place to store read value
 *
 *      Read SCR register @reg of @link into *@val. This function is
 *      guaranteed to succeed if @link is ap->link, the cable type of
 *      the port is SATA and the port implements ->scr_read.
 *
 *      LOCKING:
 *      None if @link is ap->link. Kernel thread context otherwise.
 *
 *      RETURNS:
 *      0 on success, negative errno on failure.
 */
int sata_scr_read(struct ata_link *link, int reg, u32 *val)
{
        if (ata_is_host_link(link)) {
                if (sata_scr_valid(link))
                        return link->ap->ops->scr_read(link, reg, val);
                return -EOPNOTSUPP;
        }

        return sata_pmp_scr_read(link, reg, val);
}
This reads through the hardware-specific function pointer scr_read(). For example, for Marvell SATA controllers it is implemented as mv_scr_read() (in drivers/ata/sata_mv.c):
static int mv_scr_read(struct ata_link *link, unsigned int sc_reg_in, u32 *val)
{
        unsigned int ofs = mv_scr_offset(sc_reg_in);

        if (ofs != 0xffffffffU) {
                *val = readl(mv_ap_base(link->ap) + ofs);
                return 0;
        } else
                return -EINVAL;
}
which is the final 4-byte readl() from the controller's memory-mapped registers (the actual data is DMA'd from/to the hard disk, so effectively this path ends up reading from the hard disk).
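The scr_read indirection works because each low-level driver fills in an ata_port_operations table; the tables in sata_mv.c point their .scr_read/.scr_write entries at mv_scr_read()/mv_scr_write(). Very roughly, and only as an illustrative fragment (my_scr_read/my_scr_write are placeholder names, and a real table contains many more callbacks):

#include <linux/libata.h>

static int my_scr_read(struct ata_link *link, unsigned int sc_reg, u32 *val);
static int my_scr_write(struct ata_link *link, unsigned int sc_reg, u32 val);

static struct ata_port_operations my_sata_ops = {
        /* sata_scr_read()/sata_scr_write() in libata-core.c end up calling
           these through link->ap->ops->scr_read()/scr_write() */
        .scr_read       = my_scr_read,
        .scr_write      = my_scr_write,
};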
More info can be found in the libata developer's guide.
More stap traces show a lot more variation in how a read can be triggered:
7243 kblockd/0(132): <- blk_fetch_request
0 hald-addon-stor(2412): -> blk_fetch_request
0xffffffff8124d466 : blk_fetch_request+0x0/0x46 [kernel]
0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
0xffffffff8124dffd : __blk_run_queue+0x12c/0x2cc [kernel]
0xffffffff81241245 : elv_insert+0xea/0x2db [kernel]
0xffffffff81241516 : __elv_add_request+0xe0/0xef [kernel]
0xffffffff81253fb9 : blk_execute_rq_nowait+0xb1/0x109 [kernel]
0xffffffff812540b5 : blk_execute_rq+0xa4/0xdf [kernel]
0xffffffff8124d7d5 : blk_put_request+0x57/0x66 [kernel] (inexact)
0xffffffff81259dfd : scsi_cmd_ioctl+0x755/0x771 [kernel] (inexact)
0xffffffff812644db : kobject_get+0x28/0x37 [kernel] (inexact)
0xffffffff81257440 : get_disk+0x108/0x13b [kernel] (inexact)
0xffffffff81189847 : __blkdev_get+0x3e5/0x4e9 [kernel] (inexact)
0xffffffff81255c58 : __blkdev_driver_ioctl+0x80/0xb1 [kernel] (inexact)
0xffffffff81256abb : blkdev_ioctl+0xd8a/0xdb0 [kernel] (inexact)
6670 hald-addon-stor(2412): <- blk_fetch_request
0 hald-addon-stor(2412): -> blk_fetch_request
0xffffffff8124d466 : blk_fetch_request+0x0/0x46 [kernel]
0xffffffff81508876 : kretprobe_trampoline+0x0/0x53 [kernel]
0xffffffff8124dffd : __blk_run_queue+0x12c/0x2cc [kernel]
0xffffffff81241245 : elv_insert+0xea/0x2db [kernel]
0xffffffff81241516 : __elv_add_request+0xe0/0xef [kernel]
0xffffffff81253fb9 : blk_execute_rq_nowait+0xb1/0x109 [kernel]
0xffffffff812540b5 : blk_execute_rq+0xa4/0xdf [kernel]
0xffffffff81189847 : __blkdev_get+0x3e5/0x4e9 [kernel] (inexact)
0xffffffff81256abb : blkdev_ioctl+0xd8a/0xdb0 [kernel] (inexact)
0xffffffff8118996b : blkdev_open+0x0/0x107 [kernel] (inexact)
0xffffffff8115f8d3 : do_filp_open+0x839/0xfd7 [kernel] (inexact)
0xffffffff813471bc : put_device+0x25/0x2e [kernel] (inexact)
0xffffffff81189284 : __blkdev_put+0xea/0x20f [kernel] (inexact)
0xffffffff811893c0 : blkdev_put+0x17/0x20 [kernel] (inexact)
0xffffffff8118941a : blkdev_close+0x51/0x5d [kernel] (inexact)
0xffffffff8114fed2 : __fput+0x1bb/0x308 [kernel] (inexact)
Regards,
Shameem
----- Original Message ----
> From: shailesh jain <coolworldofshail@xxxxxxxxx>
> To: Ed Cashin <ecashin@xxxxxxxxxx>
> Cc: kernelnewbies@xxxxxxxxxxxx
> Sent: Tue, November 10, 2009 3:23:59 AM
> Subject: Re: Difference between major page fault and minor page fault
>
> Minor faults can occur at many places:
>
> 1) Shared pages among processes / swap cache - A process can take a
> page fault when the page is already present in the swap cache. This
> will be a minor fault since you will not go to disk.
>
> 2) COW but no fork - Memory is not allocated initially for malloc. It
> will point to the global zero page; however, when the process attempts to
> write to it you will get a minor page fault.
>
> 3) Stack is expanding. Check if the fault occurred close to the
> bottom of the stack; if yes, then allocate a page under the assumption
> that the stack is expanding.
>
> 4) Vmalloc address space. A page fault can occur in the kernel address space
> for the vmalloc area. When this happens you sync up the process's page tables
> with the master page table (init_mm).
>
> 5) COW for fork.
>
>
> Shailesh Jain
>
> On Mon, Nov 9, 2009 at 7:08 AM, Ed Cashin wrote:
> > Shameem Ahamed writes:
> >
> >> Hi,
> >>
> >> Can anyone explain the difference between major and minor page faults.
> >>
> >> As far as I know, a major page fault involves a disk access to retrieve
> >> the data, while a minor page fault occurs mainly for COW pages. Are there
> >> any instances other than COW where there will be a minor page fault?
> >> Which kernel function handles the major page fault?
> >
> > Ignoring error cases, arch/x86/mm/fault.c:do_page_fault calls
> > mm/memory.c:handle_mm_fault and looks for the flags, VM_FAULT_MAJOR or
> > VM_FAULT_MINOR in the returned value, so the definitive answer is in
> > how that return value gets set. The handle_mm_fault value comes from
> > called function hugetlb_fault or handle_pte_fault (again, ignoring
> > error conditions). I'd suggest starting your inquiry by looking at
> > the logic in handle_pte_fault.
> >
> >> Also, can anyone please confirm that in 2.6 kernels, the page cache and
> >> buffer cache are unified? Now we have only one cache, which includes
> >> both the buffer cache and the page cache.
> >
> > They were last I heard. Things move so fast these days that I can't
> > keep up! :)
> >
--
Regards,
Peter Teoh