On Thu, Jan 20, 2011 at 12:10:14PM +0800, Andrew Morton wrote:
> On Thu, 20 Jan 2011 11:21:49 +0800 Shaohua Li <shaohua.li@xxxxxxxxx> wrote:
>
> > > It seems to return a single offset/length tuple which refers to the
> > > btrfs metadata "file", with the intent that this tuple later be fed
> > > into a btrfs-specific readahead ioctl.
> > >
> > > I can see how this might be used with say fatfs or ext3 where all
> > > metadata resides within the blockdev address_space. But how is a
> > > filesystem which keeps its metadata in multiple address_spaces supposed
> > > to use this interface?
> >
> > Oh, this looks like a big problem, thanks for letting me know about such
> > filesystems. Is it possible for a specific filesystem to map multiple
> > address_space ranges into one big virtual range? The new ioctls could
> > then handle the mapping.
>
> I'm not sure what you mean by that.
>
> ext2, minix and probably others create an address_space for each
> directory. Heaven knows what xfs does (for example).
>
> > If the issue can't be solved, we can only add the metadata readahead as
> > a filesystem-specific implementation like my initial post, instead of a
> > generic interface.
>
> Well. One approach would be for the kernel to report the names of all
> presently-cached files, and for each file to report the offsets of all
> the pages which are presently in pagecache. This all gets put into a
> database.
>
> At cold-boot time we open all those files and read the relevant pages.
>
> To optimise that further, userspace would need to use fibmap to work
> out the LBA(s) of each page, and then read the pages in an optimised
> order.
>
> To optimise that even further, userspace would need to find the on-disk
> locations of all the metadata for each file, generate the metadata->data
> dependencies and then incorporate that into the reading order.
>
> I actually wrote code to do all this. Gad, it was ten years ago.
> I forget how it works, but I do recall that it pioneered the technology
> of doing (effectively) a sys_write(1, ...) from a kernel module, so the
> module's output appears on modprobe's stdout and can be redirected to
> another file or a pipe. So sue me! It's in
> http://userweb.kernel.org/~akpm/stuff/fboot.tar.gz. Good luck with
> that ;)
>
> <looks>
>
> It walked mem_map[], identifying pagecache pages, walking back from the
> page* all the way to the filename, then logging the pathname and the
> file's pagecache indexes. It also handled the blockdev superblock,
> where all the ext3 metadata resides.
>
> There are much smarter ways of doing this of course, especially with
> the vfs data structures which we later added.

Yup :)

The attached patch walks sb->s_inodes and dumps an ordered view of all
cached file pages. It lists each cached file and its pages in the order
the struct inode was created.

The patch also records and shows the name of the command that first
opened the file. (By the time we dump the page cache, that task may no
longer exist.) Although the field is very useful in some cases, it does
add runtime overhead. I'm not sure how to balance this. Add a compile
time option? But then the trace output becomes dependent on the kernel
configuration, which may confuse user space tools (at least the dumb
ones).

Otherwise the patch should be good enough for wider review. Here is a
trimmed example output.

root@bay /home/wfg# echo / > /debug/tracing/objects/mm/pages/dump-fs
root@bay /home/wfg# cat /debug/tracing/trace

The output is made of intermixed lines for inodes and pages.
The corresponding field names are:

file lines: ino size cached age(ms) dirty type first-opened-by file-name
page lines: index len page-flags count mapcount

1507329 4096 8192 309042 ____ DIR swapper /
      0 2 ____RU_____ 1 0
1786836 12288 40960 309026 ____ DIR swapper /sbin
      0 10 ___ARU_____ 1 0
1786946 37312 40960 309024 ____ REG swapper /sbin/init
      0 6 M__ARU_____ 2 1
      6 1 M__A_U_____ 2 1
      7 1 M__ARU_____ 2 1
      8 2 _____U_____ 1 0
1507464 4 4096 309022 ____ LNK swapper /lib64
      0 1 ___ARU_____ 1 0
1590173 12288 0 309021 ____ DIR swapper /lib
4563326 12 4096 309020 ____ LNK swapper /lib/ld-linux-x86-64.so.2
      0 1 ___ARU_____ 1 0
4563295 128744 131072 309019 ____ REG swapper /lib/ld-2.11.2.so
      0 1 M__ARU_____ 21 20
      1 3 M__ARU_____ 17 16
      4 4 M__ARU_____ 20 19
      8 2 M__ARU_____ 27 26
      10 3 M__ARU_____ 20 19
      13 1 M__ARU_____ 27 26
      14 1 M__ARU_____ 26 25
      15 1 M__ARU_____ 20 19
      16 1 M__ARU_____ 18 17
      17 1 M__ARU_____ 9 8
      18 1 M__A_U_____ 4 3
      19 1 M__ARU_____ 27 26
      20 1 M__ARU_____ 17 16
      21 1 M__ARU_____ 20 19
      22 1 M__ARU_____ 27 26
      23 1 M__ARU_____ 20 19
      24 1 M__ARU_____ 26 25
      25 1 _____U_____ 1 0
      26 1 M__A_U_____ 4 3
      27 1 M__ARU_____ 20 19
      28 4 _____U_____ 1 0
1525477 12288 0 309011 ____ DIR init /etc
1526463 64634 65536 309009 ____ REG init /etc/ld.so.cache
      0 1 ___ARU_____ 1 0
      1 1 _____U_____ 1 0
      2 13 ___ARU_____ 1 0
      15 1 ____RU_____ 1 0
1590258 241632 241664 309005 ____ REG init /lib/libsepol.so.1
      0 5 M__ARU_____ 2 1
      5 42 _____U_____ 1 0
      47 1 M__ARU_____ 2 1
      48 11 _____U_____ 1 0
1590330 117848 118784 308989 ____ REG init /lib/libselinux.so.1
      0 1 M__ARU_____ 7 6
      1 4 M__ARU_____ 4 3
      5 1 M__ARU_____ 5 4
      6 5 _____U_____ 1 0
      11 2 M__ARU_____ 4 3
      13 5 _____U_____ 1 0
      18 1 ___ARU_____ 1 0
      19 2 _____U_____ 1 0
      21 1 M__ARU_____ 5 4
      22 7 _____U_____ 1 0
4563314 14 4096 308982 ____ LNK init /lib/libc.so.6
      0 1 ___ARU_____ 1 0
4563283 1432968 1433600 308981 ____ REG init /lib/libc-2.11.2.so
      0 3 M__ARU_____ 27 26
      3 1 M__ARU_____ 25 24
      4 2 M__ARU_____ 23 22
      6 1 M__ARU_____ 26 25
      7 1 M__ARU_____ 22 21
      8 1 M__ARU_____ 27 26
      9 2 M__ARU_____ 25 24
      11 1 M__ARU_____ 23 22
      12 1 M__ARU_____ 25 24
      13 1 M__ARU_____ 24 23
      14 1 M__ARU_____ 25 24
      15 3 M__ARU_____ 24 23
      18 3 M__ARU_____ 26 25
      21 2 M__ARU_____ 27 26
      23 7 M__ARU_____ 17 16
      30 1 M__ARU_____ 29 28
      31 1 M__ARU_____ 25 24
      32 2 M__ARU_____ 4 3
      34 1 M__ARU_____ 3 2
      35 2 M__ARU_____ 4 3
      37 1 M__ARU_____ 2 1
      38 1 _____U_____ 1 0
      39 1 M__ARU_____ 4 3
      40 1 M__ARU_____ 13 12
      41 1 M__ARU_____ 12 11
      42 1 M__ARU_____ 5 4
      43 1 M__ARU_____ 23 22
      44 2 M__ARU_____ 6 5
      46 1 ___ARU_____ 1 0
      47 1 M__ARU_____ 12 11
      48 1 M__ARU_____ 4 3
      49 1 M__ARU_____ 18 17
      50 1 M__ARU_____ 29 28
      51 2 M__ARU_____ 2 1
      53 1 M__ARU_____ 27 26
      54 1 M__ARU_____ 19 18
      55 1 M__ARU_____ 25 24
      56 2 _____U_____ 1 0
      58 2 M__ARU_____ 2 1
      60 2 _____U_____ 1 0
      62 1 M__A_U_____ 2 1
      63 1 _____U_____ 1 0
      64 1 ___ARU_____ 1 0
      65 3 M__ARU_____ 29 28
      68 1 M__ARU_____ 21 20
      69 1 M__ARU_____ 26 25
      70 1 M__ARU_____ 9 8
      71 1 M__ARU_____ 3 2
      72 2 ___ARU_____ 1 0
      74 2 _____U_____ 1 0
      76 1 M__ARU_____ 27 26
      77 2 M__ARU_____ 13 12
      79 1 M__ARU_____ 9 8
      80 1 M__ARU_____ 10 9
      81 1 M__A_U_____ 2 1
      82 1 M___RU_____ 4 3
      83 1 M__ARU_____ 3 2
      84 1 M__ARU_____ 16 15
      85 1 M__ARU_____ 3 2
      86 12 _____U_____ 1 0
      98 1 M__ARU_____ 26 25
      99 1 M__ARU_____ 25 24
      100 2 M__ARU_____ 17 16
      102 1 M__ARU_____ 25 24
      103 1 M__ARU_____ 18 17
      104 1 M__ARU_____ 14 13
      105 3 _____U_____ 1 0
      108 1 M__ARU_____ 12 11
      109 2 M__ARU_____ 26 25
      111 6 M__ARU_____ 30 29
      117 1 M__ARU_____ 29 28
      118 1 M__ARU_____ 30 29
      119 1 M__ARU_____ 19 18
      120 1 M__ARU_____ 22 21
      121 1 M__ARU_____ 3 2
      122 1 M__ARU_____ 28 27
      123 1 M__ARU_____ 30 29
      124 1 M__ARU_____ 11 10
      125 1 M__ARU_____ 26 25
      126 1 M__ARU_____ 22 21
      127 2 M__ARU_____ 29 28
      129 2 M__ARU_____ 5 4
      131 1 M__ARU_____ 10 9
      132 1 M__ARU_____ 25 24
      133 2 M__ARU_____ 17 16
      135 1 M__ARU_____ 3 2
      136 6 _____U_____ 1 0
      142 2 M__ARU_____ 3 2
      144 1 M__ARU_____ 8 7
      145 1 M__ARU_____ 22 21
      146 3 M__ARU_____ 8 7
      149 2 _____U_____ 1 0
      151 3 M__ARU_____ 6 5
      154 2 _____U_____ 1 0
      156 1 M__ARU_____ 8 7
      157 1 M__ARU_____ 10 9
      158 1 M__ARU_____ 9 8
      159 1 M__ARU_____ 8 7
      160 1 M__ARU_____ 28 27
      161 1 M__ARU_____ 30 29
      162 1 M__ARU_____ 14 13
      163 1 M____U_____ 2 1
      164 2 _____U_____ 1 0
      166 2 M__ARU_____ 4 3
      168 1 M__ARU_____ 12 11
      169 1 M__ARU_____ 10 9
      170 1 M__ARU_____ 4 3
      171 3 M__ARU_____ 3 2
      174 6 ___ARU_____ 1 0
      180 1 _____U_____ 1 0
      181 9 ___ARU_____ 1 0
      190 1 M__ARU_____ 4 3
      191 1 ___A_U_____ 1 0
      192 1 _____U_____ 1 0
      193 1 ___A_U_____ 1 0
      194 1 M__ARU_____ 30 29
      195 1 M__ARU_____ 27 26
      196 1 M__ARU_____ 17 16
      197 2 _____U_____ 1 0
      199 1 M__ARU_____ 27 26
      200 1 M__ARU_____ 25 24
      201 1 M__ARU_____ 2 1
      202 1 M__ARU_____ 9 8
      203 1 M__ARU_____ 26 25
      204 1 M__ARU_____ 14 13
      205 1 M__ARU_____ 4 3
      206 1 M__ARU_____ 18 17
      207 1 M__ARU_____ 26 25
      208 1 M__ARU_____ 22 21
      209 1 M__ARU_____ 2 1
      210 1 M__ARU_____ 3 2
      211 2 M____U_____ 2 1
      213 5 _____U_____ 1 0
      218 1 ___A_U_____ 1 0

> <googles>
>
> According to http://kerneltrap.org/node/2157 it sped up cold boot by
> "10%", whatever that means. Seems that I wasn't sufficiently impressed
> by that and got distracted.
>
> I'm not sure any of that was very useful, really. A full-on coldboot
> optimiser really wants visibility into every disk block which needs to
> be read, and then mechanisms to tell the kernel to load those blocks
> into the correct address_spaces. That's hard, because file data
> depends on file metadata. A vast simplification would be to do it in
> two disk passes: read all the metadata on pass 1, then all the data on
> pass 2.

Yes, that is what this patchset tries to do.

> A totally different approach is to reorder all the data and metadata
> on-disk, so no special cold-boot processing is needed at all.

The boot time speedup mentioned in the changelog won't be possible
without the physical data/metadata reordering. Fortunately btrfs makes
it a trivial task.
> And a third approach is to save all the cache into a special
> file/partition/etc and to preload all that into kernel data structures
> at boot. Obviously this one is risky/tricky because the on-disk
> replica of the real data can get out of sync with the real data.

Hah! We are thinking much alike :)

It's a very good optimization for LiveCDs and readonly-mounted NFS /usr.

For a typical desktop, the solution in my mind is to install an
initscript that runs at halt/reboot time, after all other tasks have
been killed and the filesystems remounted readonly. At that point it
can dump whatever is in the page cache to the swap partition. At the
next boot, the data/metadata can then be read back _perfectly
sequentially_ to populate the page cache.

For a kexec based reboot, the data could even be passed to the next
kernel directly, avoiding the disk IO entirely.

Thanks,
Fengguang
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ mmotm/include/trace/events/mm.h	2010-12-26 20:59:48.000000000 +0800
@@ -0,0 +1,164 @@
+#if !defined(_TRACE_MM_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MM_H
+
+#include <linux/tracepoint.h>
+#include <linux/page-flags.h>
+#include <linux/memcontrol.h>
+#include <linux/pagemap.h>
+#include <linux/mm.h>
+#include <linux/kernel-page-flags.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mm
+
+extern struct trace_print_flags pageflag_names[];
+
+/**
+ * dump_page_frame - called by the trace page dump trigger
+ * @pfn: page frame number
+ * @page: pointer to the page frame
+ *
+ * This is a helper trace point into the dumping of the page frames.
+ * It will record various information about a page frame.
+ */
+TRACE_EVENT(dump_page_frame,
+
+	TP_PROTO(unsigned long pfn, struct page *page),
+
+	TP_ARGS(pfn, page),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	pfn		)
+		__field(	struct page *,	page		)
+		__field(	u64,		stable_flags	)
+		__field(	unsigned long,	flags		)
+		__field(	unsigned int,	count		)
+		__field(	unsigned int,	mapcount	)
+		__field(	unsigned long,	private		)
+		__field(	unsigned long,	mapping		)
+		__field(	unsigned long,	index		)
+	),
+
+	TP_fast_assign(
+		__entry->pfn		= pfn;
+		__entry->page		= page;
+		__entry->stable_flags	= stable_page_flags(page);
+		__entry->flags		= page->flags;
+		__entry->count		= atomic_read(&page->_count);
+		__entry->mapcount	= page_mapcount(page);
+		__entry->private	= page->private;
+		__entry->mapping	= (unsigned long)page->mapping;
+		__entry->index		= page->index;
+	),
+
+	TP_printk("%12lx %16p %8x %8x %16lx %16lx %16lx %s",
+		__entry->pfn,
+		__entry->page,
+		__entry->count,
+		__entry->mapcount,
+		__entry->private,
+		__entry->mapping,
+		__entry->index,
+		ftrace_print_flags_seq(p, "|",
+			__entry->flags & PAGE_FLAGS_MASK,
+			pageflag_names)
+	)
+);
+
+TRACE_EVENT(dump_page_cache,
+
+	TP_PROTO(struct page *page, unsigned long len),
+
+	TP_ARGS(page, len),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	index		)
+		__field(	unsigned long,	len		)
+		__field(	u64,		flags		)
+		__field(	unsigned int,	count		)
+		__field(	unsigned int,	mapcount	)
+	),
+
+	TP_fast_assign(
+		__entry->index		= page->index;
+		__entry->len		= len;
+		__entry->flags		= stable_page_flags(page);
+		__entry->count		= atomic_read(&page->_count);
+		__entry->mapcount	= page_mapcount(page);
+	),
+
+	TP_printk("%12lu %6lu %c%c%c%c%c%c%c%c%c%c%c %4u %4u",
+		__entry->index,
+		__entry->len,
+		__entry->flags & (1ULL << KPF_MMAP)		? 'M' : '_',
+		__entry->flags & (1ULL << KPF_MLOCKED)		? 'm' : '_',
+		__entry->flags & (1ULL << KPF_UNEVICTABLE)	? 'u' : '_',
+		__entry->flags & (1ULL << KPF_ACTIVE)		? 'A' : '_',
+		__entry->flags & (1ULL << KPF_REFERENCED)	? 'R' : '_',
+		__entry->flags & (1ULL << KPF_UPTODATE)		? 'U' : '_',
+		__entry->flags & (1ULL << KPF_DIRTY)		? 'D' : '_',
+		__entry->flags & (1ULL << KPF_WRITEBACK)	? 'W' : '_',
+		__entry->flags & (1ULL << KPF_RECLAIM)		? 'I' : '_',
+		__entry->flags & (1ULL << KPF_MAPPEDTODISK)	? 'd' : '_',
+		__entry->flags & (1ULL << KPF_PRIVATE)		? 'P' : '_',
+		__entry->count,
+		__entry->mapcount)
+);
+
+
+#define show_inode_type(val) __print_symbolic(val,	\
+	{ S_IFREG,	"REG"	},			\
+	{ S_IFDIR,	"DIR"	},			\
+	{ S_IFLNK,	"LNK"	},			\
+	{ S_IFBLK,	"BLK"	},			\
+	{ S_IFCHR,	"CHR"	},			\
+	{ S_IFIFO,	"FIFO"	},			\
+	{ S_IFSOCK,	"SOCK"	})
+
+TRACE_EVENT(dump_inode_cache,
+
+	TP_PROTO(struct inode *inode, char *name, int len),
+
+	TP_ARGS(inode, name, len),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	ino	)
+		__field(	loff_t,		size	)	/* bytes */
+		__field(	loff_t,		cached	)	/* bytes */
+		__field(	unsigned long,	age	)	/* ms */
+		__field(	unsigned long,	state	)
+		__field(	umode_t,	mode	)
+		__array(	char,	comm,	TASK_COMM_LEN	)
+		__dynamic_array(char,	file,	len	)
+	),
+
+	TP_fast_assign(
+		__entry->ino	= inode->i_ino;
+		__entry->size	= i_size_read(inode);
+		__entry->cached	= inode->i_mapping->nrpages;
+		__entry->cached	<<= PAGE_CACHE_SHIFT;
+		__entry->age	= (jiffies - inode->dirtied_when) * 1000 / HZ;
+		__entry->state	= inode->i_state;
+		__entry->mode	= inode->i_mode;
+		memcpy(__entry->comm, inode->i_comm, TASK_COMM_LEN);
+		memcpy(__get_str(file), name, len);
+	),
+
+	TP_printk("%12lu %12llu %12llu %12lu %c%c%c%c %4s %16s %s",
+		__entry->ino,
+		__entry->size,
+		__entry->cached,
+		__entry->age,
+		__entry->state & I_DIRTY_PAGES		? 'D' : '_',
+		__entry->state & I_DIRTY_DATASYNC	? 'd' : '_',
+		__entry->state & I_DIRTY_SYNC		? 'm' : '_',
+		__entry->state & I_SYNC		? 'S' : '_',
+		show_inode_type(__entry->mode & S_IFMT),
+		__entry->comm,
+		__get_str(file))
+);
+
+#endif /* _TRACE_MM_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--- mmotm.orig/kernel/trace/Makefile	2010-12-26 20:58:46.000000000 +0800
+++ mmotm/kernel/trace/Makefile	2010-12-26 20:59:41.000000000 +0800
@@ -26,6 +26,7 @@ obj-$(CONFIG_RING_BUFFER) += ring_buffer
 obj-$(CONFIG_RING_BUFFER_BENCHMARK) += ring_buffer_benchmark.o
 
 obj-$(CONFIG_TRACING) += trace.o
+obj-$(CONFIG_TRACING) += trace_objects.o
 obj-$(CONFIG_TRACING) += trace_output.o
 obj-$(CONFIG_TRACING) += trace_stat.o
 obj-$(CONFIG_TRACING) += trace_printk.o
@@ -53,6 +54,7 @@ endif
 obj-$(CONFIG_EVENT_TRACING) += trace_events_filter.o
 obj-$(CONFIG_KPROBE_EVENT) += trace_kprobe.o
 obj-$(CONFIG_EVENT_TRACING) += power-traces.o
+obj-$(CONFIG_EVENT_TRACING) += trace_mm.o
 ifeq ($(CONFIG_TRACING),y)
 obj-$(CONFIG_KGDB_KDB) += trace_kdb.o
 endif
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ mmotm/kernel/trace/trace_mm.c	2010-12-26 20:59:41.000000000 +0800
@@ -0,0 +1,367 @@
+/*
+ * Trace mm pages
+ *
+ * Copyright (C) 2009 Red Hat Inc, Steven Rostedt <srostedt@xxxxxxxxxx>
+ *
+ * Code based on Matt Mackall's /proc/[kpagecount|kpageflags] code.
+ */
+#include <linux/module.h>
+#include <linux/bootmem.h>
+#include <linux/debugfs.h>
+#include <linux/uaccess.h>
+#include <linux/ctype.h>
+#include <linux/pagevec.h>
+#include <linux/writeback.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+
+#include "trace_output.h"
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/mm.h>
+
+void trace_mm_page_frames(unsigned long start, unsigned long end,
+			  void (*trace)(unsigned long pfn, struct page *page))
+{
+	unsigned long pfn = start;
+	struct page *page;
+
+	if (start > max_pfn - 1)
+		return;
+
+	if (end > max_pfn)
+		end = max_pfn;
+
+	while (pfn < end) {
+		page = NULL;
+		if (pfn_valid(pfn))
+			page = pfn_to_page(pfn);
+		pfn++;
+		if (page)
+			trace(pfn, page);
+	}
+}
+
+static void trace_mm_page_frame(unsigned long pfn, struct page *page)
+{
+	trace_dump_page_frame(pfn, page);
+}
+
+static ssize_t
+trace_mm_pfn_range_read(struct file *filp, char __user *ubuf, size_t cnt,
+			loff_t *ppos)
+{
+	return simple_read_from_buffer(ubuf, cnt, ppos, "0\n", 2);
+}
+
+
+/*
+ * recognized formats:
+ *	"M N"	start=M, end=N
+ *	"M"	start=M, end=M+1
+ *	"M +N"	start=M, end=M+N-1
+ */
+static ssize_t
+trace_mm_pfn_range_write(struct file *filp, const char __user *ubuf, size_t cnt,
+			 loff_t *ppos)
+{
+	unsigned long start;
+	unsigned long end = 0;
+	char buf[64];
+	char *ptr;
+
+	if (cnt >= sizeof(buf))
+		return -EINVAL;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+
+	if (tracing_update_buffers() < 0)
+		return -ENOMEM;
+
+	if (trace_set_clr_event("mm", "dump_page_frame", 1))
+		return -EINVAL;
+
+	buf[cnt] = 0;
+
+	start = simple_strtoul(buf, &ptr, 0);
+
+	for (; *ptr; ptr++) {
+		if (isdigit(*ptr)) {
+			if (*(ptr - 1) == '+')
+				end = start;
+			end += simple_strtoul(ptr, NULL, 0);
+			break;
+		}
+	}
+	if (!*ptr)
+		end = start + 1;
+
+	trace_mm_page_frames(start, end, trace_mm_page_frame);
+
+	return cnt;
+}
+
+static const struct file_operations trace_mm_fops = {
+	.open		= tracing_open_generic,
+	.read		= trace_mm_pfn_range_read,
+	.write		= trace_mm_pfn_range_write,
+};
+
+static struct dentry *trace_objects_mm_dir(void)
+{
+	static struct dentry *d_mm;
+	struct dentry *d_objects;
+
+	if (d_mm)
+		return d_mm;
+
+	d_objects = trace_objects_dir();
+	if (!d_objects)
+		return NULL;
+
+	d_mm = debugfs_create_dir("mm", d_objects);
+	if (!d_mm)
+		pr_warning("Could not create 'objects/mm' directory\n");
+
+	return d_mm;
+}
+
+static unsigned long page_flags(struct page *page)
+{
+	return page->flags & ((1 << NR_PAGEFLAGS) - 1);
+}
+
+static int pages_similar(struct page *page0, struct page *page)
+{
+	if (page_flags(page0) != page_flags(page))
+		return 0;
+
+	if (page_count(page0) != page_count(page))
+		return 0;
+
+	if (page_mapcount(page0) != page_mapcount(page))
+		return 0;
+
+	return 1;
+}
+
+static void dump_pagecache(struct address_space *mapping)
+{
+	unsigned long nr_pages;
+	struct page *pages[PAGEVEC_SIZE];
+	struct page *uninitialized_var(page0);
+	struct page *page;
+	unsigned long start = 0;
+	unsigned long len = 0;
+	int i;
+
+	for (;;) {
+		rcu_read_lock();
+		nr_pages = radix_tree_gang_lookup(&mapping->page_tree,
+				(void **)pages, start + len, PAGEVEC_SIZE);
+		rcu_read_unlock();
+
+		if (nr_pages == 0) {
+			if (len)
+				trace_dump_page_cache(page0, len);
+			return;
+		}
+
+		for (i = 0; i < nr_pages; i++) {
+			page = pages[i];
+
+			if (len &&
+			    page->index == start + len &&
+			    pages_similar(page0, page))
+				len++;
+			else {
+				if (len)
+					trace_dump_page_cache(page0, len);
+				page0 = page;
+				start = page->index;
+				len = 1;
+			}
+		}
+		cond_resched();
+	}
+}
+
+static void dump_inode_cache(struct inode *inode,
+			     char *name_buf,
+			     struct vfsmount *mnt)
+{
+	struct path path = {
+		.mnt = mnt,
+		.dentry = d_find_alias(inode)
+	};
+	char *name;
+	int len;
+
+	if (!mnt) {
+		trace_dump_inode_cache(inode, name_buf, strlen(name_buf));
+		return;
+	}
+
+	if (!path.dentry) {
+		trace_dump_inode_cache(inode, "", 1);
+		return;
+	}
+
+	name = d_path(&path, name_buf, PAGE_SIZE);
+	if (IS_ERR(name)) {
+		name = "";
+		len = 1;
+	} else
+		len = PAGE_SIZE + name_buf - name;
+
+	trace_dump_inode_cache(inode, name, len);
+
+	if (path.dentry)
+		dput(path.dentry);
+}
+
+static void dump_fs_pagecache(struct super_block *sb, struct vfsmount *mnt)
+{
+	struct inode *inode;
+	struct inode *prev_inode = NULL;
+	char *name_buf;
+
+	name_buf = (char *)__get_free_page(GFP_TEMPORARY);
+	if (!name_buf)
+		return;
+
+	down_read(&sb->s_umount);
+	if (!sb->s_root)
+		goto out;
+
+	spin_lock(&inode_lock);
+	list_for_each_entry_reverse(inode, &sb->s_inodes, i_sb_list) {
+		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
+			continue;
+		__iget(inode);
+		spin_unlock(&inode_lock);
+		dump_inode_cache(inode, name_buf, mnt);
+		if (inode->i_mapping->nrpages)
+			dump_pagecache(inode->i_mapping);
+		iput(prev_inode);
+		prev_inode = inode;
+		cond_resched();
+		spin_lock(&inode_lock);
+	}
+	spin_unlock(&inode_lock);
+	iput(prev_inode);
+out:
+	up_read(&sb->s_umount);
+	free_page((unsigned long)name_buf);
+}
+
+static ssize_t
+trace_pagecache_write(struct file *filp, const char __user *ubuf, size_t count,
+		      loff_t *ppos)
+{
+	struct file *file = NULL;
+	char *name;
+	int err = 0;
+
+	if (count <= 1)
+		return -EINVAL;
+	if (count >= PAGE_SIZE)
+		return -ENAMETOOLONG;
+
+	name = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!name)
+		return -ENOMEM;
+
+	if (copy_from_user(name, ubuf, count)) {
+		err = -EFAULT;
+		goto out;
+	}
+
+	/* strip the newline added by `echo` */
+	if (name[count-1] == '\n')
+		name[count-1] = '\0';
+	else
+		name[count] = '\0';
+
+	file = filp_open(name, O_RDONLY|O_LARGEFILE, 0);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		file = NULL;
+		goto out;
+	}
+
+	if (tracing_update_buffers() < 0) {
+		err = -ENOMEM;
+		goto out;
+	}
+	if (trace_set_clr_event("mm", "dump_page_cache", 1)) {
+		err = -EINVAL;
+		goto out;
+	}
+	if (trace_set_clr_event("mm", "dump_inode_cache", 1)) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (filp->f_path.dentry->d_inode->i_private) {
+		dump_fs_pagecache(file->f_path.dentry->d_sb, file->f_path.mnt);
+	} else {
+		dump_inode_cache(file->f_mapping->host, name, NULL);
+		dump_pagecache(file->f_mapping);
+	}
+
+out:
+	if (file)
+		fput(file);
+	kfree(name);
+
+	return err ? err : count;
+}
+
+static const struct file_operations trace_pagecache_fops = {
+	.open		= tracing_open_generic,
+	.read		= trace_mm_pfn_range_read,
+	.write		= trace_pagecache_write,
+};
+
+static struct dentry *trace_objects_mm_pages_dir(void)
+{
+	static struct dentry *d_pages;
+	struct dentry *d_mm;
+
+	if (d_pages)
+		return d_pages;
+
+	d_mm = trace_objects_mm_dir();
+	if (!d_mm)
+		return NULL;
+
+	d_pages = debugfs_create_dir("pages", d_mm);
+	if (!d_pages)
+		pr_warning("Could not create debugfs "
+			   "'objects/mm/pages' directory\n");
+
+	return d_pages;
+}
+
+static __init int trace_objects_mm_init(void)
+{
+	struct dentry *d_pages;
+
+	d_pages = trace_objects_mm_pages_dir();
+	if (!d_pages)
+		return 0;
+
+	trace_create_file("dump-pfn", 0600, d_pages, NULL,
+			  &trace_mm_fops);
+
+	trace_create_file("dump-file", 0600, d_pages, NULL,
+			  &trace_pagecache_fops);
+
+	trace_create_file("dump-fs", 0600, d_pages, (void *)1,
+			  &trace_pagecache_fops);
+
+	return 0;
+}
+fs_initcall(trace_objects_mm_init);
--- mmotm.orig/kernel/trace/trace.h	2010-12-26 20:58:46.000000000 +0800
+++ mmotm/kernel/trace/trace.h	2010-12-26 20:59:41.000000000 +0800
@@ -295,6 +295,7 @@ struct dentry *trace_create_file(const c
 				 const struct file_operations *fops);
 
 struct dentry *tracing_init_dentry(void);
+struct dentry *trace_objects_dir(void);
 
 struct ring_buffer_event;
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ mmotm/kernel/trace/trace_objects.c	2010-12-26 20:59:41.000000000 +0800
@@ -0,0 +1,26 @@
+#include <linux/debugfs.h>
+
+#include "trace.h"
+#include "trace_output.h"
+
+struct dentry *trace_objects_dir(void)
+{
+	static struct dentry *d_objects;
+	struct dentry *d_tracer;
+
+	if (d_objects)
+		return d_objects;
+
+	d_tracer = tracing_init_dentry();
+	if (!d_tracer)
+		return NULL;
+
+	d_objects = debugfs_create_dir("objects", d_tracer);
+	if (!d_objects)
+		pr_warning("Could not create debugfs "
+			   "'objects' directory\n");
+
+	return d_objects;
+}
+
+
--- mmotm.orig/mm/page_alloc.c	2010-12-26 20:58:46.000000000 +0800
+++ mmotm/mm/page_alloc.c	2010-12-26 20:59:41.000000000 +0800
@@ -5493,7 +5493,7 @@ bool is_free_buddy_page(struct page *pag
 }
 #endif
 
-static struct trace_print_flags pageflag_names[] = {
+struct trace_print_flags pageflag_names[] = {
 	{1UL << PG_locked,		"locked"	},
 	{1UL << PG_error,		"error"		},
 	{1UL << PG_referenced,		"referenced"	},
@@ -5541,7 +5541,7 @@ static void dump_page_flags(unsigned lon
 	printk(KERN_ALERT "page flags: %#lx(", flags);
 
 	/* remove zone id */
-	flags &= (1UL << NR_PAGEFLAGS) - 1;
+	flags &= PAGE_FLAGS_MASK;
 
 	for (i = 0; pageflag_names[i].name && flags; i++) {
--- mmotm.orig/include/linux/page-flags.h	2010-12-26 20:58:46.000000000 +0800
+++ mmotm/include/linux/page-flags.h	2010-12-26 20:59:41.000000000 +0800
@@ -414,6 +414,7 @@ static inline void __ClearPageTail(struc
  * there has been a kernel bug or struct page corruption.
  */
 #define PAGE_FLAGS_CHECK_AT_PREP	((1 << NR_PAGEFLAGS) - 1)
+#define PAGE_FLAGS_MASK			((1 << NR_PAGEFLAGS) - 1)
 
 #define PAGE_FLAGS_PRIVATE				\
 	(1 << PG_private | 1 << PG_private_2)
--- mmotm.orig/fs/inode.c	2010-12-26 20:58:45.000000000 +0800
+++ mmotm/fs/inode.c	2010-12-26 21:00:09.000000000 +0800
@@ -182,7 +182,13 @@ int inode_init_always(struct super_block
 	inode->i_bdev = NULL;
 	inode->i_cdev = NULL;
 	inode->i_rdev = 0;
-	inode->dirtied_when = 0;
+
+	/*
+	 * This records inode load time. It will be invalidated once inode is
+	 * dirtied, or jiffies wraps around. Despite the pitfalls it still
+	 * provides useful information for some use cases like fastboot.
+	 */
+	inode->dirtied_when = jiffies;
 
 	if (security_inode_alloc(inode))
 		goto out;
@@ -226,6 +232,9 @@ int inode_init_always(struct super_block
 
 	percpu_counter_inc(&nr_inodes);
 
+	BUILD_BUG_ON(sizeof(inode->i_comm) != TASK_COMM_LEN);
+	memcpy(inode->i_comm, current->comm, TASK_COMM_LEN);
+
 	return 0;
 out:
 	return -ENOMEM;
--- mmotm.orig/include/linux/fs.h	2010-12-26 20:59:50.000000000 +0800
+++ mmotm/include/linux/fs.h	2010-12-26 21:00:09.000000000 +0800
@@ -800,6 +800,8 @@ struct inode {
 	struct posix_acl	*i_default_acl;
 #endif
 	void			*i_private; /* fs or device private pointer */
+
+	char			i_comm[16]; /* first opened by */
 };
 
 static inline int inode_unhashed(struct inode *inode)