Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs

On Tue, Jan 11, 2011 at 11:27:33AM +0800, Li, Shaohua wrote:
> On Tue, 2011-01-11 at 11:07 +0800, Wu, Fengguang wrote:
> > On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote:
> > > On Tue, 2011-01-11 at 09:38 +0800, Wu, Fengguang wrote:
> > > > On Tue, Jan 11, 2011 at 08:15:19AM +0800, Li, Shaohua wrote:
> > > > > On Mon, 2011-01-10 at 22:26 +0800, Wu, Fengguang wrote:
> > > > > > Shaohua,
> > > > > >
> > > > > > On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
> > > > > > > Hi,
> > > > > > >   We have file readahead to do async file reads, but there is no metadata
> > > > > > > readahead. For a list of files, their metadata is stored in fragmented
> > > > > > > disk space, and metadata reads are sync operations, which hurts the
> > > > > > > efficiency of readahead a lot. These patches add metadata readahead
> > > > > > > for btrfs.
> > > > > > >   In btrfs, metadata is stored in btree_inode. Ideally we could hook
> > > > > > > the inode to a fd and use existing syscalls (readahead, mincore
> > > > > > > or the upcoming fincore) to do readahead, but the inode is hidden and there is
> > > > > > > no easy way to do this as far as I understand. So we add two ioctls for
> > > > > >
> > > > > > If that is the main obstacle, why not do straightforward fincore()/
> > > > > > fadvise(), and add ioctls to btrfs to export/grab the hidden
> > > > > > btree_inode in any form?  This will address btrfs' specific issue, and
> > > > > > have the benefit of making the VFS part general enough. You know
> > > > > > ext2/3/4 already have block_dev ready for metadata readahead.
> > > > > I forgot to update this comment. Please see patch 2 and patch 4: both
> > > > > incore and readahead need btrfs specific stuff involved, so we can't use
> > > > > a generic fincore or the like.
> > > >
> > > > You can if you like :)
> > > >
> > > > - fincore() can return the referenced bit, which is generally
> > > >   useful information
> > > metadata pages in ext2/3 don't have the referenced bit set, while btrfs does.
> > > We can't blindly filter out such pages based on the bit.
> >
> > block_dev inodes have the accessed bits. Look at the below output.
> >
> > /dev/sda5 is a mounted ext4 partition.  The 'A'/'R' in the
> > dump_page_cache lines stand for Active/Referenced.
> ext4 already does readahead? Please check other filesystems.

ext3/4 do readahead when accessing large directories. However, that's an
orthogonal feature to user space metadata readahead; the latter is
still important for fast boot on ext3/4.

> filesystems use bread-like APIs to read metadata, which definitely
> don't set the referenced bit.

__find_get_block() will call touch_buffer(), which is synonymous with
mark_page_accessed().
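
For reference, the path I mean looks roughly like this (a sketch from
memory, not a verbatim copy; see fs/buffer.c and
include/linux/buffer_head.h):

	/* fs/buffer.c: every buffer cache lookup touches the buffer */
	struct buffer_head *
	__find_get_block(struct block_device *bdev, sector_t block, unsigned size)
	{
		struct buffer_head *bh = lookup_bh_lru(bdev, block, size);

		if (bh == NULL) {
			bh = __find_get_block_slow(bdev, block);
			if (bh)
				bh_lru_install(bh);
		}
		if (bh)
			touch_buffer(bh);
		return bh;
	}

	/* include/linux/buffer_head.h */
	#define touch_buffer(bh)	mark_page_accessed(bh->b_page)

So a bread()/sb_bread() style metadata read that goes through the
buffer cache ends up marking the backing block_dev pages as
referenced/active, which is what the quoted dump below shows.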

> > root@bay /home/wfg# echo /dev/sda5 > /debug/tracing/objects/mm/pages/dump-file
> > root@bay /home/wfg# cat /debug/tracing/trace
> > # tracer: nop
> > #
> > #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
> > #              | |       |          |         |
> >              zsh-2950  [003]   879.500764: dump_inode_cache:            0  55643986944      1703936        21879 D___  BLK            mount /dev/sda5
> >              zsh-2950  [003]   879.500774: dump_page_cache:            0      2 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500776: dump_page_cache:            2      3 ____R_____P    2    0
> >              zsh-2950  [003]   879.500777: dump_page_cache:         1026      5 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500778: dump_page_cache:         1031      3 ___A______P    2    0
> >              zsh-2950  [003]   879.500779: dump_page_cache:         1034      1 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500780: dump_page_cache:         1035      2 ___A______P    2    0
> >              zsh-2950  [003]   879.500781: dump_page_cache:         1037      1 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500782: dump_page_cache:         1038      3 ____R_____P    2    0
> >              zsh-2950  [003]   879.500782: dump_page_cache:         1041      1 ___A______P    2    0
> >              zsh-2950  [003]   879.500783: dump_page_cache:         1057      1 ___AR_D___P    2    0
> >              zsh-2950  [003]   879.500788: dump_page_cache:         1058      6 ___A______P    2    0
> >              zsh-2950  [003]   879.500788: dump_page_cache:         9249      1 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500789: dump_page_cache:       524289      1 ____R_____P    2    0
> >              zsh-2950  [003]   879.500790: dump_page_cache:       524290      2 ___A______P    2    0
> >              zsh-2950  [003]   879.500790: dump_page_cache:       524292      1 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500791: dump_page_cache:       524293      1 ___A______P    2    0
> >              zsh-2950  [003]   879.500796: dump_page_cache:       524294      9 ____R_____P    2    0
> >              zsh-2950  [003]   879.500797: dump_page_cache:       524303      1 ___A______P    2    0
> >              zsh-2950  [003]   879.500798: dump_page_cache:       987136      1 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500798: dump_page_cache:      1048576      1 ____R_____P    2    0
> >              zsh-2950  [003]   879.500799: dump_page_cache:      1048577      2 ___A______P    2    0
> >              zsh-2950  [003]   879.500800: dump_page_cache:      1048579      1 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500801: dump_page_cache:      1048580      5 ___A______P    2    0
> >              zsh-2950  [003]   879.500802: dump_page_cache:      1048585      1 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500805: dump_page_cache:      1048586      5 ___A______P    2    0
> >              zsh-2950  [003]   879.500805: dump_page_cache:      1048591      1 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500806: dump_page_cache:      1572864      1 ____R_____P    2    0
> >              zsh-2950  [003]   879.500807: dump_page_cache:      1572865      5 ___A______P    2    0
> >              zsh-2950  [003]   879.500808: dump_page_cache:      1572870      1 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500811: dump_page_cache:      1572871      6 ___A______P    2    0
> >              zsh-2950  [003]   879.500812: dump_page_cache:      1572877      3 ____R_____P    2    0
> >              zsh-2950  [003]   879.500816: dump_page_cache:      2097153      8 ____R_____P    2    0
> >              zsh-2950  [003]   879.500817: dump_page_cache:      2097161      1 ___A______P    2    0
> >              zsh-2950  [003]   879.500818: dump_page_cache:      2097162      4 ____R_____P    2    0
> >              zsh-2950  [003]   879.500819: dump_page_cache:      6324224      1 ____R_D___P    2    0
> >              zsh-2950  [003]   879.500820: dump_page_cache:      6324225      3 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500825: dump_page_cache:      6324228     29 ___A______P    2    0
> >              zsh-2950  [003]   879.500826: dump_page_cache:      6324257      1 ____R_____P    2    0
> >              zsh-2950  [003]   879.500828: dump_page_cache:      6324258      4 ___A______P    2    0
> >              zsh-2950  [003]   879.500830: dump_page_cache:      6324262     11 ____R_____P    2    0
> >              zsh-2950  [003]   879.500833: dump_page_cache:      6324273     16 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500833: dump_page_cache:      6324289      1 ___A______P    2    0
> >              zsh-2950  [003]   879.500834: dump_page_cache:      6324290      2 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500835: dump_page_cache:      6324292      8 ___A______P    2    0
> >              zsh-2950  [003]   879.500836: dump_page_cache:      6324300      2 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500837: dump_page_cache:      6324302      3 ___A______P    2    0
> >              zsh-2950  [003]   879.500838: dump_page_cache:      6324305      4 ____R_____P    2    0
> >              zsh-2950  [003]   879.500843: dump_page_cache:      6324309     28 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500844: dump_page_cache:      6324337      4 ___A______P    2    0
> >              zsh-2950  [003]   879.500845: dump_page_cache:      6324341      2 ____R_____P    2    0
> >              zsh-2950  [003]   879.500850: dump_page_cache:      6324343     30 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500851: dump_page_cache:      6324373      2 ___A______P    2    0
> >              zsh-2950  [003]   879.500852: dump_page_cache:      6324375      2 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500853: dump_page_cache:      6324377      9 ___A______P    2    0
> >              zsh-2950  [003]   879.500854: dump_page_cache:      6324386      2 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500855: dump_page_cache:      6324388      5 ___A______P    2    0
> >              zsh-2950  [003]   879.500856: dump_page_cache:      6324393      3 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500858: dump_page_cache:      6324396     11 ___A______P    2    0
> >              zsh-2950  [003]   879.500859: dump_page_cache:      6324407      1 ____R_____P    2    0
> >              zsh-2950  [003]   879.500864: dump_page_cache:      6324408     31 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500864: dump_page_cache:      6324439      1 ___A______P    2    0
> >              zsh-2950  [003]   879.500865: dump_page_cache:      6324440      1 ____R_____P    2    0
> >              zsh-2950  [003]   879.500866: dump_page_cache:      6324441      2 ___A______P    2    0
> >              zsh-2950  [003]   879.500867: dump_page_cache:      6324443      5 ____R_____P    2    0
> >              zsh-2950  [003]   879.500872: dump_page_cache:      6324448     26 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500873: dump_page_cache:      6324474      6 ___A______P    2    0
> >              zsh-2950  [003]   879.500874: dump_page_cache:      6324480      4 ____R_____P    2    0
> >              zsh-2950  [003]   879.500879: dump_page_cache:      6324484     28 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500880: dump_page_cache:      6324512      4 ___A______P    2    0
> >              zsh-2950  [003]   879.500881: dump_page_cache:      6324516      1 ____R_____P    2    0
> >              zsh-2950  [003]   879.500881: dump_page_cache:      6324517      1 ___A______P    2    0
> >              zsh-2950  [003]   879.500882: dump_page_cache:      6324518      2 ___AR_____P    2    0
> >              zsh-2950  [003]   879.500888: dump_page_cache:      6324520     28 ___A______P    2    0
> >              zsh-2950  [003]   879.500890: dump_page_cache:      6324548      2 ____R_____P    2    0
> >
> > > fincore can take a parameter or return a bit to distinguish
> > > referenced pages, but I don't think it's a good API. This should be
> > > transparent to userspace.
> >
> > Users who care about the "cached" status may well be interested in the
> > "active/referenced" status; they are correlated information. fincore()
> > won't be a simple replication of mincore() anyway: fincore() has to
> > deal with huge, sparsely accessed files. The accessed bits of a file
> > page are normally more meaningful than the accessed bits of mapped
> > (anonymous) pages.
> If all filesystems set the bit, I'll buy in. Otherwise this isn't generic enough.

Setting the accessed bits is a reasonable thing to do, so I believe the
various filesystems are already calling mark_page_accessed() on their metadata
inodes, or can be changed to do so.

> > Another option may be to use the above
> > /debug/tracing/objects/mm/pages/dump-file interface.
> >
> > > > - btrfs_metadata_readahead() can be passed to some (faked)
> > > >   ->readpages() for use with fadvise.
> > > this needs a filesystem specific hook too; the difference is that your proposal
> > > uses fadvise while I'm using an ioctl. There isn't a big difference.
> >
> > True for btrfs. However, they make a big difference for other file systems.
> why?

The block_dev inode of ext2/3/4 can serve metadata query/readahead directly
via fincore()+fadvise(), with no need for any additional ioctls.

Given that the vast majority of desktops are running ext2/3/4, it seems
worthwhile to have a straightforward solution for them.
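
To make that concrete, here is a rough user space sketch of the
fincore()+fadvise() scheme for ext2/3/4. fincore() is not merged yet,
so only the readahead half is shown; the device name and offsets
(picked from the page indexes in the dump above) are merely examples:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		/* the raw block device of the mounted ext4 partition */
		int fd = open("/dev/sda5", O_RDONLY);
		int err;

		if (fd < 0) {
			perror("open");
			return 1;
		}

		/*
		 * Ask for asynchronous readahead of ranges that a prior
		 * fincore()-style scan found to hold hot metadata.  Both
		 * calls operate on the plain block_dev page cache, so no
		 * filesystem specific ioctl is needed.
		 */
		err = posix_fadvise(fd, (off_t)1026 * 4096, 5 * 4096,
				    POSIX_FADV_WILLNEED);
		if (err)
			fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

		if (readahead(fd, (off64_t)524289 * 4096, 2 * 4096) < 0)
			perror("readahead");

		close(fd);
		return 0;
	}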

> > > BTW, it's hard to hook the btrfs_inode to a fd even with an ioctl; at least I
> > > didn't find an easy way to do this. It might be possible, for
> > > example by adding a fake device or fake fs (anon_inode doesn't work here,
> > > IIRC), but that is a bit ugly. Before it's proven that a generic API can handle
> > > metadata readahead, I don't want to do it.
> >
> > Right, it could be hard to export btrfs_inode. I'm glad you spelled it
> > out. If we cannot make it work, it's valuable to point out the problem and
> > let everyone know the root cause that makes us turn to an ioctl based workaround.
> > Then others will understand the design choices and, if we are lucky, join us
> > and help export the btrfs_inode.
> I didn't hide anything. I actually spelled this out in the comments; this
> is what I said.

Ah, sorry for overlooking this message!

Thanks,
Fengguang

> In btrfs, metadata is stored in btree_inode. Ideally, if we could hook
> the inode to a fd so we could use existing syscalls (readahead, mincore
> or upcoming fincore) to do readahead, but the inode is hidden, there is
> no easy way for this from my understanding.
> 
> 
> Thanks,
> Shaohua
> > > > > > > this. One is like the readahead syscall, the other is like the mincore/fincore
> > > > > > > syscall.
> > > > > > >   On a hard disk based netbook running MeeGo, the metadata readahead
> > > > > > > reduced boot time by about 3.5s on average out of a total of 16s.
> > > > > > >   Last time I posted similar patches to the btrfs mailing list, adding the
> > > > > > > new ioctls in btrfs specific ioctl code, but Christoph Hellwig asked that we
> > > > > > > have a generic interface so other filesystems can share some
> > > > > > > code, so I came up with the new one. Comments and suggestions are
> > > > > > > welcome!
> > > > > > >
> > > > > > > v1->v2:
> > > > > > > 1. Added more comments and fix return values suggested by Andrew Morton
> > > > > > > 2. fix a race condition pointed out by Yan Zheng
> > > > > > >
> > > > > > > initial post:
> > > > > > > http://marc.info/?l=linux-fsdevel&m=129222493406353&w=2
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Shaohua
> > > > > > >
> > > > > > > --
> > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > >
> > > > >
> > >
> > >
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

