On Tue, 2011-01-11 at 11:07 +0800, Wu, Fengguang wrote:
> On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote:
> > On Tue, 2011-01-11 at 09:38 +0800, Wu, Fengguang wrote:
> > > On Tue, Jan 11, 2011 at 08:15:19AM +0800, Li, Shaohua wrote:
> > > > On Mon, 2011-01-10 at 22:26 +0800, Wu, Fengguang wrote:
> > > > > Shaohua,
> > > > >
> > > > > On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
> > > > > > Hi,
> > > > > > We have file readahead to do async file reads, but have no metadata
> > > > > > readahead. For a list of files, their metadata is stored in fragmented
> > > > > > disk space and metadata read is a sync operation, which impacts the
> > > > > > efficiency of readahead much. The patches try to add metadata readahead
> > > > > > for btrfs.
> > > > > > In btrfs, metadata is stored in btree_inode. Ideally, if we could hook
> > > > > > the inode to a fd so we could use existing syscalls (readahead, mincore
> > > > > > or upcoming fincore) to do readahead, but the inode is hidden, there is
> > > > > > no easy way for this from my understanding. So we add two ioctls for
> > > > >
> > > > > If that is the main obstacle, why not do straightforward fincore()/
> > > > > fadvise(), and add ioctls to btrfs to export/grab the hidden
> > > > > btree_inode in any form? This will address btrfs' specific issue, and
> > > > > have the benefit of making the VFS part general enough. You know
> > > > > ext2/3/4 already have block_dev ready for metadata readahead.
> > > > I forgot to update this comment. Please see patch 2 and patch 4, both
> > > > incore and readahead need btrfs specific stuff involved, so we can't use
> > > > generic fincore or something.
> > >
> > > You can if you like :)
> > >
> > > - fincore() can return the referenced bit, which is generally
> > > useful information
> > metadata page in ext2/3 doesn't have reference bit set, while btrfs has.
> > we can't blindly filter out such pages with the bit.
>
> block_dev inodes have the accessed bits. Look at the below output.
>
> /dev/sda5 is a mounted ext4 partition. The 'A'/'R' in the
> dump_page_cache lines stand for Active/Referenced.

ext4 already does readahead? Please check other filesystems. Filesystems
use bread-like APIs to read metadata, which definitely don't set the
referenced bit.

> root@bay /home/wfg# echo /dev/sda5 > /debug/tracing/objects/mm/pages/dump-file
> root@bay /home/wfg# cat /debug/tracing/trace
> # tracer: nop
> #
> # TASK-PID CPU# TIMESTAMP FUNCTION
> # | | | | |
> zsh-2950 [003] 879.500764: dump_inode_cache: 0 55643986944 1703936 21879 D___ BLK mount /dev/sda5
> zsh-2950 [003] 879.500774: dump_page_cache: 0 2 ___AR_____P 2 0
> zsh-2950 [003] 879.500776: dump_page_cache: 2 3 ____R_____P 2 0
> zsh-2950 [003] 879.500777: dump_page_cache: 1026 5 ___AR_____P 2 0
> zsh-2950 [003] 879.500778: dump_page_cache: 1031 3 ___A______P 2 0
> zsh-2950 [003] 879.500779: dump_page_cache: 1034 1 ___AR_____P 2 0
> zsh-2950 [003] 879.500780: dump_page_cache: 1035 2 ___A______P 2 0
> zsh-2950 [003] 879.500781: dump_page_cache: 1037 1 ___AR_____P 2 0
> zsh-2950 [003] 879.500782: dump_page_cache: 1038 3 ____R_____P 2 0
> zsh-2950 [003] 879.500782: dump_page_cache: 1041 1 ___A______P 2 0
> zsh-2950 [003] 879.500783: dump_page_cache: 1057 1 ___AR_D___P 2 0
> zsh-2950 [003] 879.500788: dump_page_cache: 1058 6 ___A______P 2 0
> zsh-2950 [003] 879.500788: dump_page_cache: 9249 1 ___AR_____P 2 0
> zsh-2950 [003] 879.500789: dump_page_cache: 524289 1 ____R_____P 2 0
> zsh-2950 [003] 879.500790: dump_page_cache: 524290 2 ___A______P 2 0
> zsh-2950 [003] 879.500790: dump_page_cache: 524292 1 ___AR_____P 2 0
> zsh-2950 [003] 879.500791: dump_page_cache: 524293 1 ___A______P 2 0
> zsh-2950 [003] 879.500796: dump_page_cache: 524294 9 ____R_____P 2 0
> zsh-2950 [003] 879.500797: dump_page_cache: 524303 1 ___A______P 2 0
> zsh-2950 [003] 879.500798: dump_page_cache: 987136 1 ___AR_____P 2 0
> zsh-2950 [003] 879.500798: dump_page_cache: 1048576 1 ____R_____P 2 0
> zsh-2950 [003] 879.500799: dump_page_cache: 1048577 2 ___A______P 2 0
> zsh-2950 [003] 879.500800: dump_page_cache: 1048579 1 ___AR_____P 2 0
> zsh-2950 [003] 879.500801: dump_page_cache: 1048580 5 ___A______P 2 0
> zsh-2950 [003] 879.500802: dump_page_cache: 1048585 1 ___AR_____P 2 0
> zsh-2950 [003] 879.500805: dump_page_cache: 1048586 5 ___A______P 2 0
> zsh-2950 [003] 879.500805: dump_page_cache: 1048591 1 ___AR_____P 2 0
> zsh-2950 [003] 879.500806: dump_page_cache: 1572864 1 ____R_____P 2 0
> zsh-2950 [003] 879.500807: dump_page_cache: 1572865 5 ___A______P 2 0
> zsh-2950 [003] 879.500808: dump_page_cache: 1572870 1 ___AR_____P 2 0
> zsh-2950 [003] 879.500811: dump_page_cache: 1572871 6 ___A______P 2 0
> zsh-2950 [003] 879.500812: dump_page_cache: 1572877 3 ____R_____P 2 0
> zsh-2950 [003] 879.500816: dump_page_cache: 2097153 8 ____R_____P 2 0
> zsh-2950 [003] 879.500817: dump_page_cache: 2097161 1 ___A______P 2 0
> zsh-2950 [003] 879.500818: dump_page_cache: 2097162 4 ____R_____P 2 0
> zsh-2950 [003] 879.500819: dump_page_cache: 6324224 1 ____R_D___P 2 0
> zsh-2950 [003] 879.500820: dump_page_cache: 6324225 3 ___AR_____P 2 0
> zsh-2950 [003] 879.500825: dump_page_cache: 6324228 29 ___A______P 2 0
> zsh-2950 [003] 879.500826: dump_page_cache: 6324257 1 ____R_____P 2 0
> zsh-2950 [003] 879.500828: dump_page_cache: 6324258 4 ___A______P 2 0
> zsh-2950 [003] 879.500830: dump_page_cache: 6324262 11 ____R_____P 2 0
> zsh-2950 [003] 879.500833: dump_page_cache: 6324273 16 ___AR_____P 2 0
> zsh-2950 [003] 879.500833: dump_page_cache: 6324289 1 ___A______P 2 0
> zsh-2950 [003] 879.500834: dump_page_cache: 6324290 2 ___AR_____P 2 0
> zsh-2950 [003] 879.500835: dump_page_cache: 6324292 8 ___A______P 2 0
> zsh-2950 [003] 879.500836: dump_page_cache: 6324300 2 ___AR_____P 2 0
> zsh-2950 [003] 879.500837: dump_page_cache: 6324302 3 ___A______P 2 0
> zsh-2950 [003] 879.500838: dump_page_cache: 6324305 4 ____R_____P 2 0
> zsh-2950 [003] 879.500843: dump_page_cache: 6324309 28 ___AR_____P 2 0
> zsh-2950 [003] 879.500844: dump_page_cache: 6324337 4 ___A______P 2 0
> zsh-2950 [003] 879.500845: dump_page_cache: 6324341 2 ____R_____P 2 0
> zsh-2950 [003] 879.500850: dump_page_cache: 6324343 30 ___AR_____P 2 0
> zsh-2950 [003] 879.500851: dump_page_cache: 6324373 2 ___A______P 2 0
> zsh-2950 [003] 879.500852: dump_page_cache: 6324375 2 ___AR_____P 2 0
> zsh-2950 [003] 879.500853: dump_page_cache: 6324377 9 ___A______P 2 0
> zsh-2950 [003] 879.500854: dump_page_cache: 6324386 2 ___AR_____P 2 0
> zsh-2950 [003] 879.500855: dump_page_cache: 6324388 5 ___A______P 2 0
> zsh-2950 [003] 879.500856: dump_page_cache: 6324393 3 ___AR_____P 2 0
> zsh-2950 [003] 879.500858: dump_page_cache: 6324396 11 ___A______P 2 0
> zsh-2950 [003] 879.500859: dump_page_cache: 6324407 1 ____R_____P 2 0
> zsh-2950 [003] 879.500864: dump_page_cache: 6324408 31 ___AR_____P 2 0
> zsh-2950 [003] 879.500864: dump_page_cache: 6324439 1 ___A______P 2 0
> zsh-2950 [003] 879.500865: dump_page_cache: 6324440 1 ____R_____P 2 0
> zsh-2950 [003] 879.500866: dump_page_cache: 6324441 2 ___A______P 2 0
> zsh-2950 [003] 879.500867: dump_page_cache: 6324443 5 ____R_____P 2 0
> zsh-2950 [003] 879.500872: dump_page_cache: 6324448 26 ___AR_____P 2 0
> zsh-2950 [003] 879.500873: dump_page_cache: 6324474 6 ___A______P 2 0
> zsh-2950 [003] 879.500874: dump_page_cache: 6324480 4 ____R_____P 2 0
> zsh-2950 [003] 879.500879: dump_page_cache: 6324484 28 ___AR_____P 2 0
> zsh-2950 [003] 879.500880: dump_page_cache: 6324512 4 ___A______P 2 0
> zsh-2950 [003] 879.500881: dump_page_cache: 6324516 1 ____R_____P 2 0
> zsh-2950 [003] 879.500881: dump_page_cache: 6324517 1 ___A______P 2 0
> zsh-2950 [003] 879.500882: dump_page_cache: 6324518 2 ___AR_____P 2 0
> zsh-2950 [003] 879.500888: dump_page_cache: 6324520 28 ___A______P 2 0
> zsh-2950 [003] 879.500890: dump_page_cache: 6324548 2 ____R_____P 2 0
>
> > fincore can take a parameter or it returns a bit to distinguish
> > referenced pages, but I don't think it's a good API. This should be
> > transparent to userspace.
>
> Users who care about the "cached" status may well be interested in the
> "active/referenced" status. They are correlated information. fincore()
> won't be a simple replication of mincore() anyway. fincore() has to
> deal with huge sparsely accessed files. The accessed bits of a file
> page are normally more meaningful than the accessed bits of mapped
> (anonymous) pages.

If all filesystems have the bit set, I'll buy in. Otherwise, this isn't
generic enough.

> Another option may be to use the above
> /debug/tracing/objects/mm/pages/dump-file interface.
>
> > > - btrfs_metadata_readahead() can be passed to some (faked)
> > > ->readpages() for use with fadvise.
> > this needs a filesystem specific hook too, the difference is your proposal
> > uses fadvise but I'm using ioctl. There isn't a big difference.
>
> True for btrfs. However they make big differences for other file systems.

Why?

> > BTW, it's hard to hook btrfs_inode to a fd even with an ioctl, at least I
> > didn't find an easy way to do this. It might be possible to do this for
> > example adding a fake device or fake fs (anon_inode doesn't work here,
> > IIRC), which is a bit ugly. Before it's proved generic API can handle
> > metadata readahead, I don't want to do it.
>
> Right, it could be hard to export btrfs_inode. I'm glad you speak it
> out. If we cannot make it, it's valuable to point out the problem and
> let everyone know the root cause we turn to an ioctl based workaround.
> Then others will understand the design choices, and if lucky, join us
> and help export the btrfs_inode.

I didn't hide anything. I actually pointed this out in the comments.
This is what I said:
In btrfs, metadata is stored in btree_inode.
Ideally, if we could hook the inode to a fd so we could use existing
syscalls (readahead, mincore or upcoming fincore) to do readahead, but
the inode is hidden, there is no easy way for this from my understanding.

Thanks,
Shaohua

> > > > > > this. One is like readahead syscall, the other is like mincore/fincore
> > > > > > syscall.
> > > > > > Under a harddisk based netbook with Meego, the metadata readahead
> > > > > > reduced about 3.5s boot time in average from total 16s.
> > > > > > Last time I posted similar patches to btrfs maillist, which adds the
> > > > > > new ioctls in btrfs specific ioctl code. But Christoph Hellwig asks we
> > > > > > have a generic interface to do this so other filesystem can share some
> > > > > > code, so I came up with the new one. Comments and suggestions are
> > > > > > welcome!
> > > > > >
> > > > > > v1->v2:
> > > > > > 1. Added more comments and fix return values suggested by Andrew Morton
> > > > > > 2. fix a race condition pointed out by Yan Zheng
> > > > > >
> > > > > > initial post:
> > > > > > http://marc.info/?l=linux-fsdevel&m=129222493406353&w=2
> > > > > >
> > > > > > Thanks,
> > > > > > Shaohua
> > > > > >
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html
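
For readers following the fd-based alternative discussed in this thread: once an inode is reachable through a file descriptor, the generic readahead path needs no filesystem-specific call. Below is a minimal userspace sketch in Python, assuming a Linux system and an ordinary file. The function name `prefetch` is made up for illustration and is not part of the posted patches; it only shows the fadvise side, not the incore/fincore query.

```python
import os


def prefetch(path):
    """Hint the kernel to start reading `path` into the page cache.

    Returns the number of bytes covered by the hint. The name and the
    return convention are illustrative assumptions, not an existing API.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # POSIX_FADV_WILLNEED is the fadvise counterpart of readahead(2):
        # the kernel queues the reads and the call returns without
        # waiting for the data to arrive.
        os.posix_fadvise(fd, 0, size, os.POSIX_FADV_WILLNEED)
        return size
    finally:
        os.close(fd)
```

This works only because a regular file (or a block device, as in the ext4 trace) already has an fd to hand to the syscall; the hidden btree_inode in btrfs has none, which is the obstacle the ioctl approach works around.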