On Tue, Jan 08, 2013 at 12:27:54PM +1100, Dave Chinner wrote:
> On Sat, Jan 05, 2013 at 10:44:01AM +0800, Zheng Liu wrote:
> > On Wed, Jan 02, 2013 at 12:22:55PM +0100, Jan Kara wrote:
> > > On Tue 01-01-13 13:16:07, Zheng Liu wrote:
> > > > On Mon, Dec 31, 2012 at 10:49:52PM +0100, Jan Kara wrote:
> > > > > On Mon 24-12-12 15:55:36, Zheng Liu wrote:
> > > > > > From: Zheng Liu <wenqing.lz@xxxxxxxxxx>
> > > > > >
> > > > > > es_pblk is used to record the physical block that maps to the disk. es_status is
> > > > > > used to record the status of the extent. Three statuses are defined:
> > > > > > written, unwritten and delayed.
> > > > > So this means one extent is 48 bytes on 64-bit architectures. If I'm a
> > > > > nasty user and create an artificially fragmented file (by allocating every
> > > > > second block), the extent tree takes 6 MB per GB of file. That's quite a bit
> > > > > and I think you need to provide a way for the kernel to reclaim extent
> > > > > structures...
> > > >
> > > > Indeed, when a file is heavily fragmented, the status tree will occupy
> > > > a lot of memory. That is why it is loaded on demand. When I designed
> > > > it, there were two ways to load the status tree. One is loading
> > > > on demand, and the other is loading the complete extent tree in
> > > > ext4_alloc_inode(). In the end I chose the former because it reduces
> > > > memory pressure most of the time. But it has the disadvantage that
> > > > the status tree cannot be fully trusted, because it does not track the
> > > > complete status of the extent tree on disk.
> > > Not reading the whole extent tree in ext4_alloc_inode() is a good start
> > > but it's not the whole solution IMHO. It saves us from unnecessary reading
> > > of extents but still if someone reads the whole filesystem (like
> > > grep -R "foo" /) you will still end up with all extents cached. And that
> > > will make ext4 inodes pretty heavy in memory. Surely inode reclaim will
> > > eventually release these inodes including cached extents but it is usually
> > > more beneficial to cache the inode itself than more extents so allowing us
> > > to strip cached extents without releasing the inode itself would be good.
> > >
> > > > I will provide a way to reclaim extent structures from the status tree. Now
> > > > I have an idea in mind: we can reclaim all extents that are in
> > > > WRITTEN/UNWRITTEN status, because we always need DELAYED extents in the
> > > > fiemap, seek_data/hole and bigalloc code. Furthermore, as you said in
> > > > another mail, an unwritten extent that will be converted to
> > > > written must not be reclaimed either.
> > > >
> > > > Another question is when these extents should be reclaimed. Currently, when
> > > > clear_inode() is called, the whole status tree is reclaimed. Maybe
> > > > a switch in sysfs is an option. Any thoughts?
> > > The natural way to handle the shrinking is using the 'shrinker' framework. In
> > > this case, we could register a shrinker for shrinking extents. Just having
> > > an LRU of extents would increase the size of the extent structure by 2 pointers
> > > which is too big I'd think and I'm not yet sure how to choose extents for
> > > reclaim in some other way. I will think about it...
> >
> > Hi Jan,
> >
> > Sorry for the delay. The 'shrinker' framework is an option. We can define
> > a callback function to reclaim extents from the status tree. When we access
> > an extent in an inode, we move this inode to the tail of the LRU list.
> > But this approach has a defect: the spinlock that protects the LRU list
> > will be heavily contended because all inodes need to take this lock. I
> > guess this overhead is unacceptable for us. Any comments?
>
> Measure it first. There are several filesystem global locks still
> in existence at the VFS level. Solve the simple problem first, and
> then the hard problem might get solved for you by someone else.
> e.g.:
>
> http://oss.sgi.com/archives/xfs/2012-11/msg00643.html

Thanks for teaching me. :-) I will measure its overhead first.

Regards,
- Zheng
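
For reference, the memory figures quoted in this thread can be checked with a
few lines of arithmetic. This is only a rough sketch, not anything taken from
the patch: the 48-byte extent size comes from Jan's mail, while the 4 KiB block
size, the 8-byte pointer size, and the rule that each isolated allocated block
needs one cached extent (with holes not cached) are assumptions made here. The
helper name extent_cache_bytes() is hypothetical.

```python
# Back-of-the-envelope check of the "6 MB per GB" estimate from the thread.
EXTENT_BYTES = 48   # from the thread: size of one extent on 64-bit
BLOCK_SIZE = 4096   # assumption: 4 KiB filesystem blocks
POINTER_BYTES = 8   # assumption: 64-bit pointers

MIB = 1 << 20
GIB = 1 << 30


def extent_cache_bytes(file_bytes, extent_bytes=EXTENT_BYTES,
                       block_size=BLOCK_SIZE):
    """Memory used by cached extents when every second block is allocated.

    Each allocated block is separated from the next by a hole, so it cannot
    be merged with a neighbour and needs an extent structure of its own.
    """
    blocks = file_bytes // block_size
    extents = blocks // 2  # every second block is allocated
    return extents * extent_bytes


# Worst-case fragmented 1 GiB file with the structure as posted.
plain = extent_cache_bytes(GIB)

# Same file if a per-extent LRU adds two pointers to every structure.
with_lru = extent_cache_bytes(GIB, extent_bytes=EXTENT_BYTES + 2 * POINTER_BYTES)

print(plain / MIB)     # 6.0
print(with_lru / MIB)  # 8.0
```

Under these assumptions, a per-extent LRU would raise the worst case from 6 MiB
to 8 MiB per GiB of file, which is the size increase Jan objects to above.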