Re: [RFC][PATCH 3/9 v1] ext4: add physical block and status member into extent status tree

Zheng Liu <gnehzuil.liu@xxxxxxxxx> · Tue, 8 Jan 2013 10:25:43 +0800

On Tue, Jan 08, 2013 at 12:27:54PM +1100, Dave Chinner wrote:
> On Sat, Jan 05, 2013 at 10:44:01AM +0800, Zheng Liu wrote:
> > On Wed, Jan 02, 2013 at 12:22:55PM +0100, Jan Kara wrote:
> > > On Tue 01-01-13 13:16:07, Zheng Liu wrote:
> > > > On Mon, Dec 31, 2012 at 10:49:52PM +0100, Jan Kara wrote:
> > > > > On Mon 24-12-12 15:55:36, Zheng Liu wrote:
> > > > > > From: Zheng Liu <wenqing.lz@xxxxxxxxxx>
> > > > > > 
> > > > > > es_pblk records the physical block on disk that the extent maps to.
> > > > > > es_status records the status of the extent.  Three statuses are
> > > > > > defined: written, unwritten and delayed.
> > > > >   So this means one extent is 48 bytes on 64-bit architectures. If I'm a
> > > > > nasty user and create an artificially fragmented file (by allocating
> > > > > every second block), the extent tree takes 6 MB per GB of file. That's
> > > > > quite a bit and I think you need to provide a way for the kernel to
> > > > > reclaim extent structures...
> > > > 
> > > > Indeed, when a file is heavily fragmented, the status tree will occupy
> > > > a lot of memory.  That is why it is loaded on demand.  When I wrote
> > > > it, there were two ways to load the status tree.  One is loading on
> > > > demand, and the other is loading the complete extent tree in
> > > > ext4_alloc_inode().  In the end I chose the former because it reduces
> > > > memory pressure most of the time.  But it has the disadvantage that
> > > > the status tree cannot be fully trusted, because it does not track the
> > > > complete status of the extent tree on disk.
> > >   Not reading the whole extent tree in ext4_alloc_inode() is a good start
> > > but it's not the whole solution IMHO. It saves us from unnecessary reading
> > > of extents but still if someone reads the whole filesystem (like
> > > grep -R "foo" /) you will still end up with all extents cached. And that
> > > will make ext4 inodes pretty heavy in memory. Surely inode reclaim will
> > > eventually release these inodes including cached extents but it is usually
> > > more beneficial to cache the inode itself than more extents, so allowing
> > > us to strip cached extents without releasing the inode itself would be
> > > good.
> > > 
> > > > I will provide a way to reclaim extent structures from the status tree.
> > > > My current idea is that we can reclaim all extents in WRITTEN/UNWRITTEN
> > > > status, because we always need DELAYED extents in the fiemap,
> > > > seek_data/hole and bigalloc code.  Furthermore, as you said in another
> > > > mail, unwritten extents that are about to be converted to written
> > > > should not be reclaimed either.
> > > > 
> > > > Another question is when these extents should be reclaimed.  Currently
> > > > the whole status tree is reclaimed when clear_inode() is called.  Maybe
> > > > a switch in sysfs is an option.  Any thoughts?
> > >   The natural way to handle the shrinking is to use the 'shrinker'
> > > framework. In this case, we could register a shrinker for shrinking
> > > extents. Just having an LRU of extents would increase the size of the
> > > extent structure by 2 pointers, which is too big I'd think, and I'm not
> > > yet sure how to choose extents for reclaim in some other way. I will
> > > think about it...
> > 
> > Hi Jan,
> > 
> > Sorry for the delay.  The 'shrinker' framework is an option.  We can
> > define a callback function to reclaim extents from the status tree.  When
> > we access an extent in an inode, we move that inode to the tail of an LRU
> > list.  But this approach has the drawback that the spinlock protecting
> > the LRU list will be heavily contended, because all inodes need to take
> > this lock.  I guess this overhead is unacceptable for us.  Any comments?
> 
> Measure it first. There are several filesystem global locks still
> in existence at the VFS level. Solve the simple problem first, and
> then the hard problem might get solved for you by someone else. e.g.:
> 
> http://oss.sgi.com/archives/xfs/2012-11/msg00643.html

Thanks for teaching me. :-)  I will measure its overhead first.

Regards,
                                                - Zheng