Re: [PATCH 2/3] ext4 directory index: read-ahead blocks v2

Ted,

sorry for my late reply and thanks a lot for your help!

On 07/18/2011 02:23 AM, Ted Ts'o wrote:
On Jul 16, 2011, at 9:02 PM, Bernd Schubert wrote:

I don't understand it either yet why we have so many, but each directory
has about 20 to 30 index blocks

OK, I think I know what's going on.  Those aren't 20-30 index blocks;
those are 20-30 leaf blocks.  Your directories are approximately
80-120k, each, right?

Yes, you are right. For example:

drwxr-xr-x 2 root root 102400 Jul 18 13:39 FFB


I also uploaded the debugfs htree output to
http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/ext4/htree_dump.bz2


So what your patch is doing is constantly doing readahead to bring the
*entire* directory into the buffer cache any time you do a dx_probe.
That's definitely not what we would want to enable by default, but I
really don't like the idea of adding Yet Another Mount option.  It
expands our testing effort, and the reality is very few people will
take advantage of the mount option.

How about this?  What if we don't actually perform readahead, but
instead try to look up all of the blocks to see if they are in the
buffer cache using sb_find_get_block().  If a block is in the buffer
cache, it will get touched, so it will be less likely to be evicted
from the page cache.  So for a workload like yours, it should do what
you want.  But it won't cause all of the pages to get pulled in after
the first reference of the directory in question.

In principle that should be mostly fine. We could read all directories
when our application starts up, and those pages would then be kept in
cache. Our main concern right now is the metadata server, where that
patch would help and where we will also change the on-disk layout to
work around the issue entirely; however, the issue also affects the
storage servers. On those we are not sure the patch would help, as real
data pages might push the directory pages out of the cache. Also
interesting is that the whole issue might easily explain metadata
problems I experienced in the past with Lustre on systems with lots of
files per OST - Lustre and FhGFS have a rather similar on-disk layout
for data files and so should suffer from similar underlying storage
issues. We have been discussing for some time whether we could change
that layout for FhGFS, but that would bring up other, even more
critical problems...

I think we would still need to map the ext4 logical block to the physical block for sb_find_get_block(). So what about keeping most of the existing patch, but updating ext4_bread_ra() to:

+/*
+ * Read-ahead blocks
+ */
+int ext4_bread_ra(struct inode *inode, ext4_lblk_t block)
+{
+       struct buffer_head *bh;
+       int err;
+
+       bh = ext4_getblk(NULL, inode, block, 0, &err);
+       if (!bh)
+               return -1;
+
+       if (buffer_uptodate(bh))
+               touch_buffer(bh); /* patch update here */
+
+       brelse(bh);
+       return 0;
+}
+



I'm still worried about the case of a very large directory (say an
unreaped tmp directory that has grown to be tens of megabytes).  If a
program does a sequential scan through the directory doing a
"readdir+stat" (i.e., for example a tmp cleaner or someone running the
command "ls -sF"), we probably shouldn't be trying to keep all of those
directory blocks in memory.  So if a sequential scan is detected, that
should probably suppress the calls to sb_find_get_block().

Do you have a suggestion for how to detect that? Set a flag in the directory inode during a readdir, and skip the touch_buffer(bh) if that flag is set? That would mean adding readdir state to file->private_data and inode->i_private and clearing it when the file is closed. Better would be a readdirplus syscall, which ls would use and which would declare its intent itself... But even if we added that, it would take ages until all user-space programs used it.



Thanks,
Bernd
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

