Ted,
sorry for my late reply and thanks a lot for your help!
On 07/18/2011 02:23 AM, Ted Ts'o wrote:
> On Jul 16, 2011, at 9:02 PM, Bernd Schubert wrote:
>> I don't understand it either yet why we have so many, but each directory
>> has about 20 to 30 index blocks
> OK, I think I know what's going on. Those aren't 20-30 index blocks;
> those are 20-30 leaf blocks. Your directories are approximately
> 80-120k each, right?
Yes, you are right. For example:
drwxr-xr-x 2 root root 102400 Jul 18 13:39 FFB
(102400 bytes, i.e. 25 directory blocks of 4 KiB.)
I also uploaded the debugfs htree output to
http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/ext4/htree_dump.bz2
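(For reference, that dump should be reproducible with debugfs'
htree_dump command, i.e. something along the lines of
debugfs -R "htree_dump /path/to/FFB" <device> - with <device> just
being a placeholder here.)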
> So what your patch is doing is constantly doing readahead to bring the
> *entire* directory into the buffer cache any time you do a dx_probe.
> That's definitely not what we would want to enable by default, but I
> really don't like the idea of adding Yet Another Mount option. It
> expands our testing effort, and the reality is very few people will
> take advantage of the mount option.
>
> How about this? What if we don't actually perform readahead, but
> instead try to look up all of the blocks to see if they are in the
> buffer cache using sb_find_get_block(). If it is in the buffer
In principle that should be mostly fine. We could read all directories
on starting up our application and those pages would then be kept in
the cache. While our main concern right now is the metadata server,
where that patch would help and where we will also change the
on-disk layout to entirely work around that issue, the issue also
affects storage servers. On those we are not sure whether the patch
would help, as real data pages might push those directory pages out of
the cache.
Also interesting is that the whole issue might easily explain metadata
issues I experienced in the past with Lustre on systems with lots of
files per OST - Lustre and FhGFS have a rather similar on-disk layout
for data files and so should suffer from similar underlying storage
issues. We have been discussing for some time whether we could change
that layout for FhGFS, but that would bring up other, even more
critical problems...
> cache, it will get touched, so it will be less likely to be evicted
> from the page cache. So for a workload like yours, it should do what
> you want. But it won't cause all of the pages to get pulled in after
> the first reference of the directory in question.
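If I understand you correctly, for every leaf block referenced by the
dx index we would then do something like the following (rough and
untested sketch, the helper name is made up):

/* Hypothetical helper, only to illustrate the sb_find_get_block()
 * idea: touch a directory block if, and only if, it is already cached. */
static void ext4_dir_touch_block(struct inode *dir, ext4_lblk_t lblk)
{
	struct ext4_map_blocks map;
	struct buffer_head *bh;

	/* sb_find_get_block() takes the physical block number, so the
	 * logical directory block has to be mapped first. */
	map.m_lblk = lblk;
	map.m_len = 1;
	if (ext4_map_blocks(NULL, dir, &map, 0) <= 0)
		return;

	/* Buffer cache lookup only, no I/O is started here. */
	bh = sb_find_get_block(dir->i_sb, map.m_pblk);
	if (!bh)
		return;

	if (buffer_uptodate(bh))
		touch_buffer(bh);	/* less likely to be evicted */
	brelse(bh);
}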
I think we would still need to map the ext4 block to the real block for
sb_find_get_block(). So what about keeping most of the existing patch,
but updating ext4_bread_ra() to:
+/*
+ * Read-ahead blocks
+ */
+int ext4_bread_ra(struct inode *inode, ext4_lblk_t block)
+{
+	struct buffer_head *bh;
+	int err;
+
+	/* ext4_getblk() maps the logical block for us; with create == 0
+	 * it only looks the buffer up, no I/O is started here. */
+	bh = ext4_getblk(NULL, inode, block, 0, &err);
+	if (!bh)
+		return -1;
+
+	if (buffer_uptodate(bh))
+		touch_buffer(bh);	/* patch update here */
+
+	brelse(bh);
+	return 0;
+}
+
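The caller side would then stay roughly as in the current patch, i.e.
dx_probe() walks the dx entries of the index block it just read and
calls the function above for every referenced block - along the lines
of (simplified, not the exact hunk from the patch):

	/* Sketch only: 'entries' and 'count' (from dx_get_count()) refer
	 * to the dx entries of the index block that was just read. */
	unsigned int i;

	for (i = 0; i < count; i++)
		ext4_bread_ra(dir, dx_get_block(entries + i));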
> I'm still worried about the case of a very large directory (say an
> unreaped tmp directory that has grown to be tens of megabytes). If a
> program does a sequential scan through the directory doing a
> "readdir+stat" (i.e., for example a tmp cleaner or someone running the
> command "ls -sF"), we probably shouldn't be trying to keep all of those
> directory blocks in memory. So if a sequential scan is detected, that
> should probably suppress the calls to sb_find_get_block().
Do you have a suggestion how to detect that? Set a flag on the
directory inode during readdir, and skip the touch_buffer(bh) while
that flag is set? That would mean adding flags about the readdir to
struct file (->private_data) and struct inode (->i_private) and
removing them again when the file is closed.
It would be better if a readdirplus syscall existed, which ls would use
and which would set the intent itself... But even if we added that, it
would take ages until all user space programs made use of it.
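To make that a bit more concrete, I could imagine something like the
following (completely untested sketch; EXT4_STATE_DIR_SCAN would be a
new, made-up inode state flag, used here instead of i_private):

	/* In ext4_readdir(): mark the directory as being sequentially
	 * scanned (EXT4_STATE_DIR_SCAN is a hypothetical new flag). */
	ext4_set_inode_state(inode, EXT4_STATE_DIR_SCAN);

	/* In ext4_release_dir(): forget about the scan again. */
	ext4_clear_inode_state(inode, EXT4_STATE_DIR_SCAN);

	/* In ext4_bread_ra(): do not pin blocks while a scan runs. */
	if (buffer_uptodate(bh) &&
	    !ext4_test_inode_state(inode, EXT4_STATE_DIR_SCAN))
		touch_buffer(bh);

With several processes reading the same directory the flag would of
course be cleared too early, so this is really only meant to illustrate
the idea.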
Thanks,
Bernd