On 06/17/2011 08:44 PM, Coly Li wrote:
> On 2011-06-18 00:01, Bernd Schubert wrote:
>> While creating files in large directories we noticed an endless number
>> of 4K reads, and those reads greatly reduced the file-creation rate
>> shown by bonnie++. While we would expect about 2000 creates/s, we only
>> got about 25 creates/s. Running the benchmarks for a long time improved
>> the numbers, but never above 200 creates/s.
>> It turned out those reads came from directory index block reads, and
>> the bh cache probably never held all dx blocks. Given the high number
>> of directories we have (8192) and the number of files required to
>> trigger the issue (16 million), cached dx blocks most probably got
>> evicted in favour of other, less important blocks.
>> The patch below implements read-ahead for *all* dx blocks of a
>> directory as soon as a single dx block is missing from the cache. That
>> also helps the LRU to keep the important dx blocks cached.
>>
>> Unfortunately, it also has a performance trade-off for the first access
>> to a directory, even though the READA flag is already set.
>> Therefore, at least for now, this option is disabled by default, but it
>> may be enabled using 'mount -o dx_read_ahead' or 'mount -o dx_read_ahead=1'.
>>
>> Signed-off-by: Bernd Schubert <bernd.schubert@xxxxxxxxxxxxxxxxxx>
>> ---
>
> A question is: are there any performance numbers for dx dir read-ahead?

Well, I have been benchmarking it all week now. But in between bonnie++
and ext4 there is FhGFS... What exactly do you want to know?

> My concern is: if the buffer cache replacement behaviour is not ideal
> and replaces a dx block with other (maybe) hotter blocks, dx dir
> read-ahead will introduce more I/O. In that case, we should focus on
> exploring why dx blocks get evicted from the buffer cache, rather than
> adding dx read-ahead.

I think we have to differentiate between two problems: firstly, getting
all the index blocks into memory at all, and secondly, keeping them
there. Given the high number of index blocks we have, it is not easy to
tell the two apart, and I had to add several printk()s and systemtap
probes to get an idea of why accessing the filesystem was so slow.

>
> [snip]
>
>> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
>> index 6f32da4..78290f0 100644
>> --- a/fs/ext4/namei.c
>> +++ b/fs/ext4/namei.c
>> @@ -334,6 +334,35 @@ struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir,
>>  #endif /* DX_DEBUG */
>>
>>  /*
>> + * Read ahead directory index blocks
>> + */
>> +static void dx_ra_blocks(struct inode *dir, struct dx_entry *entries)
>> +{
>> +	int i, err = 0;
>> +	unsigned num_entries = dx_get_count(entries);
>> +
>> +	if (num_entries < 2 || num_entries > dx_get_limit(entries)) {
>> +		dxtrace(printk("dx read-ahead: invalid number of entries\n"));
>> +		return;
>> +	}
>> +
>> +	dxtrace(printk("dx read-ahead: %d entries in dir-ino %lu\n",
>> +		       num_entries, dir->i_ino));
>> +
>> +	i = 1; /* skip the first entry, it was already read in by the caller */
>> +	do {
>> +		struct dx_entry *entry;
>> +		ext4_lblk_t block;
>> +
>> +		entry = entries + i;
>> +
>> +		block = dx_get_block(entry);
>> +		err = ext4_bread_ra(dir, block);
>> +		i++;
>> +	} while (i < num_entries && !err);
>> +}
>> +
>
> I see synchronous reading here (CMIIW); this is a performance killer.
> An asynchronous background read-ahead would be better.

But isn't it async? Please have a look at the new function
ext4_bread_ra(): after ll_rw_block(READA, 1, &bh) we do not wait for
the buffer to become up-to-date, but return immediately.
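Since that helper is not part of the quoted hunk: stripped down to the
essentials (and modulo the exact error handling), it does roughly this:

static int ext4_bread_ra(struct inode *inode, ext4_lblk_t block)
{
	struct buffer_head *bh;
	int err = 0;

	/* create=0: only map an existing block, never allocate */
	bh = ext4_getblk(NULL, inode, block, 0, &err);
	if (!bh)
		return err;

	if (!buffer_uptodate(bh))
		/* submit the read and return immediately; READA requests
		 * may even be dropped if the queue is congested */
		ll_rw_block(READA, 1, &bh);

	brelse(bh);
	return 0;
}

So the only synchronous work left in the lookup path is mapping the
logical block and getting the buffer head; the disk read itself is
submitted asynchronously.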
I also thought about handing this off to worker threads, but then
figured that this would only add overhead without any real gain, since
the read requests are already submitted asynchronously. I didn't test
or benchmark that variant, though (a rough sketch of what I had in mind
is below).

Thanks for your review!

Cheers,
Bernd
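PS: For reference, the worker-thread variant I considered would have
looked roughly like the following. This is an untested sketch and all
names in it are made up; the block numbers are copied into the work
item so the index buffer does not have to stay pinned while the work
runs:

struct dx_ra_work {
	struct work_struct work;
	struct inode *dir;		/* caller pins it with igrab() */
	unsigned int num_blocks;
	ext4_lblk_t blocks[0];		/* dx block numbers to read */
};

static void dx_ra_worker(struct work_struct *work)
{
	struct dx_ra_work *ra = container_of(work, struct dx_ra_work, work);
	unsigned int i;

	/* submit read-ahead for all dx blocks, stop on the first error */
	for (i = 0; i < ra->num_blocks; i++)
		if (ext4_bread_ra(ra->dir, ra->blocks[i]))
			break;

	iput(ra->dir);
	kfree(ra);
}

The caller would kmalloc() the work item with GFP_NOFS, copy the block
numbers out of the dx entries, INIT_WORK() it with dx_ra_worker() and
schedule_work() it. But since ll_rw_block(READA, ...) returns
immediately anyway, all this would buy is moving a few ext4_getblk()
lookups out of the create path, at the cost of an allocation and a
context switch per directory - which is why I dropped the idea.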