On 06/17/2011 09:29 PM, Andreas Dilger wrote:
> On 2011-06-17, at 12:44 PM, Coly Li wrote:
>> On 06/18/2011 00:01, Bernd Schubert wrote:
>>> While creating files in large directories we noticed an endless
>>> number of 4K reads, and those reads very much reduced the file
>>> creation rate as shown by bonnie. While we would expect about
>>> 2000 creates/s, we only got about 25 creates/s. Running the
>>> benchmarks for a long time improved the numbers, but not above
>>> 200 creates/s. It turned out those reads came from directory
>>> index block reads, and probably the bh cache never cached all dx
>>> blocks. Given the high number of directories we have (8192) and
>>> the number of files required to trigger the issue (16 million),
>>> most probably the cached dx blocks got evicted in favour of
>>> other, less important blocks. The patch below implements a
>>> read-ahead for *all* dx blocks of a directory if a single dx
>>> block is missing in the cache. That also helps the LRU to cache
>>> important dx blocks.
>>>
>>> Unfortunately, it also has a performance trade-off for the first
>>> access to a directory, although the READA flag is set already.
>>> Therefore, at least for now, this option is disabled by default,
>>> but may be enabled using 'mount -o dx_read_ahead' or
>>> 'mount -odx_read_ahead=1'.
>>>
>>> Signed-off-by: Bernd Schubert <bernd.schubert@xxxxxxxxxxxxxxxxxx>
>>> ---
>>
>> A question is, are there any performance numbers for dx dir
>> read-ahead? My concern is, if the buffer cache replacement
>> behaviour is not ideal and it replaces a dx block with other
>> (maybe) hotter blocks, dx dir read-ahead will introduce more I/O.
>> In that case we should focus on exploring why the dx block is
>> replaced out of the buffer cache, rather than on using dx
>> read-ahead.
>
> There was an issue we observed in our testing, where the kernel
> per-CPU buffer LRU was too small, and for large htree directories
> the buffer cache was always thrashing. Currently the kernel has:
>
> #define BH_LRU_SIZE	8
>
> but it should be larger (16 improved performance for us by about
> 10%) on a 16-core system in our testing (excerpt below):
>
>> - name find of ext4 will consume about 3 slots
>> - creating the inode will take about 3 slots
>> - name insert of ext4 will consume another 3-4 slots
>> - we also have some attr_set/xattr_set calls, which will access
>>   the LRU as well.
>>
>> So some BHs will be popped out of the LRU before they can be used
>> again; in fact, profiles show __find_get_block_slow() and
>> __find_get_block() are the top time-consuming functions. I tried
>> increasing BH_LRU_SIZE to 16 and saw about an 8% increase in the
>> opencreate+close rate on my branch, so I guess we actually get
>> about a 10% improvement for opencreate (only, no close) just by
>> increasing BH_LRU_SIZE.

I think we need to fix bh_lru_install() first, which uses BH_LRU_SIZE
to allocate memory on the stack. I already played with a higher
BH_LRU_SIZE, but ran into stack memory issues... Shouldn't be too
difficult to fix, though.
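Just to illustrate the direction I am thinking of: bh_lru_install()
currently stages the re-ordered list in a bhs[BH_LRU_SIZE] array on the
stack before copying it back, and that temporary array is what grows
with BH_LRU_SIZE. Below is a rough, compile-untested sketch of how it
might shuffle the per-CPU list in place instead (the helper names
follow my reading of fs/buffer.c; the in-place loop itself is only my
idea, not existing code):

static void bh_lru_install(struct buffer_head *bh)
{
	struct buffer_head *evictee = bh;
	struct bh_lru *lru;
	int i;

	check_irqs_on();
	bh_lru_lock();
	lru = &__get_cpu_var(bh_lrus);

	if (lru->bhs[0] == bh) {	/* already the most recent entry */
		bh_lru_unlock();
		return;
	}

	get_bh(bh);
	/*
	 * Shift the existing entries down by one slot while inserting bh
	 * at the head. If bh was already in the list, stop as soon as we
	 * reach its old slot and drop the duplicate reference below;
	 * otherwise the last entry falls off the end and is released.
	 */
	for (i = 0; i < BH_LRU_SIZE; i++) {
		struct buffer_head *tmp = lru->bhs[i];

		lru->bhs[i] = evictee;
		evictee = tmp;
		if (evictee == bh)
			break;
	}
	bh_lru_unlock();

	if (evictee)
		__brelse(evictee);
}

That would keep the stack usage constant, so we could then experiment
with considerably larger BH_LRU_SIZE values.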
>
>> [snip]
>>> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
>>> index 6f32da4..78290f0 100644
>>> --- a/fs/ext4/namei.c
>>> +++ b/fs/ext4/namei.c
>>> @@ -334,6 +334,35 @@ struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir,
>>>  #endif /* DX_DEBUG */
>>>
>>>  /*
>>> + * Read ahead directory index blocks
>>> + */
>>> +static void dx_ra_blocks(struct inode *dir, struct dx_entry * entries)
>>> +{
>>> +	int i, err = 0;
>>> +	unsigned num_entries = dx_get_count(entries);
>>> +
>>> +	if (num_entries < 2 || num_entries > dx_get_limit(entries)) {
>>> +		dxtrace(printk("dx read-ahead: invalid number of entries\n"));
>>> +		return;
>>> +	}
>>> +
>>> +	dxtrace(printk("dx read-ahead: %d entries in dir-ino %lu \n",
>>> +		       num_entries, dir->i_ino));
>>> +
>>> +	i = 1; /* skip first entry, it was already read in by the caller */
>>> +	do {
>>> +		struct dx_entry *entry;
>>> +		ext4_lblk_t block;
>>> +
>>> +		entry = entries + i;
>>> +
>>> +		block = dx_get_block(entry);
>>> +		err = ext4_bread_ra(dir, dx_get_block(entry));
>>> +		i++;
>>> +	} while (i < num_entries && !err);
>>> +}
>
> Two objections here - this is potentially a LOT of readahead that
> might not be accessed. Why not limit the number of readahead blocks
> to some reasonable amount (e.g. 32 or 64, maybe (BH_LRU_SIZE-1) is
> best to avoid thrashing?) and continue to submit more readahead as
> it traverses the directory.

Do you suggest starting with a small read-ahead and increasing the
number over time? Where would we store the position of the last
read-ahead? Using a fixed value and then always

	min_block = current_dx_block - max_ra / 2
	max_block = current_dx_block + max_ra / 2

(well, something like that...) would be rather easy to add; a rough
sketch of what I mean is at the end of this mail.

> It is also possible to have ext4_map_blocks() map an array of blocks
> at one time, which might improve the efficiency of this code a bit
> (it needs to hold i_data_sem during the mapping, so doing more work
> at once is better).

Hmm, that would be good of course, but will it really work? At first
glance I see that "struct ext4_map_blocks" needs an explanation of its
member names ;) But then I don't see an array there, just a starting
block number (map->m_lblk) and the number of blocks to map
(map->m_len). Doesn't that mean the blocks have to be laid out
linearly, one after the other, on disk? And do dx blocks really have
that layout? From my printks it didn't look that way. It would be
good if we could allocate large directories contiguously from the
start, though (that should help Lustre's object-hash directories as
well). It's a bit late here and I might be missing something, though.

Thanks for your review!

> I also observe some strange inefficiency going on in buffer lookup:
>
>   __getblk()
>     ->__find_get_block()
>       ->lookup_bh_lru()
>       ->__find_get_block_slow()
>
> but if that fails, __getblk() continues on to call:
>
>     ->__getblk_slow()
>       ->unlikely() error message
>       ->__find_get_block()
>         ->lookup_bh_lru()
>         ->__find_get_block_slow()
>       ->grow_buffers()
>
> It appears there is absolutely no benefit to having the initial call
> to __find_get_block() in the first place. The "unlikely() error
> message" is out-of-line and shouldn't impact perf, and the "slow"
> part of __getblk_slow() is skipped if __find_get_block() finds the
> buffer in the first place.
>
> I could see possibly having __getblk->lookup_bh_lru() for the
> CPU-local lookup avoiding some extra function calls (it would also
> need touch_buffer() if it finds it via lookup_bh_lru()).

I will also look into this on Monday. It's too late for me to really
follow you here. Have a nice weekend!
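Before I sign off, here is the rough and compile-untested sketch of the
bounded read-ahead window mentioned above. DX_MAX_RA and the extra
'cur' argument (the index entry the caller is about to follow) are just
placeholders I made up to illustrate the clamping, not a proposal for
the final interface:

/* Placeholder maximum for the read-ahead window; could also be tied
 * to BH_LRU_SIZE as Andreas suggested. */
#define DX_MAX_RA	32

static void dx_ra_blocks(struct inode *dir, struct dx_entry *entries,
			 unsigned cur)
{
	unsigned num_entries = dx_get_count(entries);
	unsigned first, last, i;
	int err = 0;

	if (num_entries < 2 || num_entries > dx_get_limit(entries))
		return;

	/*
	 * Clamp the window [cur - DX_MAX_RA/2, cur + DX_MAX_RA/2] to the
	 * valid range; entries[0] was already read in by the caller, as
	 * in the original patch.
	 */
	first = (cur > DX_MAX_RA / 2) ? cur - DX_MAX_RA / 2 : 1;
	last = min_t(unsigned, cur + DX_MAX_RA / 2, num_entries - 1);

	for (i = first; i <= last && !err; i++) {
		if (i == cur)	/* the caller reads this block itself */
			continue;
		err = ext4_bread_ra(dir, dx_get_block(entries + i));
	}
}

The lookup path could then call this again whenever it moves outside
the previously submitted window, instead of reading all index blocks of
the directory up front.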
Cheers,
Bernd