Sorry for my English..

> On 19 March 2017, at 3:39, Theodore Ts'o <tytso@xxxxxxx> wrote:
>
> On Sat, Mar 18, 2017 at 08:17:55PM +0300, Alexey Lyashkov wrote:
>>>
>>> That's not true.  We normally only read in one block at a time.  If
>>> there is a hash collision, then we may need to insert into the rbtree
>>> in a subsequent block's worth of dentries to make sure we have all of
>>> the directory entries corresponding to a particular hash value.  I
>>> think you misunderstood the code.
>>
>> As I see it, it's not about hash collisions, but about merging several
>> blocks into the same hash range in an upper-level hash entry.  So if a
>> large hash range was originally assigned to a single block, all of
>> that range will be read into memory in a single step.  With an «aged»
>> directory, where the hash blocks have already been used, it's easy to
>> hit.
>
> If you look at ext4_htree_fill_tree(), we are only iterating over the
> leaf blocks.  We are using a 31-bit hash, where the low-order bit is
> one if there has been a collision.  In that case, we need to read the
> next block to make sure all of the directory entries which have the
> same 31-bit hash are in the rbtree.

It looks like we are saying the same thing in different words.  Based on
the dx code, an upper-level hash block holds records {hash1, block1},
{hash2, block2}, ... so any entries whose hash falls into the range
[hash1, hash2) will live in block1, right?  So the question is how large
that [hash1, hash2) distance may be; it seems that is what you call a
"hash collision".
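
To be sure we mean the same thing, here is roughly the model I have in
mind.  This is a simplified user-space sketch for discussion only, not
the real fs/ext4/namei.c code; the {hash, block} pair mirrors the
on-disk dx_entry layout, but the helper names, error handling and
endianness handling are all illustrative:

/*
 * Simplified model of an htree interior (dx) node: an array of
 * {hash, block} pairs sorted by hash.  Entry i covers the hash range
 * [hash_i, hash_{i+1}) and every leaf entry in that range lives in
 * block_i.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct dx_entry {		/* mirrors the on-disk layout */
	uint32_t hash;		/* low bound of the range; bit 0 = continuation */
	uint32_t block;		/* leaf block holding that range */
};

/* Binary search: find the index entry whose range contains 'hash'.
 * Assumes count >= 1; entries[0].hash acts as "anything below entry 1". */
static const struct dx_entry *
dx_find(const struct dx_entry *entries, size_t count, uint32_t hash)
{
	size_t lo = 0, hi = count - 1;

	while (lo < hi) {
		size_t mid = (lo + hi + 1) / 2;

		if (entries[mid].hash <= hash)
			lo = mid;
		else
			hi = mid - 1;
	}
	return &entries[lo];
}

/* Width of the hash range served by entry i (the "distance" above). */
static uint32_t
dx_range_width(const struct dx_entry *e, size_t i, size_t count)
{
	return (i + 1 < count) ? e[i + 1].hash - e[i].hash : ~0u - e[i].hash;
}

/*
 * When filling the rbtree for readdir, the leaf pointed to by the
 * matching entry is read; if the *next* entry's hash has the low
 * (continuation) bit set, the same 31-bit hash spilled over, so the
 * next leaf block has to be read as well.
 */
static bool dx_need_next_block(const struct dx_entry *next)
{
	return next && (next->hash & 1);
}

If that matches your picture, then my question is only about how wide
the range served by one leaf block can become in an aged directory.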
> You seem very passionate about this.  Is this a problem you've
> personally seen?  If so, can you give me more details about your use
> case, and how you've been running into this issue?  Instead of just
> arguing about it from a largely theoretical perspective?
>

The problem was seen with the Lustre MDT code after a large number of
creates/unlinks, but it was seen only a few times.  Other hits aren't
confirmed.

>> Yes, I expect to have some seek penalty, but my testing says it is
>> too large now.  The directory creation rate started at 80k creates/s
>> and dropped to 20k-30k creates/s once the hash tree extended to
>> level 3.  In the same testing with hard links the create rate dropped
>> only slightly.
>
> So this sounds like it's all about the seek penalty of the _data_
> blocks.

No.

> If you use hard links the creation rate only dropped a little, am I
> understanding you correctly?

Yes and no.  The hard link create rate dropped a little, but the
open()+close tests dropped a lot.  No writes, no data blocks, just
inode allocation (see the small test sketch at the end of this mail).

> (Sorry, your English is a little fractured so I'm having trouble
> parsing the meaning out of your sentences.)

It's my bad :(

>
> So what do you think the creation rate _should_ be?  And where do you
> think the time is going to, if it's not the fact that we have to place
> the data blocks further and further from the directory?  And more
> importantly, what's your proposal for how to "fix" this?
>
>>> As for the other optimizations --- things like allowing parallel
>>> directory modifications, or being able to shrink empty directory
>>> blocks or shorten the tree are all improvements we can make without
>>> impacting the on-disk format.  So they aren't an argument for halting
>>> the submission of the new on-disk format, no?
>>>
>> It's an argument about using this feature.  Yes, we can land it, but
>> it decreases the expected speed in some cases.
>
> But there are cases where today, the workload would simply fail with
> ENOSPC when the directory couldn't grow any farther.  So in those
> cases _maybe_ there is something we could do differently that might
> make things faster, but you have yet to convince me that the
> fundamental fault is one that can only be cured by an on-disk format
> change.  (And if you believe this is to be true, please enlighten us
> on how we can make the on-disk format better!)
>
> Cheers,
>
> - Ted
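
P.S.  To make it clearer what the "open()+close" and "hard link" tests
above are, the measurement is essentially this kind of loop.  This is a
minimal illustrative sketch, not the actual Lustre MDT benchmark; the
program layout, names and defaults are made up:

/*
 * One mode allocates a new inode per file with open(O_CREAT|O_EXCL) +
 * close(), the other only adds hard links to a single existing inode.
 * Nothing is ever written, so only directory updates and inode
 * allocation are exercised.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char target[4096], path[4096];
	long i, count;
	int use_links;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <dir> [count] [create|link]\n",
			argv[0]);
		return 1;
	}
	count = (argc > 2) ? atol(argv[2]) : 1000000;
	use_links = (argc > 3 && strcmp(argv[3], "link") == 0);

	snprintf(target, sizeof(target), "%s/target", argv[1]);
	if (use_links && close(open(target, O_CREAT | O_WRONLY, 0644)) != 0) {
		perror("create link target");	/* one inode shared by all links */
		return 1;
	}

	for (i = 0; i < count; i++) {
		snprintf(path, sizeof(path), "%s/f%ld", argv[1], i);
		if (use_links) {
			if (link(target, path) != 0) {
				perror("link");
				return 1;
			}
		} else {
			int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);

			if (fd < 0) {
				perror("open");
				return 1;
			}
			close(fd);	/* no write(): no data blocks at all */
		}
	}
	return 0;
}

Timing the same number of iterations of each mode against the same
directory as the htree grows is how the rates above were compared.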