> On 18 March 2017, at 19:29, Theodore Ts'o <tytso@xxxxxxx> wrote:
>
> On Sat, Mar 18, 2017 at 11:16:26AM +0300, Alexey Lyashkov wrote:
>> Andreas,
>>
>> It is not about a feature flag. It is about the situation as a whole.
>> Yes, we may increase the directory size, but that opens up a number of large problems.
>
>> 1) readdir. It tries to read all entries into memory before sending them
>> to the user. Currently it may eat 20*10^6 * 256 bytes, so several gigs, so
>> increasing the size may cause problems for the system.
>
> That's not true. We normally only read in one block at a time. If there
> is a hash collision, then we may need to insert into the rbtree in a
> subsequent block's worth of dentries to make sure we have all of the
> directory entries corresponding to a particular hash value. I think
> you misunderstood the code.

As I see it, this is not about hash collisions but about several blocks being merged into the same hash range at the upper-level hash entry. So if a large hash range was originally assigned to a single block, that whole range is read into memory in a single step. With an "aged" directory, where the hash blocks are already used, this is easy to hit.

>
>> 2) Inode allocation. The current code tries to allocate an inode as near as possible to the directory inode, but one GD can hold only 32k entries, so a large directory uses up to 1k GDs for now, and more than that after the increase. This increases seek time during file allocation. That is what I meant when I said it would "dramatically decrease the file creation rate".
>
> But there are also only 32k blocks in a group descriptor, and we try
> to keep the blocks allocated close to the inode.

With the bigalloc feature it is not 32k blocks but 32k clusters; with a 1M cluster size (as an example), that is a very large space.

> So if you are using
> a huge directory, and you are using a storage device with a
> significant seek penalty, then yes, no matter what as the directory
> grows, the time to iterate over all of the files does grow.
> But there is more to life than microbenchmarks that create huge numbers
> of zero-length files! If we assume that the files are going to need to
> contain _data_, and the data blocks should be close to the inodes,
> then there are going to be some performance impacts no matter what.
>
Yes, I expect some seek penalty, but my testing says it is too large now. The directory creation rate started at 80k creates/s and dropped to 20k-30k creates/s once the hash tree was extended to level 3. In the same test with hard links, the create rate dropped only slightly.

>> 3) The current limit of 4G inodes: currently 32-128 directories may eat the full inode number space. From that perspective, large directories do not need to be used.
>
> I can imagine a new feature flag which defines the use of a 64-bit inode
> number, but that's more for people who are creating a file system that
> takes advantage of 64-bit block numbers, and they are intending on
> using all of that space to store small (< 4k or < 8k) files.
>
> And it's also true that there are huge advantages to using a
> multi-level directory hierarchy --- e.g.:
>
> .git/objects/03/08e42105258d4e53ffeb81ffb2a4b2480bb8b8
>
> or even
>
> .git/objects/03/08/e42105258d4e53ffeb81ffb2a4b2480bb8b8
>
> instead of:
>
> .git/objects/0308e42105258d4e53ffeb81ffb2a4b2480bb8b8
>
> but that's an application level question. If for some reason some
> silly application programmer wants to have a single gargantuan
> directory, if the patches to support it are fairly simple, even if
> someone is going to give us patches to do something more general,
> burning an extra feature flag doesn't seem like the most terrible
> thing in the world.

On the other side, applications do not expect a directory to be very slow; they expect access at constant or near-constant speed.
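To make the multi-level layout quoted above concrete, here is a minimal sketch of how an application could map an object name to a sharded path. The helper name `sharded_path` and its `levels`/`width` parameters are hypothetical, chosen only for illustration; this is not how git itself is implemented.

```python
from pathlib import PurePosixPath


def sharded_path(root, object_id, levels=1, width=2):
    """Place object_id under `levels` subdirectories, each named by the
    next `width` characters of the id (hypothetical helper)."""
    needed = levels * width
    if len(object_id) <= needed:
        raise ValueError("object_id too short for the requested fan-out")
    # Leading chunks become directory names; the remainder is the file name.
    parts = [object_id[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(root, *parts, object_id[needed:])


oid = "0308e42105258d4e53ffeb81ffb2a4b2480bb8b8"
print(sharded_path(".git/objects", oid, levels=0))  # flat layout
print(sharded_path(".git/objects", oid, levels=1))  # one level
print(sharded_path(".git/objects", oid, levels=2))  # two levels
```

With two hex characters per level, each level gives a fan-out of 256, so a uniformly distributed set of 20*10^6 names would average about 78k entries per leaf directory at one level and about 305 at two levels.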
>
> As for the other optimizations --- things like allowing parallel
> directory modifications, or being able to shrink empty directory
> blocks or shorten the tree are all improvements we can make without
> impacting the on-disk format. So they aren't an argument for halting
> the submission of the new on-disk format, no?
>
It is an argument about using this feature. Yes, we can land it, but it decreases the expected speed in some cases.

Alexey
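
For reference, the figures quoted in this thread can be checked with a few lines of arithmetic. This is only a sketch: it takes the thread's numbers at face value (20*10^6 entries, 256 bytes per entry, 32k inodes per group, a 4G inode number space), and the variable names are illustrative.

```python
ENTRIES = 20 * 10**6           # directory size used in point 1
BYTES_PER_ENTRY = 256          # per-entry size assumed in point 1
INODES_PER_GROUP = 32 * 1024   # "one GD can hold only 32k entries"
INODE_SPACE = 2**32            # the 4G inode number limit from point 3

# Point 1: memory needed if every entry were held at once.
readdir_bytes = ENTRIES * BYTES_PER_ENTRY
print(f"readdir worst case: {readdir_bytes / 2**30:.2f} GiB")

# Point 2: groups needed to hold one inode per entry (ceiling division).
groups_needed = -(-ENTRIES // INODES_PER_GROUP)
print(f"groups spanned by {ENTRIES} inodes: {groups_needed}")

# Point 3: entries per directory if 32 or 128 directories fill the space.
for dirs in (32, 128):
    per_dir = INODE_SPACE // dirs
    print(f"{dirs} directories: {per_dir} entries each, "
          f"{per_dir // INODES_PER_GROUP} groups each")
```

Under these assumptions a directory holding 2^32 / 128 entries spans exactly 1024 groups, which may be where the "1k GD" figure in point 2 comes from.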