On 19/05/2021 15.57, Dave Chinner wrote:
On Wed, May 19, 2021 at 11:00:03AM +0300, Avi Kivity wrote:
On 18/05/2021 02.22, Dave Chinner wrote:
What I'd like to do is remove the fanout directories, so that for each logical
"volume"[*] I have a single directory with all the files in it. But that
means sticking massive amounts of entries into a single directory and hoping
it (a) isn't too slow and (b) doesn't hit the capacity limit.
Note that if you use a single directory, you are effectively single
threading modifications to your file index. You still need to use
fanout directories if you want concurrency during modification for
the cachefiles index, but that's a different design criterion
compared to directory capacity and modification/lookup scalability.
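
To make the fanout idea concrete, here is a minimal sketch of mapping a file name to one of a fixed number of subdirectories, so that modifications spread over several directories instead of serialising on one. The bucket count (256), the use of SHA-1, and the function name fanout_path are all assumptions for illustration. Nothing here is taken from cachefiles itself.

```python
import hashlib
from pathlib import Path

FANOUT = 256  # hypothetical number of fanout subdirectories


def fanout_path(root: Path, name: str, fanout: int = FANOUT) -> Path:
    """Map a file name to root/<bucket>/<name> using a stable hash.

    The same name always lands in the same bucket, so lookups need no
    extra index; different names spread over `fanout` directories.
    """
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % fanout
    return root / f"{bucket:02x}" / name
```

With this layout, two writers touching names that hash to different buckets modify different directories. The trade-off is one extra path component per lookup.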
Something that hit us with single-large-directory and XFS is that
XFS will allocate all files in a directory using the same
allocation group. If your entire filesystem is just for that one
directory, then that allocation group will be contended.
There is more than one concurrency problem that can arise from using
single large directories. Allocation policy is just another aspect
of the concurrency picture.
Indeed, you can avoid this specific problem simply by using the
inode32 allocator - this policy round-robins files across allocation
groups instead of trying to keep files physically local to their
parent directory. Hence if you just want one big directory with lots
of files that index lots of data, using the inode32 allocator will
allow the files in the filesystem to allocate/free space at maximum
concurrency at all times...
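
The inode32 policy is selected with the XFS mount option of the same name (for example "mount -o inode32"). As a rough sketch of how an application could check which policy a given mount is using, the helper below parses text in /proc/mounts format. The function name xfs_inode_policy is hypothetical, and it assumes the kernel lists inode32 or inode64 among the reported mount options, which may not hold on every kernel version.

```python
from typing import Optional


def xfs_inode_policy(mounts_text: str, mount_point: str) -> Optional[str]:
    """Return 'inode32' or 'inode64' for an XFS mount, else None.

    `mounts_text` is expected to be in /proc/mounts format:
    device, mount point, fs type, comma-separated options, ...
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, target, fstype, options = fields[:4]
        if target != mount_point or fstype != "xfs":
            continue
        for opt in options.split(","):
            if opt in ("inode32", "inode64"):
                return opt
    return None


# Example with a made-up mount table line (assumption, not real output):
sample = "/dev/sdb1 /data xfs rw,relatime,attr2,inode32,noquota 0 0\n"
policy = xfs_inode_policy(sample, "/data")
```

On a live system the text would come from reading /proc/mounts. A result of None only means the option was not reported for that mount point.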
Perhaps a directory attribute can be useful in case the filesystem is
created independently of the application (say by the OS installer).
We saw spurious ENOSPC when that happened, though that
may have been related to bad O_DIRECT management by us.
You should not see spurious ENOSPC at all.
The only time I recall this sort of thing occurring is when large
extent size hints are abused by applying them to every single file
and allocation regardless of whether they are needed, whilst
simultaneously mixing long term and short term data in the same
physical locality.
Yes, you remember well.
Over time the repeated removal and reallocation
of short term data amongst long term data fragments the crap out of
free space until there are no large contiguous free spaces left to
allocate contiguous extents from.
We ended up creating files in a temporary directory and moving them to the
main directory, since for us the directory layout was mandated by
compatibility concerns.
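
A minimal sketch of that create-then-move workaround follows. The function name create_via_staging and the directory arguments are hypothetical, and it assumes the staging directory is on the same filesystem as the main directory, since rename does not work across filesystems. How many staging directories we used, and how they were chosen, is not shown here.

```python
import os
import tempfile


def create_via_staging(staging_dir: str, final_dir: str,
                       name: str, data: bytes) -> str:
    """Create a file in staging_dir, then rename it into final_dir.

    The file is created (and its inode allocated) under staging_dir,
    so the parent directory at creation time is not final_dir.
    """
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure data is on disk before the move
        final_path = os.path.join(final_dir, name)
        os.rename(tmp_path, final_path)  # same-filesystem move
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

The visible directory layout is unchanged, which is what mattered for compatibility; only the directory in which the file is first created differs.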
inode32 would have done effectively the same thing but without
needing to change the application....
It would not have helped the installed base.
We are now happy with XFS large-directory management, but are nowhere close
to a million files.
I think you are conflating directory scalability with problems
arising from file allocation policies not being ideal for your data
set organisation, layout and longevity characteristics.
Probably, but these problems can happen to others using large
directories. The XFS list can be very helpful in resolving them, but
it is better to be warned ahead of time.