On 09/14/2009 05:08 PM, Peter Grandi wrote:
[ ... ]
Also note that writing 10^9 files at a rate of 10^3/s
takes 10^6 seconds: roughly 11.5 days to populate the
filesystem (or at least that long to restore it from backups).
One thing that you can do when doing bulk loads of files (say,
during a restore or migration) is to use a two-phase
write. First, write all of the files in a batch (say 1000 files
at a time), then go back and reopen/fsync/close each of them.
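A minimal sketch of that two-phase write, in Python and assuming a Linux/POSIX system. The function name write_batch_two_phase and the (name, data) batch format are made up for illustration; the final directory fsync is an extra step not mentioned above, added so that the new directory entries are flushed as well.

```python
import os
import tempfile


def write_batch_two_phase(directory, files):
    """Write a batch of files, then fsync them in a second pass.

    `files` is an iterable of (name, data) pairs, with data as bytes.
    Hypothetical helper; assumes a POSIX system.
    """
    paths = []

    # Phase 1: create and write every file without forcing it to disk.
    for name, data in files:
        path = os.path.join(directory, name)
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)

    # Phase 2: go back and reopen/fsync/close each file.
    for path in paths:
        fd = os.open(path, os.O_WRONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)

    # Extra step (an assumption, not part of the description above):
    # fsync the directory so the new entries themselves are durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        # 1000 files of 20KB each, matching the batch size suggested above.
        batch = [(f"obj-{i:04d}", b"x" * 20 * 1024) for i in range(1000)]
        written = write_batch_two_phase(d, batch)
        print(len(written), "files written and synced")
```

The idea is that by the time phase 2 starts, much of the batch may already have been written back by the kernel, so the individual fsync calls have less to wait for than an fsync issued right after each write.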
Why not just restore a database?
If you started with a database, that would be reasonable. If you started with a
file system, I guess I don't understand what you are suggesting.
One layout for directories that works well with this kind of
thing is a time based one (say YEAR/MONTH/DAY/HOUR/MIN where
MIN might be 0, 5, 10, ..., 55 for example).
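To make the layout concrete, here is a small Python sketch that maps a timestamp to such a directory. The function name time_bucket_dir, the optional root argument, and the zero-padded components are illustrative choices, not something prescribed above.

```python
from datetime import datetime


def time_bucket_dir(ts, root="", bucket_minutes=5):
    """Return the YEAR/MONTH/DAY/HOUR/MIN directory for a timestamp.

    MIN is rounded down to a multiple of `bucket_minutes`
    (0, 5, 10, ..., 55 for the default of 5). Hypothetical helper.
    """
    minute = (ts.minute // bucket_minutes) * bucket_minutes
    parts = [
        f"{ts.year:04d}",
        f"{ts.month:02d}",
        f"{ts.day:02d}",
        f"{ts.hour:02d}",
        f"{minute:02d}",
    ]
    path = "/".join(parts)
    if root:
        return root.rstrip("/") + "/" + path
    return path


if __name__ == "__main__":
    # A file created at 17:08 lands in the 17:05 bucket.
    print(time_bucket_dir(datetime(2009, 9, 14, 17, 8)))
    # prints 2009/09/14/17/05
```

With 5 minute buckets this gives 288 leaf directories per day, so the number of files per directory depends only on the ingest rate.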
As to the problem above and this kind of solution, I reckon that
it is utterly absurd (and I could have used much stronger words).
When you deal with systems that store millions of files,
Millions of files may work; but 1 billion is an utter absurdity.
A filesystem that can reasonably store 1 billion small files in
7TB is an unsolved research issue...
Strangely enough, I have been testing ext4 and stopped filling it at a bit over
1 billion 20KB files on Monday (with 60TB of storage).
Running fsck on it took only 2.4 hours.
The obvious thing to do is to use a database, and there is no
way around this point.
Everything has a use case. I am certainly not an anti-DB person, but your
assertion alone is not convincing.
If one genuinely needs to store a lot of files, why not split
them into many independent filesystems? A single large one is
only needed to allow for hard linking or for having a single large
space pool, and in applications where the directory structure
above makes any kind of sense, neither is usually required.
Splitting a big file system into small ones means that you (the application or
sys admin) must load balance where to put new files instead of having the system
do it for you.
you pretty much always are going to use some kind of made up
directory layout.
The use case for big file systems with lots of small files (at least the one
that I know of) is for object based file systems where files usually have odd,
non-humanly generated file names (think guids with time stamps and digital
signatures).
These are pretty trivial to map into the time based directory scheme I mentioned
before.
File systems are usually used for storing somewhat unstructured
information, not records that can be looked up with a simple
"YEAR/MONTH/DAY/HOUR/MIN" key, which seems very suitable for
something like a simple DBMS.
There is even a tendency to move filesystems into databases, as
they scale a lot better.
And for cases where a filesystem still makes sense I would
rather use, instead of the inane many-level directory structure
above, a file system design with proper tree indexes and perhaps
even one with the ability to store small files into inodes.
[ ... ]
Have you tried to make a production DB with 1 billion records? Or done
experiments with fs vs db schemes?
You can always try to write 1 million files in a single
subdirectory,
Again, I'd rather avoid anything like that.
but if you are writing your own application, using this kind
of scheme is pretty trivial.
And an utter absurdity, for 1 billion files in 200k directories.
Both on its own merits and compared to the OBVIOUS alternative.
If anything, consider the obvious (obvious except to those
who want to use a filesystem as a small record database),
which is 'fsck' time, in particular given the structure of
'ext3' (or 'ext4') metadata.
fsck time has improved quite a lot recently with ext4 (and
with xfs).
How many months do you think a 7TB filesystem with 1 billion
files would take to 'fsck', even with those nice improvements?
20KB files written to ext4 run at around 3,000 files/sec. It took us about 4
days to fill it to 1 billion files and 2.4 hours to fsck.
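As a quick sanity check on the numbers in this thread, the arithmetic can be done in a few lines of Python. The function names are made up for illustration; the inputs are the rates and times quoted above.

```python
SECONDS_PER_DAY = 86400


def fill_days(n_files, files_per_sec):
    """Days needed to write n_files at a steady files_per_sec."""
    return n_files / files_per_sec / SECONDS_PER_DAY


def fsck_files_per_sec(n_files, fsck_hours):
    """Average number of files checked per second during fsck."""
    return n_files / (fsck_hours * 3600)


if __name__ == "__main__":
    # 10^9 files at 10^3 files/sec (the estimate quoted at the top)
    print(f"{fill_days(10**9, 10**3):.1f} days")  # prints 11.6 days
    # 10^9 files at about 3,000 files/sec (the ext4 test)
    print(f"{fill_days(10**9, 3000):.1f} days")   # prints 3.9 days
    # 10^9 files checked in 2.4 hours
    print(f"{fsck_files_per_sec(10**9, 2.4):.0f} files/sec")
```

The second figure lines up with the "about 4 days" measured on the test system.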
Not to be mean, but I have worked in this exact area and have benchmarked both
large DB instances and large file systems. Good use cases exist for both, but
the facts do not back up your "DB is the only solution" proposal :-)
ric
_______________________________________________
Ext3-users mailing list
Ext3-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/ext3-users