[ ... whether 1 billion 7KB (average) records are best stored in a
database or one per file in a file system ... ]

>>> One thing that you can do when doing bulk loads of files
>>> (say, during a restore or migration), is to use a two phase
>>> write. First, write each of a batch of files (say 1000 files
>>> at a time), then go back and reopen/fsync/close them.

>> Why not just restore a database?

> If you started with a database, that would be reasonable. If
> you started with a file system, I guess I don't understand
> what you are suggesting.

Well, the topic of this discussion is whether one *should* start
with a database for the "lots of small records" case. It is not a
new topic by any means -- there have been many debates in the past
as to how silly it is to have immense file-per-message news/mail
spool archives with lots of little files. The outcome has always
been to store them in databases of one sort or another.

>>>>> One layout for directories that works well with this kind
>>>>> of thing is a time based one (say YEAR/MONTH/DAY/HOUR/MIN
>>>>> where MIN might be 0, 5, 10, ..., 55 for example).

>>> As to the problem above and this kind of solution, I reckon
>>> that it is utterly absurd (and I could have used much
>>> stronger words).
>>> When you deal with systems that store millions of files,

>> Millions of files may work; but 1 billion is an utter
>> absurdity. A filesystem that can store reasonably 1 billion
>> small files in 7TB is an unsolved research issue ...
>> [ ... and fsck ... ]

> Strangely enough, I have been testing ext4 and stopped filling
> it at a bit over 1 billion 20KB files on Monday (with 60TB of
> storage).

Is that a *reasonable* use of a filesystem? Have you compared it
with storing 1 billion 20KB records in a simple database?

As an aside, 20KB is no longer quite in the "small files" range.
For example, one stupid aspect of storing records as "small files"
is the enormous internal fragmentation caused by the usual 4KiB
allocation granularity, which also swells the space used. Even for
the original problem, which was about:

> ~1000.000.000 files (1-30k)
> ~7TB in total

that presumably means lots of files under 4KiB, if the average file
size is 7KB in a 1-30KB range. Also, looking at my humble home
system, at the root filesystem and a media (RPMs, TARs, ZIPs, JPGs,
ISOs, ...) archival filesystem (both JFS):

base# df / /fs/basho
Filesystem           1M-blocks      Used Available Use% Mounted on
/dev/sdb1                11902      9712      2191  82% /
/dev/sda8               238426    228853      9573  96% /fs/basho
base# df -i / /fs/basho
Filesystem             Inodes    IUsed    IFree IUse% Mounted on
/dev/sdb1             4873024   359964  4513060    8% /
/dev/sda8            19738976   126493 19612483    1% /fs/basho

I see that files under 4K are the vast majority on one and a large
majority on the other:

base# find / -xdev -type f -size -4000 | wc -l
305064
base# find /fs/basho -xdev -type f -size -4000 | wc -l
107255

Anyhow, while some people do make (because they do "work")
filesystems with millions and even billions of inodes and/or 60TB
capacities (on 60+1 RAID5s sometimes), the question is whether that
makes sense or is an absurdity, both on its own merits and when
compared to a database. That something stupid can be done is not an
argument for doing it.

The arguments I referred to in my original comments show just how
expensive it is to misuse a directory hierarchy in a filesystem as
if it were an index in a database, by comparing them:

"I have a little script, the job of which is to create a lot of
very small files (~1 million files, typically ~50-100 bytes each)."
"It's a bit of a one-off (or twice, maybe) script, and currently due to finish in about 15 hours," "creates a Berkeley DB database of K records of random length varying between I and J bytes," "So, we got 130MiB of disc space used in a single file, >2500 records sustained per second inserted over 6 minutes and a half," Perhaps 50-100 bytes is a bit extreme, but still compare "due to finish in about 15 hours" with "6 minutes and a half". Now, in that case a large part of the speedup is that the records were small enough that 1m of them as a database would fit into memory (that BTW was part of the point why using a filesystem for that was utterly absurd). I'd rather not do a test with 1G 6-7KB records on my (fairly standard, small, 2GHz PCU, 2GiB RAM) home PC, but 1M 6-7KB records is of course feasible, and on a single modern disk with 1 TB (and a slightly prettified updated script using BTREE) I get (1M records with a 12 byte key, record length random between 2000 and 10000 bytes): base# rm manyt.db base# time perl manymake.pl manyt.db 1000000 2000 10000 1 percent done, 990000 to go 2 percent done, 980000 to go 3 percent done, 970000 to go .... 98 percent done, 20000 to go 99 percent done, 10000 to go 100 percent done, 0 to go real 81m6.812s user 0m29.957s sys 0m30.124s base# ls -ld manyt.db -rw------- 1 root root 8108961792 Sep 19 20:36 manyt.db The creation script flushes every 1% too, but from the pathetic peak 3-4MB/s write rate it is pretty obvious that on my system things don't get cached a lot (by design...). As to reading, 10000 records at random among those 1M: base# time perl manyseek.pl manyt.db 1000000 10000 1 percent done, 9900 to go 2 percent done, 9800 to go 3 percent done, 9700 to go .... 98 percent done, 200 to go 99 percent done, 100 to go 100 percent done, 0 to go average length: 5984.4108 real 7m22.016s user 0m0.210s sys 0m0.442s That is on the slower half of a 1T drive in a half empty JFS filesystem. That's 200/s 6KB average records inserted, and about 22/s looked up, which is about as good as the drive can do, all in a single 8GB file. Sure, a lot slower than 50-100 bytes as it can no longer much fit into memory, but still way off "due to finish in about 15 hours". Sure the system I used for the new test is a bit faster than the one used for the "in about 15 hours" test, but we are still talking one arm, which is largely the bottleneck. But wait -- I am JOKING. because it is ridiculous to load a 1M record dataset into an indexed database one record at a time. Sure it is *possible*, but any sensible database has a bulk loader that builds the index after loading the data. So in any reasonable scenario the difference when *restoring* a backedup filesystem will be rather bigger than for the scenario above. Sure, some file systems have 'dump' like tools that help, but they don't recreate a nice index, they just restore it. Ah well. Now let's see a much bigger scale test: > [ ... ] testing ext4 and stopped filling it at a bit over 1 > billion 20KB files on Monday (with 60TB of storage). Running > fsck on it took only 2.4 hours. [ ... ] > [ ... ] 20KB files written to ext4 run at around 3,000 > files/sec. It took us about 4 days to fill it to 1 billion > files [ ... ] That sounds like you did use 'fsync' per file or something similar, as you had written: >>>> If you are writing to a local S-ATA disk, ext3/4 can write a >>>> few thousand files/sec without doing any fsync() operations. >>>> With fsync(), you will drop down quite a lot. 
and here you report around 3,000/s over a 60TB array. Then 20KB x
3,000/s is 60MB/s -- a rather unimpressive score for a 60TB
filesystem (presumably spread over 60 drives or more), even with
'fsync'. And the creation rate itself works out to about 50
records/s per drive. That is rather disappointing. Yes, they are
larger files, but that should not cause that much slowdown.

Also, the storage layout is not declared (except that you are
storing 20TB of data on 60TB of drives, which is a bit of a cheat),
and it would also be quite interesting to see the output of that
'fsck' run:

> and 2.4 hours to fsck.

But that is an unreasonable test, even if it is the type of test
popular with some file system designers, precisely because of
this: testing file system performance just after loading is a
naive or cheating exercise, especially with 'ext4' (and 'ext3'),
as after loading all those inodes and files are going to be nearly
optimally laid out (e.g. the list of inode numbers in a directory
pretty much sequential), and with 'ext4' each file will (hopefully)
consist of a single extent, so less metadata. But a filesystem that
simulates a simple small-object database will as a rule not be so
lucky; it will grow and be modified.

Even worse, 'fsck' on a filesystem *without damage* is just an
exercise in enumerating inodes and other metadata. What is
interesting is what happens when there is damage and 'fsck' has to
start cross-correlating metadata.

So here are some more realistic 'fsck' reports from other
filesystems and other times, which should be very familiar to those
considering utterly absurd designs:

http://ukai.org/b/log/debian/snapshot

"long fsck on disks for old snapshot.debian.net is completed
today. It takes 75 days!"

"It still fsck for a month....
root 6235 36.1 59.7 1080080 307808 pts/2 D+ Jun21 15911:50 fsck.ext3 /dev/md5"

That was I think before some improvements to 'ext3' checking.

http://groups.google.com/group/linux.debian.ports.x86-64/msg/fd2b4d46a4c294b5

"Keep in mind if you go with XFS, you're going to need 10-15 gig
of memory or swap space to fsck 6tb.. it needs about 9 gig to
xfs_check, and 3 gig to xfs_repair a 4tb array on one of my
systems.. oh, and a couple days to do either. :)"

"> Generally, IMHO no. A fsck will cost a lot of time with
> all filesystems. Some worse than others though..

looks like this 4tb is going to take 3 weeks.. it took about
3-4 hours on ext3.. If i had a couple gig of ram to put in the
server that'd probably help though, as it's constantly swapping
out a few meg a second."

http://lists.us.dell.com/pipermail/linux-poweredge/2007-November/033821.html

"> I'll definitely be considering that, as I already had to
> wait hours for fsck to run on some 2 to 3TB ext3
> filesystems after crashes. I know it can be disabled, but
> I do feel better forcing a complete check after a system
> crash, especially if the filesystem had been mounted for
> very long, like a year or so, and heavily used.

The decision process for using ext3 on large volumes is simple:
Can you accept downtimes measured in hours (or days) due to fsck?
No - don't use ext3."

http://www.mysqlperformanceblog.com/2006/10/08/small-things-are-better/

"Yesterday I had fun time repairing 1.5Tb ext3 partition,
containing many millions of files. Of course it should have never
happened - this was decent PowerEdge 2850 box with RAID volume,
ECC memory and reliable CentOS 4.4 distribution but still it did.
We had "journal failed" message in kernel log and filesystem
needed to be checked and repaired even though it is journaling
file system which should not need checks in normal use, even in
case of power failures. Checking and repairing took many hours
especially as automatic check on boot failed and had to be
manually restarted."

Another factor is just how "complicated" the filesystem is; for
example 'fsck' times with large numbers of hard links can be very
bad (and there are quite a few use cases like 'rdiff-backup').

Also, what about the few numbers you mention above? The 2.4 hours
for 1 billion files means about 110K inodes examined per second.
Now 60TB probably means something like 60 1TB drives to store 20TB
of data, a pretty large degree of parallelism. Ts'o reports:

http://thunk.org/tytso/blog/2008/08/08/fast-ext4-fsck-times/

which shows that on a single (laptop) drive an 800K-inode/90GB
'ext4' filesystem could be checked in 63s, or around 12K inodes/s
on one drive -- not the less than 2K inodes/s per drive seen above.
There seems to be a scalability problem -- but of course: one of
the "unsolved research issue"s is that while read/write/etc. can be
parallelized (for large files) by using wide RAIDs, it is not so
easy to parallelize 'fsck' (except by using multiple, mostly
independent filesystems).

[ ... ]

> The use case for big file systems with lots of small files (at
> least the one that I know of) is for object based file systems
> where files usually have odd, non-humanly generated file names
> (think guids with time stamps and digital signatures).
> These are pretty trivial to map into the time based directory
> scheme I mentioned before.

And it is utterly absurd to do so (see below).

> [ ... ] benchmarked both large DB instances and large file
> systems. Good use cases exist for both, but the facts do not
> back up your DB is the only solution proposal :-)

Sure, large filesystems (up to a point, which for me is the single
digit TB range) with large files have their place, even if people
seem to prefer metafilesystems like Lustre even for those, for
good reasons. But the discussion is whether it makes sense, for a
case like 1G records averaging about 7KB, to use a filesystem with
200K directories of 5K files each (or something similar), one file
per record, or a database with a nice overall index and a single
file or a few files for all records.

Your facts above show that it is *possible* to create a similar
(1G x 20KB records) filesystem, and that it seems to make rather
poor use of a very large storage system. The facts that I referred
to in my original comment show that there is a VERY LARGE
performance difference between using a filesystem and using a
database as a (very) small-record store for just 1M records, and a
PRETTY LARGE difference even for 6KB records -- and that even when
doing something rather stupid (one-record-at-a-time insertion) on
the database side.

In the end the facts just confirm the overall discussion that I
referred to in my original comment:

http://lists.gllug.org.uk/pipermail/gllug/2005-October/055445.html

"* The size of the tree will be around 1M filesystem blocks on
most filesystems, whose block size usually defaults to 4KiB, for a
total of around 4GiB, or can be set as low as 512B, for a total of
around 0.5GiB.

* With 1,000,000 files and a fanout of 50, we need 20,000
directories above them, 400 above those and 8 above those. So 3
directory opens/reads every time a file has to be accessed, in
addition to opening and reading the file.

* Each file access will involve therefore four inode accesses and
four filesystem block accesses, probably rather widely scattered.
Depending on the size of the filesystem block and whether the
inode is contiguous to the body of the file this can involve
anything between 32KiB and 2KiB of logical IO per file access.

* It is likely that of the logical IOs those relating to the two
top levels (those comprising 8 and 400 directories) of the subtree
will be avoided by caching between 200KiB and 1.6MiB, but the
other two levels, the 20,000 bottom directories and the 1,000,000
leaf files, won't likely be cached."

These are pretty elementary considerations, and they boil down to
the issue of whether, for a given dataset of "small" records, the
best index structure is a tree of directories or a nicely balanced
index tree, and whether the "small" records should be at most one
per (usually 4KiB) block or can share blocks; and there is little
doubt that the latter wins pretty big in both cases.

Your proposed directory-based index "YEAR/MONTH/DAY/HOUR/MIN"
seems to me particularly inane, as it has a *fixed fanout* -- 12 at
the "MONTH" level, around 30 at the "DAY" level, 24 at the "HOUR"
level, and 60 at the "MIN" level -- with no balancing. That is fine
only if the record creation rate is constant, and perhaps not even
then: it involves around 500K "MIN" directories per year (taking
one directory per minute, 365 x 24 x 60). If we create 1G files
per year we get around 2K files per "MIN" directory, each of which
is then likely to be a few 4KiB blocks long. Fabulous :-). Sure, it
is a *doable* structure, but it is not a *reasonable* one,
especially if one knows the better alternative.

Overall the data and arguments above suggest that:

* Large filesystems (2-digit TB and larger) should usually be
avoided.

* Filesystems with large numbers (more than a few million) of
files, even large files, should be avoided.

* Large filesystems with a large number of small (around 4KiB)
inodes (not just files) are utterly absurd on their own merits,
and even more so when compared with a database.

* Two big issues are that while parallel storage scales up data
performance, it does not do so well with metadata, and that in
particular metadata crawls such as 'fsck' are hard to parallelize
(they are hard even when they in effect resolve to just
mostly-linear scans).

* If one *has* to have any of the above, separate filesystems,
and/or filesystems based on a database-like design (e.g. based on
indices throughout, like HFS+ or Reiser3, or to some degree JFS
and even XFS) may be the lesser evils, even if they have some
limitations. But that is still fairly crazy. 'ar' archives, for
one thing, were invented decades ago precisely because lots of
small files and filesystems are a bad combination.

These are conclusions well supported by experiment, data and
simple reasoning, as in the above. I should not have to explain
these pretty obvious points in detail -- that databases are much
better for large collections of small records is not exactly a
recent discovery. Sure, a lot of people "know better" and adopt
what I call the "syntactically valid" approach, where if a
combination is possible then it is fine. Good luck!
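P.S. Since the scripts themselves are not reproduced above, here
is a rough sketch of the kind of loader I mean. This is only an
illustration, not the actual manymake.pl: it assumes Perl's
DB_File tied to a Berkeley DB BTREE, a 12-byte zero-padded decimal
key, and filler bytes for the record bodies, and it flushes and
reports progress every 1% as described above.

  #!/usr/bin/perl
  # Hypothetical sketch of a manymake.pl-style loader: insert NRECORDS
  # records of random length between MINLEN and MAXLEN bytes into a
  # single Berkeley DB BTREE file, one record at a time.
  use strict;
  use warnings;
  use DB_File;

  my ($dbfile, $nrecords, $minlen, $maxlen) = @ARGV;
  die "usage: $0 DBFILE NRECORDS MINLEN MAXLEN\n" unless $maxlen;

  my %db;
  my $dbobj = tie %db, 'DB_File', $dbfile, O_RDWR|O_CREAT, 0600, $DB_BTREE
      or die "cannot open '$dbfile': $!";

  my $step = int($nrecords / 100) || 1;     # report and flush every 1%
  for my $i (1 .. $nrecords) {
      my $len = $minlen + int(rand($maxlen - $minlen + 1));
      my $key = sprintf("%012d", $i);       # 12-byte key, as in the test
      $db{$key} = 'x' x $len;               # record body of the chosen length
      if ($i % $step == 0) {
          $dbobj->sync;                     # flush to disk
          printf "%d percent done, %d to go\n",
              int(100 * $i / $nrecords), $nrecords - $i;
      }
  }
  undef $dbobj;
  untie %db;

Invoked as something like "perl manymake.pl manyt.db 1000000 2000
10000", it produces the kind of run shown above.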
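And a matching sketch of a manyseek.pl-style reader, again only an
illustration under the same assumptions: it looks up a given
number of records at random among the keys written by the loader
and reports the average record length found.

  #!/usr/bin/perl
  # Hypothetical sketch of a manyseek.pl-style reader: look up NLOOKUPS
  # records at random among the NRECORDS keys created by the loader.
  use strict;
  use warnings;
  use DB_File;

  my ($dbfile, $nrecords, $nlookups) = @ARGV;
  die "usage: $0 DBFILE NRECORDS NLOOKUPS\n" unless $nlookups;

  my %db;
  tie %db, 'DB_File', $dbfile, O_RDONLY, 0600, $DB_BTREE
      or die "cannot open '$dbfile': $!";

  my ($found, $total) = (0, 0);
  my $step = int($nlookups / 100) || 1;     # report progress every 1%
  for my $i (1 .. $nlookups) {
      my $key = sprintf("%012d", 1 + int(rand($nrecords)));  # random key
      if (defined(my $rec = $db{$key})) {
          $found++;
          $total += length $rec;
      }
      printf "%d percent done, %d to go\n",
          int(100 * $i / $nlookups), $nlookups - $i
          if $i % $step == 0;
  }
  printf "average length: %.4f\n", $found ? $total / $found : 0;
  untie %db;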