[ ... whether datasets like 1G records for a total of 7TB should be
stored as one-record-per-file in a filesystem or as a database ... ]

>>> When you deal with systems that store millions of files,

>> Millions of files may work; but 1 billion is an utter absurdity.
>> A filesystem that can store reasonably 1 billion small files in
>> 7TB is an unsolved research issue...

> I'd disagree. We have Lustre filesystems with 500M files on
> the ext4(ish) metadata server, and these are only 4TB. Note
> there is NO DATA in the metadata files, so it isn't quite like
> a normal filesystem.

That is possible, but to me it seems quite unreasonable. How long
does that take to 'rsync', for example? Or just to back up? What
about doing a 'find'? These are mad things.

This is the special case of an MDS, as you mention, but it is still
fairly dangerous. Just like many other similar choices (e.g. 19+1
RAID5 arrays), it works (not so awesomely) as long as it works, and
when it breaks it is very bad.

I like the Lustre idea, and to me it is currently the best of a not
very enthusing lot, but the MDT is by far its weakest part, and the
``lots of tiny files'' pattern is one of the big problems for it. In
particular, the size of MDTs is a significant scalability issue for
Lustre, which was designed in older, gentler times for purposes for
which metadata scalability might not have been so essential. Like
most good ideas it has been scaled up beyond expectations
(UNIX-style), and perhaps it is reaching the end of its useful
range. Fortunately, sensible Lustre people keep frequent and
wholesome MDS backups, and restoring one, even a backup of 500M
800-byte files, is hopefully much faster than an 'fsck' if there is
damage.

> It also depends on what you mean by "small files". We've
> previously discussed storing small file data in an extended
> attribute, and if you are tuning for this and the file size is
> small enough (3kB or less) the file data could be stored
> inside the inode (i.e. zero seek data IO).

If I were to use a filesystem as a makeshift database I would indeed
use one of those filesystems that store small files or file tails in
the metadata, as I wrote:

>> And for cases where a filesystem still makes sense I would
>> rather use, instead of the inane manylevel directory
>> structure above, a file system design with proper tree
>> indexes and perhaps even one with the ability to store
>> small files into inodes.

You might consider storing Lustre MDTs on Reiser3 instead of
'ldiskfs' :-).

But this is backwards; the database guys have spent the past several
decades working on the ``lots of small records reliably'' problem
(and with "bushy" indices), while the main work by the filesystem
guys has been solving the ``massive, massively parallel files'' one.
To the point that people like Reiser, who did work (with
database-like techniques) on the small-files problem for
filesystems, have been at best ignored.

[ ... ]

> I think you aren't backing your comments with any facts.

You may think that, but only because you have not read my comments
or want to misrepresent them. At the very start I gave a clear
example of a case with 1M small files, where just creating them
takes more than 15 hours in one layout vs. 6 minutes in the other.
For amusement I have just rerun it in a nicer form on a somewhat
faster system:

  base$ rm /fs/jugen/tmp/manysmall.db
  base$ time perl manymake.pl /fs/jugen/tmp/manysmall.db 1000000 50 100
  1 percent done, 990000 to go
  2 percent done, 980000 to go
  3 percent done, 970000 to go
  ....
  98 percent done, 20000 to go
  99 percent done, 10000 to go
  100 percent done, 0 to go

  real    0m48.209s
  user    0m6.240s
  sys     0m0.348s

  base$ ls -ld /fs/jugen/tmp/manysmall.db
  -rw------- 1 pcg pcg 98197504 Sep 21 16:19 /fs/jugen/tmp/manysmall.db

That's 1M records in a file of under 100MB in less than a minute, or
about 20K records/s and around 1.5MB/s, which is fairly typical for
random access to a fairly standard 1TB consumer drive in its latter
half.

  base$ sudo sysctl vm.drop_caches=1
  vm.drop_caches = 1
  base$ time perl manyseek.pl /fs/jugen/tmp/manysmall.db 1000000 10000
  1 percent done, 9900 to go
  2 percent done, 9800 to go
  3 percent done, 9700 to go
  ....
  98 percent done, 200 to go
  99 percent done, 100 to go
  100 percent done, 0 to go
  average length: 69.3816

  real    2m4.265s
  user    0m0.150s
  sys     0m0.126s

Seeking of course is not awesome: we get 10K records in about 2
minutes, or around 80 records/s. Ah well, I need an SSD :-).
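(The two scripts are not included in this message; for the curious,
here is a rough sketch of the kind of thing manymake.pl does. The
fixed-slot record layout, the 2-byte length prefix and the dummy
payload are illustrative guesses only, not the actual script:)

  # manymake-sketch.pl FILE COUNT MINLEN MAXLEN
  # Pack COUNT variable-length records into one flat file, each in a
  # fixed-size slot so that any record can later be found with a
  # single seek(). Illustrative only; not the original manymake.pl.
  use strict;
  use warnings;

  my ($file, $count, $minlen, $maxlen) = @ARGV;
  die "usage: $0 FILE COUNT MINLEN MAXLEN\n" unless $maxlen;

  open my $fh, '>', $file or die "open $file: $!";
  for my $i (0 .. $count - 1) {
      my $len = $minlen + int rand($maxlen - $minlen + 1);
      # each slot is MAXLEN+2 bytes: 2-byte length prefix plus payload
      seek $fh, $i * ($maxlen + 2), 0 or die "seek: $!";
      print {$fh} pack "n a*", $len, 'x' x $len;
      printf "%d percent done, %d to go\n",
          ($i + 1) * 100 / $count, $count - $i - 1
          if ($i + 1) % ($count / 100) == 0;
  }
  close $fh or die "close: $!";

And the matching read-back test, again just a sketch of the idea
(seek to a random slot, read the record, average the lengths):

  # manyseek-sketch.pl FILE TOTAL SAMPLES
  # Read SAMPLES records at random out of TOTAL and report the
  # average payload length; every access costs at least one seek once
  # the page cache has been dropped. Illustrative only; not the
  # original manyseek.pl.
  use strict;
  use warnings;

  my ($file, $total, $count) = @ARGV;
  die "usage: $0 FILE TOTAL SAMPLES\n" unless $count;

  my $maxlen = 100;    # must match the writer's slot size
  open my $fh, '<', $file or die "open $file: $!";
  my $sum = 0;
  for my $i (1 .. $count) {
      seek $fh, int(rand $total) * ($maxlen + 2), 0 or die "seek: $!";
      read $fh, my $hdr, 2 or die "short read";
      my $len = unpack "n", $hdr;
      read $fh, my $payload, $len;
      $sum += $len;
      printf "%d percent done, %d to go\n", $i * 100 / $count, $count - $i
          if $i % ($count / 100) == 0;
  }
  printf "average length: %.4f\n", $sum / $count;
  close $fh;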
And as to the 'fsck', I confess that I had a list of cases in mind,
but was waiting for the usual worn-out, dodgy technique of quoting
undamaged-filesystem times:

> The e2fsck time on our MDS filesystems with 500M IN USE inodes
> is on the order of 4 hours (disk-based RAID-1+0 array). If
> this was on a RAID-1+0 SSD it could be noticably faster. Ric
> also commented previously about single-digit hours for e2fsck
> on a test 1B file ext4 filesystem.

That is a classic "benchmark": undamaged-filesystem 'fsck' tests,
like that other favourite, the freshly-loaded filesystem benchmark,
are just dodgy marketing tools. And even so! 1 hour per TB, or 1
hour per 100M files. To me, keeping what may be a production
filesystem with 500M files unavailable for 4 hours because one
occasionally has to run 'fsck' (even if in fact there is no damage),
with the risk that it stretches to weeks or months, does not sound
like such a good idea. But who knows.

There have been reports, sadly familiar to those who work as
sysadmins, of single-digit-TB filesystems taking weeks to months to
repair, if damaged. The difference, of course, is between scanning
the metadata sequentially and crawling it by chasing references
around the disk. Which is perfectly obvious, as RAID allows
read/write bandwidth to be parallelized, but does not help much with
scans and even less with crawls. Scaling 'fsck' is not easy; it is
largely an unsolved research problem, even if things like Lustre
help somewhat (minus the MDTs, of course).

At the risk of sounding a bit preachy, I'll mention some wider
concepts (mostly from the database guys) that fit well in this
discussion:

* A "database" is defined as something including a dataset whose
  working set does not fit in memory (it thrashes: every access
  involves at least one IO). There are several types of databases
  (structured/unstructured, factual/textual/...), and a filesystem
  is a kind of database, as that definition applies to it. But to me
  and several decades of practice and theory it is a database of
  record _containers_ (as suggested by the very word "file"), not of
  records. It is exceptionally hard to build a DBMS that handles
  records and record containers equally well.

* A "very large database" is a database that cannot practically be
  backed up (or checked) offline, because the backup (or check)
  takes too long with respect to the requirements. Many filesystems
  are moving into the "very large database" category (can your
  customers accept that it might take 4 hours or 4 weeks to check,
  and 4 days to restore, their filesystem?).

  Storing small records (or even small containers) in a filesystem
  makes it much more likely that it becomes a "very large database",
  and while the technology for "very large database" DBMSes is
  mature, that for "very large database" filesystem designs is not
  there, or at least not as mature, even if the fun guys at Sun have
  been trying lately with ZFS.

* These are not novel or little-known concepts and experiences.
  'ar' files have been around for a long time, for some good reason
  (a small illustration below).
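Just to show how old and simple the "container of small records"
idea is, packing a pile of small files into a single archive has
been a one-liner since the early UNIX days (the paths and file names
here are made up for illustration):

  base$ ar rc /tmp/manysmall.a records/*.txt   # pack them into one archive file
  base$ ar t /tmp/manysmall.a | wc -l          # list the members
  base$ ar p /tmp/manysmall.a record042.txt    # print one member to stdout

One container file, one inode, and the per-record overhead is a
small member header instead of a full inode plus directory entry.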
_______________________________________________
Ext3-users mailing list
Ext3-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/ext3-users