On Sun, Aug 13, 2006 at 12:01:17AM -0700, Robinson Tiemuqinke wrote:
>
> A stupid flat directory /tmp holding 5 millon files,
> the directory locates on a ext3 file system with
> dir_index feature turned on. The running Linux are FC4
> and FC5.
>
> The files are just directly under /tmp, not in any
> subdirectories -- they are results of mis-operations
> of users.

Wow! How many users do you have on your system? And over what period
of time did this build up?

From a system administration point of view, a really good idea is to
have a job which deletes all files in /tmp that stick around for
longer than 24 hours or so, and which deletes everything
unconditionally on reboot. Then when the users scream, you can give
them access to a /scratch partition which has slightly more lax rules,
such as deletion after 1 or 2 weeks, and with a README which says,
"not backed up --- data can be deleted at any time, and if you
complain, we will laugh at you". :-)

From a technical point of view, what's happening is that dir_index
speeds up directory lookups by using a hash tree. Unfortunately,
POSIX imposes requirements about how readdir() is supposed to work if
files are added or deleted while the readdir() is in progress.
(Basically, a file which is created or deleted during the readdir()
must appear once or not at all, and all other files must be returned
exactly once.) This isn't too bad, except that this requirement must
also be maintained across a telldir(), which saves a linear offset
into the directory, and a seekdir(), which seeks back to that location
on disk. This interface is horribly broken, as it fundamentally
assumes a linear linked list implementation such as was used three
decades ago in Unix. And it gives filesystem implementors nightmares
when they are required to provide this interface even when they are
trying to use more advanced data structures that no longer have a
linear directory layout --- say, like a B-tree.

Different filesystems solve this in different ways; some use multiple
B-trees, with one B-tree that exists only so that readdir() can have
the proper semantics. This has the downside that file creations and
deletions now have to update two separate trees. The choice which
ext3 made was a simpler one: we simply return files in hash sort
order. This provides the correct semantics, but unfortunately it
means that workloads which do a readdir() followed by a stat() of each
file end up accessing the inode table in an effectively random order.
This can also happen if the inode table is fragmented, but hash
ordering causes the worst case to happen every single time.

There are solutions, and the simplest is to have programs read the
entire directory into memory, and then sort the list by inode number
before actually stat'ing the files. This can be done in userspace
much more easily than in the kernel, since userspace memory is
swappable, and kernel memory is not. I have written an LD_PRELOAD
library which allows a program to do the right thing without needing
to modify the program:

http://www.redhat.com/archives/ext3-users/2004-September/msg00025.html

Unfortunately, for programs that use telldir() and seekdir(), hold on
to the telldir() pointer for a long time, and still expect POSIX
semantics, this will not necessarily work correctly, so it's not
something I would recommend as a systemwide LD_PRELOAD. But it is
useful for accelerating programs that haven't yet been modified, such
as ls and find. Other programs, such as mutt's maildir handling, have
already been so modified, which is a much better solution. (In fact,
it provides speedup benefits on all filesystems, just much more on
ext3 filesystems with dir_index enabled.)

The fact that ext3 doesn't shrink directories is a long-standing Unix
implementation restriction.
It's not impossible for us to add support for truncating directories
as files get deleted, but it's just never bubbled up to the top of the
todo list; in practice, workloads that create gigantic directories
that then shrink down to nothing are relatively rare.

> If there are any ways to fix this kind of problem
> without rebooting machine? I'm afraid of the commands
> "rsync -avHn /tmp/ /new_tmp/; rm -rf /tmp/ && mv
> /new_tmp/ /tmp" because other applications are
> accessing /tmp/ as well.

Not without rebooting, but it will probably require scheduled downtime
where you kick all of the users off, and then recreate the /tmp
directory --- either using rsync, or just doing a plain old
"rm -rf /tmp; mkdir /tmp".

If users are expecting that files stick around in /tmp, that's a huge
cultural problem, and it will come back to haunt you in multiple
ways....

                                                - Ted

_______________________________________________
Ext3-users mailing list
Ext3-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/ext3-users
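
For reference, the userspace approach described above (read the whole
directory into memory, sort the entries by inode number, and only then
stat them) can be sketched in Python roughly as follows. This is only
an illustration under assumptions, not the LD_PRELOAD library linked
above: the function name stat_in_inode_order is made up for this
example, and it relies on os.scandir() exposing the inode number from
the directory entry itself without a separate stat() call, which is
the case on Linux.

```python
import os
import tempfile


def stat_in_inode_order(path):
    """Return (name, stat_result) pairs for the entries of `path`.

    Hypothetical helper for illustration: the stat calls are issued in
    inode-number order rather than in the order readdir() returned them.
    """
    # Pass 1: read the whole directory. entry.inode() comes from the
    # directory entry (d_ino), so no per-file stat happens here on Linux.
    with os.scandir(path) as it:
        entries = [(entry.inode(), entry.name) for entry in it]

    # Sort by inode number so pass 2 walks the inode table roughly
    # sequentially instead of in hash order.
    entries.sort()

    # Pass 2: stat each file in inode order.
    results = []
    for _ino, name in entries:
        try:
            st = os.lstat(os.path.join(path, name))
        except FileNotFoundError:
            continue  # the file was deleted after it was listed
        results.append((name, st))
    return results


if __name__ == "__main__":
    # Small self-contained demo on a scratch directory.
    with tempfile.TemporaryDirectory() as demo_dir:
        for name in ("a", "b", "c"):
            open(os.path.join(demo_dir, name), "w").close()
        for name, st in stat_in_inode_order(demo_dir):
            print(st.st_ino, name)
```

The sort needs memory proportional to the number of entries, which is
the reason given above for doing this in userspace rather than in the
kernel.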