Re: An idea for supporting large directories and readdir+stat workloads

On Sun, Aug 12, 2012 at 11:11:32PM -0600, Andreas Dilger wrote:
> Essentially, this is storing the inode in the directory.

Well, except we don't need to store *all* of the inode in the
directory.  My proposal was to include just the bits which are
necessary for stat(2) to function, which is about 64 bytes including
the SELinux SID.

If we store the "compact inode" in the directory, for an average file
name length of, say, 12 bytes, we can still store some 50-52 directory
entries in each 4k block.  If the htree code directs us to the correct
leaf block, we can still do lookups and stats very quickly.
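
As a rough check on that estimate, here is a back-of-the-envelope sketch in Python.  The 4-byte per-entry header (rec_len, name_len, file type) and the 4-byte alignment are my assumptions about what the entry would still need around the compact inode; the inode number is counted as part of the compact inode itself.

```python
BLOCK_SIZE = 4096        # 4k directory block, as in the text
NAME_LEN = 12            # the "say, 12 bytes" average name length
DIRENT_HEADER = 4        # assumed: rec_len (2) + name_len (1) + file type (1)


def entries_per_block(compact_inode_size, align=4):
    """Return how many directory entries fit in one block."""
    raw = DIRENT_HEADER + compact_inode_size + NAME_LEN
    padded = -(-raw // align) * align   # round up to a multiple of align
    return BLOCK_SIZE // padded


# Try both the 62-byte total and the rounded "about 64 bytes" figure,
# with and without the assumed 4-byte alignment.
for size in (62, 64):
    print(size, entries_per_block(size), entries_per_block(size, align=1))
```

Under these assumptions every combination lands in the 50-52 range quoted above; any per-block overhead such as a checksum tail is ignored here.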

> What might be possible is to have the directory leaf block point be
> the same thing as the inode table block, and then use an xattr in
> the inode to hold the parent directory number and filename as the
> "dirent".

There are a couple of problems I see with this approach; one is how do
we handle hash collisions in this new scheme?  You could put multiple
entries in the index block, one for each inode table block, I suppose.

Another issue is that it will slow down a pure readdir()-only workload
significantly, since it will require a large number of random reads
into the inode table.  Even a readdir+stat workload will require many
more random reads than it would with the "compact inode" in the
directory entry which I propose.

> I think that by the time we store all of the "stat" attributes into the
> directory it would essentially duplicate the inode, and increase overhead
> instead of reducing it.  Every inode update would likely require also
> updating the directory (mtime, ctime, etc).

What I was proposing was in the case where the inode had only a single
hard link (the common case), the information in the "compact inode"
(i.e., in the directory entry) would take precedence over what was
stored in the inode table.  This means that we would only need to
update the directory block -- and these are the fields that we would
need to store:

Field		Size

ino		 4
dev		 4
mode		 4
uid		 4
gid		 4
size		 8
atime		 8
mtime		 8
ctime		 8
blocks		 6
selinux sid	 4
               ====
TOTAL		62
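
To make the layout concrete, here is a minimal sketch of how those fields could be packed, using Python's struct module.  The field order, little-endian byte order and absence of padding are assumptions for illustration, as is treating the block count as a 48-bit (6-byte) value, which is the width that makes the fields sum to 62 bytes.  None of this is a proposed on-disk format.

```python
import struct

# Little-endian, no padding.  The 6-byte block count is an assumption
# (a 48-bit count), stored here as a raw 6-byte string.
COMPACT_INODE = struct.Struct("<IIIIIQQQQ6sI")

FIELDS = ("ino", "dev", "mode", "uid", "gid", "size",
          "atime", "mtime", "ctime", "blocks", "selinux_sid")


def pack_compact_inode(**st):
    """Pack the stat fields into the hypothetical 62-byte record."""
    values = [st[name] for name in FIELDS]
    # Encode the block count as 6 little-endian bytes.
    values[FIELDS.index("blocks")] = st["blocks"].to_bytes(6, "little")
    return COMPACT_INODE.pack(*values)


def unpack_compact_inode(buf):
    """Inverse of pack_compact_inode(); returns a dict of fields."""
    st = dict(zip(FIELDS, COMPACT_INODE.unpack(buf)))
    st["blocks"] = int.from_bytes(st["blocks"], "little")
    return st


print(COMPACT_INODE.size)
```

The printed size is the packed record length, which can be compared against the total in the table.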

So basically, instead of updating the inode table, we would have to
update the directory block instead.  So there wouldn't be any extra
I/O overhead.  In terms of increasing the size of the directory, it
would certainly do that; but I think the speed improvements could very
well be worth it.
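
The precedence rule could look roughly like the sketch below.  The dict-based directory entry, the "compact" key and the read_inode_table() callback are all hypothetical names for illustration, and falling back to the inode table for multiply-linked inodes is my assumption about the remaining case.

```python
def stat_entry(dirent, read_inode_table):
    """Return stat fields for one directory entry.

    If the entry carries a compact inode (single hard link), that copy
    is authoritative and the inode table is not read at all.
    """
    compact = dirent.get("compact")
    if compact is not None:
        return compact
    # Assumed fallback: multiply-linked inodes keep using the inode table.
    return read_inode_table(dirent["ino"])


table_reads = []          # records which inodes needed an inode table read


def read_inode_table(ino):
    # Stand-in for an inode table lookup; returns dummy fields.
    table_reads.append(ino)
    return {"ino": ino, "mode": 0o100644, "size": 0}


single = {"name": "a.txt", "ino": 12,
          "compact": {"ino": 12, "mode": 0o100644, "size": 4096}}
multi = {"name": "b.txt", "ino": 13, "compact": None}

print(stat_entry(single, read_inode_table)["size"])
print(stat_entry(multi, read_inode_table)["ino"])
print(table_reads)
```

The point of the sketch is only that the single-link path never touches the inode table, which is where the savings for readdir+stat would come from.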

Regards,

						- Ted