Re: Allocation strategy - dynamic zone for small files

On Mon, Nov 13, 2006 at 09:46:01AM -0800, Bryan Henderson wrote:
> >Does anyone have any estimates of how much space is wasted by these
> >files without making them a special case?  It seems to me that most
> >people have huge disks and don't really care about losing a few KB here
> >and there (especially if it makes more common cases slower).
> 
> Two thoughts:
> 
> 1) It's not just disk capacity.  Using a 4K disk block for 16 bytes of 
> data also wastes the time it takes to drag that 4K from disk to memory and 
> cache space.
> 
> 2) Making more efficient storage and access of _existing_ sets of files 
> isn't usually the justification for this technology.  It's enabling new 
> kinds of file sets.  Imagine all the 16 byte files that never got created 
> because the designer didn't want to waste 4K on each.  A file with  a 
> million 16 byte pieces might work better with a million separate files, 
> but was made a single file because 64 GB of storage for 16 MB of data was 
> not practical.  Similarly, there are files that would work better with 1 
> MB blocks, but have 4K blocks anyway, because the designer couldn't afford 
> 1 MB for every 16 byte file.

More thoughts:

1) It's not just about storage efficiency, but also about transfer
efficiency.  Disk drives generally like to transfer data in hunks of
16k to 64k at a time, so if related pieces of small data get read at
the same time, we can win big on performance.  BUT it's extremely
hard to do this at the filesystem level, since the application is
much more likely than the filesystem to know which 16-byte micro-file
is likely to be needed at the same time as some other 16-byte
micro-file.

2) If you have millions of separate files, each 16 bytes long, and you
need to read a huge number of them, you can end up getting killed on
system call overhead.  

I remember having this argument with Hans Reiser at one point.  His
argument was that parsing was evil and should never have to be done.
(And if anyone has ever seen the vast quantities of garbage generated
when you implement an XML parser in Java, and the resulting GC
overhead, I can't blame him for thinking this...)  So his argument
was that instead of parsing a file like /etc/inetd.conf, there should
be an /etc/inetd.conf.d directory, and in that directory there might
be a directory called telnet, another called ssh, and yet another
called smtp, and then you might have files such as:

FILENAME					CONTENTS
===============================================================

/etc/inetd.conf.d/telnet/port			23
/etc/inetd.conf.d/telnet/protocol		tcp
/etc/inetd.conf.d/telnet/flags			nowait
/etc/inetd.conf.d/telnet/user			root
/etc/inetd.conf.d/telnet/daemon			/sbin/telnetd

/etc/inetd.conf.d/ssh/port			22
/etc/inetd.conf.d/ssh/protocol			tcp
/etc/inetd.conf.d/ssh/flags			nowait
/etc/inetd.conf.d/ssh/user			root
/etc/inetd.conf.d/ssh/daemon			/sbin/sshd

etc.  When I pointed out the system call overhead that would result
(instead of an open, a read, and a close to read /etc/inetd.conf, you
would now need perhaps a hundred or more system calls to do the
opendir/readdir loop and then individually open, read, and close each
file), Hans had a solution ---- a new system call where you could
download a byte-coded program of commands into the kernel, so the
kernel could execute the sequence of commands and return to userspace
a single buffer containing the contents of all of the files, which
could then be parsed by the userspace program.....
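
To make the overhead concrete, here is a rough sketch (purely
illustrative, not code anyone actually wrote; the read_service()
helper is made up for the example) of the userspace side of walking
one service directory in that hypothetical layout.  Every attribute
file costs an open/read/close triple, on top of the opendir/readdir
pass over the directory itself:

#include <dirent.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* Read every attribute file in one service directory, e.g.
 * /etc/inetd.conf.d/telnet.  Roughly 3 system calls per attribute
 * file, plus the opendir/getdents/closedir traffic for the
 * directory itself. */
static int read_service(const char *svcdir)
{
	DIR *d = opendir(svcdir);		/* opens the directory */
	struct dirent *de;
	char path[PATH_MAX], buf[64];

	if (!d)
		return -1;
	while ((de = readdir(d)) != NULL) {	/* getdents under the hood */
		if (de->d_name[0] == '.')
			continue;
		snprintf(path, sizeof(path), "%s/%s", svcdir, de->d_name);
		int fd = open(path, O_RDONLY);	/* syscall #1 per file */
		if (fd < 0)
			continue;
		ssize_t n = read(fd, buf, sizeof(buf) - 1);	/* #2 */
		close(fd);			/* #3 */
		if (n > 0) {
			buf[n] = '\0';
			printf("%s = %s", de->d_name, buf);
		}
	}
	closedir(d);
	return 0;
}

With five attribute files per service, that's on the order of twenty
system calls per service, versus the three it takes to slurp the
whole of a single /etc/inetd.conf.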

But wait a second, I thought the whole point of this complicated
scheme, including implementing a byte code interpreter in the kernel
with all of the attendant potential security issues, was to avoid
needing to do parsing.  Oops, oh well, so much for that idea.

So color me skeptical that 16-byte files are really such a great
design...

						- Ted

