On Oct 30, 2011, at 1:37 AM, Coly Li wrote:

> Forgive me if this is off topic.
> In our testing, allocating directories with bigalloc but without
> inline-data may occupy most of the disk space. Since ext4 inline-data
> is not merged yet, I was wondering how Google uses bigalloc without
> the inline-data patch set?

It depends on how many directories you have (i.e., how deep your directory structure is) and how many small files you have in the file system, as to whether bigalloc without inline-data has an acceptable overhead or not.

As I've noted before, for at least the last 7-8 years, and probably a decade, average seek times for 7200rpm drives have remained constant at 10ms, even as disk capacities have grown from 200GB in 2004 to 3TB in 2011. Yes, you can spin the platters faster, but the energy requirements go up with the square of the revolutions per minute, while seek times improve only linearly; so platter speeds top out at 15,000rpm due to diminishing returns, and in fact some "green" drives spin at only 5400rpm or even slower. (Interestingly enough, they tend not to advertise either the platter speed or the average seek time; funny, that...)

At 10ms per seek, if the HDD isn't doing _anything_ else, it can do at most 100 seeks per second. Hence, if you have a workload where latency is at a premium, then as disk capacities grow, disks are effectively getting slower for a given data set size. For example, in 2004, if you wanted to serve 5TB of data, you needed 25 200GB disks, so you had 2,500 random read/write operations per second at your disposal. In 2011, with 3TB disks, you only need 2 HDDs, which gives you an order of magnitude fewer random operations per second (see the first sketch below). (Yes, you could use flash, or a flash-backed cache, but if the working set is really large this can get very expensive, so it's not a solution suitable for all situations.)

Another way of putting it: if latency really matters and you have a random read/write workload, capacity management can become more about seeks than about the actual number of gigabytes. Hence, "wasting" space by using a larger cluster size may be a win if you are doing a large number of block allocations/deallocations and memory pressure keeps throwing the block bitmaps out of memory, so that you have to keep seeking to read them back in. By using a large cluster size, we reduce fragmentation, and we reduce the number of block bitmaps, which makes them more likely to stay in memory (see the second sketch below). Furthermore, reducing the number of bitmap blocks makes it more tenable to pin them in memory, if there is a desire to guarantee that they stay there. (Dave Chinner was telling me that XFS manages its own metadata block lifespan, with its own shrinkers, instead of leaving it to the VM to decide when cached metadata gets ejected from memory. That might be worth doing at some point in ext4, but of course it would add complexity as well.)

The bottom line is that if you are seek-constrained, wasting space by using a large cluster size may not be a huge concern. And if nearly all of your files are larger than 1MB, with many significantly larger, inline data isn't going to help you much. On the other hand, it may be that using a 128-byte inode is a bigger win than using a larger inode size and storing the data in the inode table: a small inode size reduces metadata I/O by doubling the number of inodes per block compared to a 256-byte inode, never mind a 1k or 4k inode.
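To put rough numbers on the seek arithmetic, here's a back-of-the-envelope sketch (Python, purely illustrative; the 10ms seek time and the disk capacities are the figures quoted above):

import math

AVG_SEEK_MS = 10
IOPS_PER_DISK = 1000 // AVG_SEEK_MS        # ~100 seeks/sec per spindle

def serving_capacity(dataset_tb, disk_tb):
    """Spindles needed to hold dataset_tb, and their aggregate random IOPS."""
    spindles = math.ceil(dataset_tb / disk_tb)
    return spindles, spindles * IOPS_PER_DISK

print(serving_capacity(5, 0.2))   # 2004: (25, 2500) -- 25 x 200GB disks
print(serving_capacity(5, 3.0))   # 2011: (2, 200)   -- 2 x 3TB disks

So for the same 5TB data set, the 2011 configuration has roughly a tenth of the random I/O capacity.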
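Here's the corresponding sketch for the bitmap footprint. It assumes the standard ext4 layout, where each block group's block bitmap occupies a single block (so a 4KB bitmap holds 32768 bits), and that with bigalloc each bit tracks a cluster rather than a block; the 3TB filesystem size is the example from above, and the 1MB cluster size is just one plausible bigalloc setting, not a recommendation:

BLOCK_SIZE = 4096                      # bytes
BITS_PER_BITMAP = 8 * BLOCK_SIZE       # 32768 bits per bitmap block

def bitmaps_needed(fs_bytes, cluster_bytes):
    # Each bitmap bit tracks one allocation cluster.
    clusters = fs_bytes // cluster_bytes
    return -(-clusters // BITS_PER_BITMAP)   # ceiling division

TB = 1 << 40
MB = 1 << 20

print(bitmaps_needed(3 * TB, 4096))    # 4KB clusters: 24576 bitmaps (~96MB)
print(bitmaps_needed(3 * TB, MB))      # 1MB clusters:    96 bitmaps (~384KB)

A hundred or so bitmap blocks can plausibly be kept (or even pinned) in memory; tens of thousands are far more likely to get evicted under memory pressure.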
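And the inode-size arithmetic (128 and 256 bytes are the real ext4 inode sizes; the one-million inode count is made up purely for illustration):

BLOCK_SIZE = 4096

def inode_table_blocks(num_inodes, inode_size):
    inodes_per_block = BLOCK_SIZE // inode_size
    return -(-num_inodes // inodes_per_block)    # ceiling division

for size in (128, 256, 1024, 4096):
    print(size, BLOCK_SIZE // size, inode_table_blocks(1000000, size))
# 128-byte inodes: 32/block -> 31250 table blocks
# 256-byte inodes: 16/block -> 62500 table blocks (2x the metadata I/O)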
Hence, if you don't need extended attributes, ACLs, or sub-second timestamp resolution, you might want to consider 128-byte inodes as possibly being a bigger win than inline data. All of this requires benchmarking with your specific workload, of course.

I'm not against your patch set, however; I just haven't had time to look at it at all (nor the secure delete patch set, etc.). Between organizing the kernel summit, the kernel.org compromise, and some high-priority bugs at $WORK, things have just been too busy. Sorry about that; I'll get to them after the merge window and the post-merge bug fixing is under control.

-- Ted