Linus Torvalds wrote:
On Mon, 13 Apr 2009, Avi Kivity wrote:
- create a big file,
Just creating a 5GB file on a filesystem with 64KB clusters was
interesting - Windows was throwing out 256KB I/Os even though I was
generating 1MB writes (and cached ones at that). Looks like a paranoid
IDE driver (qemu exposes a PIIX4).
Heh, ok. So the "big file" really only needed to be big enough to not be
cached, and 5GB was probably overkill. In fact, if there's some way to
blow the cache, you could have made it much smaller. But 5G certainly
works ;)
I wanted to make sure my random writes later wouldn't get coalesced. A 1GB
file, half of which is cached (I used a 1GB guest), offers lots of
chances for coalescing if Windows delays the writes sufficiently. At
5GB, Windows can only cache 10% of the file, so it will be continuously
flushing.
(a) Windows caches things with a 4kB granularity, so the 512-byte write
turned into a read-modify-write
[...]
You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for
your example!). It's a total disaster. Imagine what would happen to user
application performance if kmalloc() always returned 16kB-aligned chunks
of memory, all sized as integer multiples of 16kB. It would absolutely
_suck_. Sure, it would be fine for your large allocations, but any time
you handle strings, you'd allocate 16kB of memory for any small 5-byte
string. You'd have horrible cache behavior, and you'd run out of memory
much too quickly.
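
To put a rough number on that, here's a tiny user-space sketch - nothing
kernel-side about it, and the sizes below are made-up small-object
examples rather than measured data - of what rounding every allocation
up to a 16kB multiple would waste:

#include <stdio.h>
#include <stddef.h>

/* Illustrative only: what "every allocation is a multiple of 16kB"
 * would cost for a handful of made-up small-object sizes. */
int main(void)
{
    const size_t granule = 16 * 1024;
    const size_t sizes[] = { 5, 24, 64, 200, 4096 };

    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        size_t rounded = (sizes[i] + granule - 1) / granule * granule;
        printf("request %5zu bytes -> allocate %6zu (%.1f%% wasted)\n",
               sizes[i], rounded,
               100.0 * (double)(rounded - sizes[i]) / (double)rounded);
    }
    return 0;
}

The 5-byte string ends up with 16384 bytes allocated - 99.97% of it
wasted.
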
The same is true in the kernel. The single biggest memory user under
almost all normal loads is the disk cache. That _is_ the normal allocator
for any OS kernel. Everything else is almost details (ok, so Linux in
particular does cache metadata very aggressively, so the dcache and inode
cache are seldom "just details", but the page cache is still generally the
most important part).
So having a 16kB or 64kB granularity is a _disaster_. Which is why no sane
system does that. It's only useful if you absolutely _only_ work with
large files - ie you're a database server. For just about any other
workload, that kind of granularity is totally unacceptable.
So doing a read-modify-write on a 1-byte (or 512-byte) write, when the
block size is 4kB, is easy - we just have to do it anyway.
Doing a read-modify-write on a 4kB write with a 16kB (or 64kB) block size
is also _doable_, and from the IO pattern standpoint it is no different. But
from a memory allocation pattern standpoint it's a disaster - because now
you're always working with chunks that are just 'too big' to be good
building blocks of a reasonable allocator.
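
As a sketch of that IO pattern - user-space pread()/pwrite() standing in
for the real kernel path, with the block size and the helper name made
up for illustration:

#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* A write smaller than the block size is serviced by reading the whole
 * containing block, patching the affected bytes, and writing the block
 * back out. */
#define BLOCK_SIZE 4096

static int rmw_write(int fd, const void *buf, size_t len, off_t pos)
{
    char block[BLOCK_SIZE];
    off_t start = pos & ~(off_t)(BLOCK_SIZE - 1);

    if (len > BLOCK_SIZE - (size_t)(pos - start))
        return -1;                              /* keep the sketch to one block */

    if (pread(fd, block, BLOCK_SIZE, start) < 0)    /* read   */
        return -1;
    memcpy(block + (pos - start), buf, len);        /* modify */
    if (pwrite(fd, block, BLOCK_SIZE, start) < 0)   /* write  */
        return -1;
    return 0;
}

The pattern is the same whether BLOCK_SIZE is 4kB or 64kB; what changes
is how much memory each small write forces you to drag around.
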
If you always allocate 64kB for file caches, and you work with lots of
small files (like a source tree), you will literally waste all your
memory.
Well, no one is talking about 64KB granularity for in-core files. As
you noticed, Windows uses the mmu page size. We could keep doing that,
and still have 16KB+ sector sizes. It just means an RMW if you don't
happen to have the adjoining clean pages in cache.
Sure, on a rotating disk that's a disaster, but we're talking SSD here,
so while you're doubling your access time, you're doubling a fairly
small quantity. The controller would do the same if it exposed smaller
sectors, so there's no huge loss.
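
A sketch of that writeback path - every name here is made up for
illustration (cache_lookup() is a stub, not the Linux page-cache API):
cache at 4kB page granularity, do device I/O in 16kB sectors, and pay
one extra device read only when a clean sibling page is missing:

#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SZ          4096
#define SECTOR_SIZE      16384
#define PAGES_PER_SECTOR (SECTOR_SIZE / PAGE_SZ)

/* Hypothetical cache lookup; stubbed to "nothing else is cached". */
static const char *cache_lookup(off_t page_offset)
{
    (void)page_offset;
    return NULL;
}

static int write_back_page(int fd, off_t page_offset, const char *dirty_page)
{
    char sector[SECTOR_SIZE];
    off_t sector_start = page_offset & ~(off_t)(SECTOR_SIZE - 1);
    bool have_all = true;

    /* Do we have every sibling page of this sector in memory? */
    for (int i = 0; i < PAGES_PER_SECTOR; i++) {
        off_t off = sector_start + (off_t)i * PAGE_SZ;
        if (off != page_offset && !cache_lookup(off))
            have_all = false;
    }

    /* Missing clean siblings: one extra device read (the RMW). */
    if (!have_all && pread(fd, sector, SECTOR_SIZE, sector_start) < 0)
        return -1;

    /* Overlay the pages we do have, including the dirty one. */
    for (int i = 0; i < PAGES_PER_SECTOR; i++) {
        off_t off = sector_start + (off_t)i * PAGE_SZ;
        const char *p = (off == page_offset) ? dirty_page : cache_lookup(off);
        if (p)
            memcpy(sector + i * PAGE_SZ, p, PAGE_SZ);
    }

    return pwrite(fd, sector, SECTOR_SIZE, sector_start) == SECTOR_SIZE ? 0 : -1;
}

On a rotating disk that extra pread() is a seek; on an SSD it roughly
doubles a latency that was small to begin with, which is the point above.
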
We still lose on disk storage efficiency, but I'm guessing that for a
modern tree, with some object files carrying debug information and a
.git directory, it won't be such a great hit. For more mainstream uses,
it would be
negligible.
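
For anyone who wants to test that guess, here's a rough user-space tool
(POSIX nftw(); it ignores metadata, sparse files and tail packing) that
compares a tree's actual size with the same files rounded up to 64KB
allocation units:

#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

#define ALLOC_UNIT (64 * 1024)

static uintmax_t used, rounded;

static int visit(const char *path, const struct stat *sb,
                 int type, struct FTW *ftwbuf)
{
    (void)path; (void)ftwbuf;
    if (type == FTW_F) {
        used    += (uintmax_t)sb->st_size;
        rounded += ((uintmax_t)sb->st_size + ALLOC_UNIT - 1)
                   / ALLOC_UNIT * ALLOC_UNIT;
    }
    return 0;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <tree>\n", argv[0]);
        return 1;
    }
    if (nftw(argv[1], visit, 64, FTW_PHYS) != 0) {
        perror("nftw");
        return 1;
    }
    printf("actual %ju bytes, at 64KB clusters %ju bytes (%.2fx)\n",
           used, rounded, used ? (double)rounded / (double)used : 0.0);
    return 0;
}

Run it over a built source tree versus something like a media directory
and the ratio shows how workload-dependent the hit is.
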
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.