On Oct 25, 2013, at 2:18 AM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov <t.artem@xxxxxxxxx> wrote:
>>
>> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11
>> kernel built for the i686 (with PAE) and x86-64 architectures. What’s
>> really troubling me is that the x86-64 kernel has the following problem:
>>
>> When I copy large files to any storage device, be it my HDD with ext4
>> partitions or flash drive with FAT32 partitions, the kernel first
>> caches them in memory entirely then flushes them some time later
>> (quite unpredictably though) or immediately upon invoking "sync".
>
> Yeah, I think we default to a 10% "dirty background memory" (and
> allows up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB
> of dirty memory for writeout before we even start writing, and twice
> that before we start *waiting* for it.
>
> On 32-bit x86, we only count the memory in the low 1GB (really
> actually up to about 890MB), so "10% dirty" really means just about
> 90MB of buffering (and a "hard limit" of ~180MB of dirty).
>
> And that "up to 3.2GB of dirty memory" is just crazy. Our defaults
> come from the old days of less memory (and perhaps servers that don't
> much care), and the fact that x86-32 ends up having much lower limits
> even if you end up having more memory.

I think the “delay writes for a long time” behaviour is a holdover from
the days when e.g. /tmp was on a disk and compilers had lousy IO
patterns, writing temporary files and then deleting them. Today, /tmp
is usually in RAM, and IMHO the “write and delete” workload tested by
dbench is not worth optimizing for.

With Lustre, we’ve long taken the approach that if there is enough dirty
data on a file to make a decent write (which is around 8MB today even
for very fast storage) then there isn’t much point in holding back for
more data before starting the IO.
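
As a quick sanity check on the numbers in this thread, here is a small Python sketch. It is only back-of-the-envelope arithmetic, not what the kernel does: `dirty_limits` applies the 10%/20% ratios to a flat byte count (the real thresholds are taken against dirtyable memory, which is smaller than total RAM), and `seek_bound_throughput` assumes a disk that manages about 100 seeks per second and ignores transfer time. The function names and the 100 seeks/s figure are my own assumptions for illustration.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def dirty_limits(counted_bytes, background_pct=10, hard_pct=20):
    """Return (background, hard) dirty thresholds in bytes.

    Simplification: the ratios are applied to a flat byte count,
    not to the kernel's notion of dirtyable memory.
    """
    background = counted_bytes * background_pct // 100
    hard = counted_bytes * hard_pct // 100
    return background, hard


def seek_bound_throughput(write_bytes, seeks_per_sec=100):
    """Bytes per second if every write costs one seek.

    Assumption: the device is purely seek-bound (about 100 seeks/s),
    so transfer time is ignored.
    """
    return write_bytes * seeks_per_sec


if __name__ == "__main__":
    # 64-bit: all 16GB counted. 32-bit: only about 890MB of low memory counted.
    for label, counted in (("x86-64, 16GB", 16 * GIB), ("x86-32, ~890MB", 890 * MIB)):
        background, hard = dirty_limits(counted)
        print(f"{label}: background {background / MIB:.0f}MB, hard {hard / MIB:.0f}MB")

    # Throughput of a seek-bound disk for a few write sizes.
    for size_mib in (1, 4, 8):
        rate = seek_bound_throughput(size_mib * MIB)
        print(f"{size_mib}MB writes: about {rate / MIB:.0f}MB/s")
```

Under these assumptions, multi-megabyte writes already amortize the seek cost, so waiting for gigabytes of dirty data buys little extra efficiency.
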
Any decent allocator will be able to grow allocated extents to handle
following data, or allocate a new extent. At 4-8MB extents, even very
seek-impaired media could do 400-800MB/s (likely much faster than the
underlying storage anyway).

This also avoids wasting (tens of?) seconds of idle disk bandwidth. If
the disk is already busy, then the IO will be delayed anyway. If it is
not busy, then why aggregate GB of dirty data in memory before flushing
it?

Something simple like “start writing at 16MB dirty on a single file”
would probably avoid a lot of complexity at little real-world cost.
That shouldn’t throttle dirtying memory above 16MB, but just start
writeout much earlier than it does today.

Cheers, Andreas