On Fri, Apr 10, 2009 at 02:05:03AM +0800, Andi Kleen wrote:
> "Frantisek Rysanek" <Frantisek.Rysanek@xxxxxxx> writes:
> >
> > I don't understand all the tweakable knobs of mkfs.xfs - not well
> > enough to match the 4MB RAID chunk size somewhere in the internal
> > structure of XFS.
>
> If it's software RAID, recent mkfs.xfs should be able to figure
> out the stripe sizes on its own.

A side note on Frantisek's "perfectly aligned 4MB readahead on 4MB
file allocation on 4MB RAID chunk size" proposal:

- A 4MB IO size may be good for _disk_ bandwidth, but not necessarily
  for the actual throughput of your applications, because of latency
  issues.

- A quick (and dirty) solution for your big-file servers is to use a
  16MB chunk size for the software RAID and a 2MB readahead size.
  It won't suffer much from RAID5's partial-write inefficiency, because
  - the write ratio is small
  - the writes are mostly sequential and can be written back in bursts
  The benefit for reads is that, as long as XFS keeps the file blocks
  contiguous, only 1 out of 8 readahead IOs will involve two disks :-)

> > Another problem is, that there seems to be a single tweakable knob to
> > read-ahead in Linux 2.6, accessible in several ways:
> > /sys/block/<dev>/queue/max_sectors_kb
> > /sbin/blockdev --setra
> > /sbin/blockdev --setfra
>
> unsigned long max_sane_readahead(unsigned long nr)
> {
>         return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
>                 + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> }
>
> So you can affect it indirectly by keeping a lot of memory free
> with vm.min_free_kbytes. Probably not an optimal solution.

Of course, not even a viable one ;)

Here is the memory demand of concurrent readahead: for a 1MB readahead
size, each stream requires about 2MB of memory to keep it safe from
readahead thrashing. So for a server with 1000 streams, 2GB is enough
for readahead.

My old adaptive readahead patches can significantly reduce this
requirement - e.g. cut that 2GB down to 500MB. However, who cares
(please speak out!)? Servers seem to have plenty of memory nowadays..

> >
> > Based on some manpages on the madvise() and fadvise() functions, I'd
> > say that the level of read-ahead corresponding to MADV_SEQUENTIAL and
> > FADV_SEQUENTIAL is still decimal orders less than the desired figure.
>
> Wu Fengguang (cc'ed) is doing a lot of work on the MADV_* readahead
> algorithms. There was a recent new patchkit from him on linux-kernel
> that you might try. It still uses strict limits, but it's better
> at figuring out specific patterns.
>
> But then if you really know very well what kind of readahead
> is needed it might be best to just implement it directly in the
> applications than to rely on kernel heuristics.

File downloading servers typically run sequential reads/writes, which
can be well served by the kernel readahead logic.

Apache/lighttpd have the option to do mmap reads. For these sequential
mmap read workloads, these new patches are expected to serve them well:

        http://lwn.net/Articles/327647/

> For example for faster booting sys_readahead() is widely used
> now.

And the more portable/versatile posix_fadvise() hints :-)

Thanks,
Fengguang
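
P.S. A minimal userspace sketch of the posix_fadvise()/readahead()
hints mentioned above, for applications that want to drive their own
readahead rather than rely on the kernel heuristics. The file name and
the 2MB prefetch window are illustrative assumptions only, not taken
from any real workload:

/* Sketch: hint the kernel about a sequential read of one big file. */
#define _GNU_SOURCE             /* for readahead(2) */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* hypothetical big file served sequentially */
        int fd = open("/srv/bigfile.iso", O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* Declare a sequential access pattern for the whole file, so
         * the kernel can ramp up its readahead window accordingly. */
        int err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        if (err != 0)
                fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

        /* Or prefetch an explicit range, like the boot-time users of
         * sys_readahead() mentioned above (here: the first 2MB). */
        if (readahead(fd, 0, 2 * 1024 * 1024) != 0)
                perror("readahead");

        /* ... then read() the file as usual ... */
        close(fd);
        return 0;
}

The fadvise call only describes the access pattern and leaves the
window sizing to the kernel, while readahead() pulls the given range
into the page cache directly, which is why it suits the boot-time
prefetch case.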