On Fri, Apr 10, 2009 at 02:05:03AM +0800, Andi Kleen wrote:
> "Frantisek Rysanek" <Frantisek.Rysanek@xxxxxxx> writes:
> >
> > I don't understand all the tweakable knobs of mkfs.xfs - not well
> > enough to match the 4MB RAID chunk size somewhere in the internal
> > structure of XFS.
>
> If it's software RAID, recent mkfs.xfs should be able to figure
> out the stripe sizes on its own.

A side note on Frantisek's "perfectly aligned 4MB readahead on 4MB
file allocation on 4MB RAID chunk size" proposal:

- A 4MB IO size may be good for _disk_ bandwidth, but not necessarily
  for the actual throughput of your applications, because of latency
  issues.

- A quick (and dirty) solution for your big-file servers is to use a
  16MB chunk size for the software RAID and a 2MB readahead size.
  It won't suffer much from RAID5's partial-write inefficiency, because
  - the write ratio is small
  - the writes are mostly sequential and can be written back in bursts
  The benefit for reads is that, as long as XFS keeps the file blocks
  contiguous, only 1 out of 8 readahead IOs will involve two disks :-)

> > Another problem is, that there seems to be a single tweakable knob to
> > read-ahead in Linux 2.6, accessible in several ways:
> > /sys/block/<dev>/queue/max_sectors_kb
> > /sbin/blockdev --setra
> > /sbin/blockdev --setfra
>
> unsigned long max_sane_readahead(unsigned long nr)
> {
>         return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
>                 + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> }
>
> So you can affect it indirectly by keeping a lot of memory free
> with vm.min_free_kbytes. Probably not an optimal solution.

Of course, not even a viable one ;)

Here is the memory demand of concurrent readahead: for a 1MB readahead
size, each stream requires about 2MB of memory to keep it safe from
readahead thrashing. So for a server with 1000 streams, 2GB is enough
for readahead.

My old adaptive readahead patches can significantly reduce this
requirement - e.g. cut that 2GB down to 500MB. However, who cares
(please speak out!)? Servers seem to have plenty of memory nowadays..

> >
> > Based on some manpages on the madvise() and fadvise() functions, I'd
> > say that the level of read-ahead corresponding to MADV_SEQUENTIAL and
> > FADV_SEQUENTIAL is still decimal orders less than the desired figure.
>
> Wu Fengguang (cc'ed) is doing a lot of work on the MADV_* readahead
> algorithms. There was a recent new patchkit from him on linux-kernel
> that you might try. It still uses strict limits, but it's better
> at figuring out specific patterns.
>
> But then if you really know very well what kind of readahead
> is needed it might be best to just implement it directly in the
> applications than to rely on kernel heuristics.

File downloading servers typically run sequential reads/writes, which
can be well served by the kernel readahead logic.

Apache/lighttpd have the option to do mmap reads. For these sequential
mmap read workloads, these new patches are expected to serve them well:

        http://lwn.net/Articles/327647/

> For example for faster booting sys_readahead() is widely used
> now.

And the more portable/versatile posix_fadvise() hints :-)

Thanks,
Fengguang
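
P.S. A minimal userspace sketch of the posix_fadvise()/readahead()
hints mentioned above, for applications that want to drive their own
readahead rather than rely on the kernel heuristics. The file name and
the 2MB prefetch window are illustrative assumptions only, not taken
from any real workload:

/* Sketch: hint the kernel about a sequential read of one big file. */
#define _GNU_SOURCE             /* for readahead(2) */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* hypothetical big file served sequentially */
        int fd = open("/srv/bigfile.iso", O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* Declare a sequential access pattern for the whole file, so
         * the kernel can ramp up its readahead window accordingly. */
        int err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        if (err != 0)
                fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

        /* Or prefetch an explicit range, like the boot-time users of
         * sys_readahead() mentioned above (here: the first 2MB). */
        if (readahead(fd, 0, 2 * 1024 * 1024) != 0)
                perror("readahead");

        /* ... then read() the file as usual ... */
        close(fd);
        return 0;
}

The fadvise call only describes the access pattern and leaves the
window sizing to the kernel, while readahead() pulls the given range
into the page cache directly, which is why it suits the boot-time
prefetch case.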