>> Why does mdadm still use 64k for the default chunk size?

> Probably because this is the best balance for average file
> sizes, which are smaller than you seem to be testing with?

Well, "average file size" relates less to chunk size than access patterns do. Single-threaded sequential reads with smaller block sizes tend to perform better with smaller chunk sizes, for example. File sizes do influence the access patterns seen by the disks a bit.

The goals of a chunk size choice are to spread a single ''logical'' (application or filesystem level) operation over as many arms as possible while keeping rotational latency low, to minimize arm movement, and to minimize arm contention among different threads. Thus the tradeoffs influencing chunk size are about the sequential vs. random nature of reads or writes, how many blocks are involved in a single ''logical'' operation, and how many threads are accessing the array. The goal here is not to optimize the speed of the array, but the throughput and/or latency of the applications using it. A good idea is to consider the two extremes: a chunk size of 1 sector and a chunk size of the whole disk (or perhaps more interestingly 1/2 disk).

For example, consider a RAID0 of 4 disks ('a', 'b', 'c', 'd') with a chunk size of 8 sectors. To read the first 16 chunks or 128 sectors of the array these sector read operations ['get(device,first,last)'] have to be issued (a small sketch reproducing this mapping is appended at the end of this message):

  00-31:  get(a,0,7)    get(b,0,7)    get(c,0,7)    get(d,0,7)
  32-63:    get(a,8,15)   get(b,8,15)   get(c,8,15)   get(d,8,15)
  64-95:      get(a,16,23)  get(b,16,23)  get(c,16,23)  get(d,16,23)
  96-127:       get(a,24,31)  get(b,24,31)  get(c,24,31)  get(d,24,31)

I have indented the lists to show the increasing offset into each block device.

Now, the big questions here are all about the interval between these operations, that is, how large the ''logical'' operations are and how much they cluster in time and space. For example, in the above sequence it matters whether clusters of operations involve less than 32 sectors or not, and what the likely interval is between clusters generated by different concurrent applications (consider rotational latency and the likelihood of the arm being moved between successive clusters).

So that space/time clustering depends more on how applications process their data, how many applications concurrently access the array, and whether they are reading or writing. The latter point involves an exceedingly important asymmetry that is often forgotten: an application read can only complete when the last block has been read, while a write can complete as soon as it is issued.

So the time clustering of sector reads depends on how long ''logical'' reads are as well as how long the interval between them is. An application that issues frequent small reads rather than infrequent large ones may therefore work best with a small chunk size. Not much of this is related to the distribution of file sizes, unless that distribution influences the space/time clustering of application-issued operations...

In general I prefer smaller chunk sizes to larger ones, because the latter work well only in somewhat artificial cases like simple-minded benchmarks. This holds in particular if one uses parity-based arrays (not a good idea in general...), as small chunk sizes (and thus small stripe sizes) give a better chance of reducing the frequency of RMW cycles (the second sketch below illustrates this). Counter to this is that the Linux IO queueing subsystem (elevators etc.) perhaps does not always take advantage of parallelizable operations across disks as much as it could, and that there can be bandwidth bottlenecks (e.g. the PCI bus).
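
To make the chunk-to-disk mapping above concrete, here is a small Python sketch. It is illustrative only: it assumes a plain RAID0 layout with no data offset, and the device names, chunk size and sector range are just the example values from above, not anything mdadm-specific.

  # Map a logical sector range on a RAID0 array to per-device reads,
  # assuming a plain striped layout with no data offset.
  def raid0_reads(first, last, devices, chunk):
      """Yield (device, dev_first, dev_last) for array sectors first..last."""
      sector = first
      while sector <= last:
          chunk_index = sector // chunk                 # which chunk of the array
          device = devices[chunk_index % len(devices)]  # chunks go round-robin over disks
          dev_chunk = chunk_index // len(devices)       # which chunk on that device
          offset = sector % chunk
          dev_sector = dev_chunk * chunk + offset
          # read to the end of this chunk, or to 'last', whichever comes first
          run = min(chunk - offset, last - sector + 1)
          yield device, dev_sector, dev_sector + run - 1
          sector += run

  # The 4-disk, 8-sector-chunk example: the first 128 sectors of the array.
  for dev, lo, hi in raid0_reads(0, 127, ['a', 'b', 'c', 'd'], 8):
      print("get(%s,%d,%d)" % (dev, lo, hi))

Running it prints the sixteen get() operations listed above, four per disk at increasing device offsets.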
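
On the RMW point for parity arrays, here is a second, equally rough sketch. The numbers (4 data disks, sector-sized units) are assumptions chosen for illustration, and this is not the md driver's actual logic; it just counts how many stripes a single write covers fully (parity can be computed from the new data alone) versus partially (old data and/or parity must be read back first, i.e. RMW).

  # Count full-stripe vs partial-stripe writes for a parity array,
  # given a write of 'length' sectors at array data offset 'offset'.
  def rmw_estimate(offset, length, data_disks, chunk):
      stripe = data_disks * chunk                  # data sectors per stripe
      first_stripe = offset // stripe
      last_stripe = (offset + length - 1) // stripe
      full = partial = 0
      for s in range(first_stripe, last_stripe + 1):
          s_start, s_end = s * stripe, (s + 1) * stripe - 1
          covered = min(s_end, offset + length - 1) - max(s_start, offset) + 1
          if covered == stripe:
              full += 1      # whole stripe written: no read-back needed
          else:
              partial += 1   # partial stripe: read-modify-write needed
      return full, partial

  # A 128 KiB write (256 sectors of 512 bytes) across 4 data disks:
  print(rmw_estimate(0, 256, 4, 32))    # 16 KiB chunks, 64 KiB stripe  -> (2, 0)
  print(rmw_estimate(0, 256, 4, 128))   # 64 KiB chunks, 256 KiB stripe -> (0, 1)

With the smaller chunk the same write lands as two full stripes, while with the 64 KiB chunk it only half-fills one stripe and forces an RMW, which is the effect described above.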