>> [ ... ] supposed to hold the object storage layer of a BeeFS
>> highly parallel filesystem, and therefore will likely have
>> mostly-random accesses.

> Where do you get the assumption from that FhGFS/BeeGFS is
> going to do random reads/writes or the application of top of
> it is going to do that?

In this specific case it is not an assumption, thanks to the
prominent fact that the original poster was testing (locally I
guess) and complaining about concurrent reads/writes, which result
in random-like arm movement even if each of the read and write
streams is entirely sequential. I even pointed this out, probably
not explicitly enough:

>> when doing only reading / only writing , the speed is very
>> fast(~1.5G), but when do both the speed is very slow
>> (100M), and high r_await(160) and w_await(200000).

BTW the 100MB/s aggregate over 31 drives means around 3MB/s per
drive, which seems pretty good for a RW workload with
mostly-random accesses and high RMW correlation.

Also, if this testing was appropriate, that is because the
intended workload was indeed concurrent reads and writes to the
object store.

It is not a mere assumption in the general case either; it is
both commonly observed and a simple deduction from the nature of
distributed filesystems, in particular parallel HPC ones like
Lustre or BeeGFS, but also AFS and even NFS ones:

* Clients have caches. Therefore most of the locality in the
  (read) access patterns will hopefully be filtered out by the
  client cache. This applies (ideally) to any distributed
  filesystem.

* HPC/parallel servers tend to have many clients (e.g. for an
  HPC cluster it could be 10,000 clients and 500 object storage
  servers), hopefully each client works on a different subset of
  the data tree, and the distribution of data objects onto
  servers is hopefully random. Therefore it is likely that many
  clients will concurrently read and write many different files
  on the same server, resulting in many random "hotspots" in
  each server's load.

Note that each client could be doing entirely sequential IO to
each file it accesses, but the concurrent accesses to possibly
widely scattered files will turn that into random IO at the
server level (a tiny sketch at the end of this message
illustrates this). Just about the only case where sequential
client workloads don't become random workloads at the server is
when the client workload is such that only one file is "hot" per
server.

There is an additional issue favouring random access patterns:

* Typically large fileservers are set up with a lot of storage
  because of anticipated lifetime usage, so they start mostly
  empty.

* Most filesystems then allocate new data in regular patterns,
  often starting from the beginning of available storage, usually
  in an attempt to minimize arm travel time (XFS uses various
  heuristics, which are somewhat different depending on whether
  the option 'inode64' is specified or not).

* Unfortunately, as the filetree becomes larger, new allocations
  have to be made farther away, resulting in longer travel times
  and more apparent randomness at the storage server level.

* Eventually, if the object server reaches a steady state where
  roughly as much data is deleted as is created, the free storage
  areas will become widely scattered, leading to essentially
  random allocation; the more capacity is used, the more random
  the allocation.

Leaving a significant percentage of capacity free, like at least
10% and more like 20%, greatly increases the chance of finding
free space in which to put new data near existing "related" data
(the toy simulation below illustrates the effect).
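To make that last point a bit more concrete, here is a toy
simulation in plain Python. It is NOT how XFS allocates; the
block counts, object sizes and the first-fit policy are all
invented for illustration. It only shows how, with create/delete
churn, the free space ends up in more and smaller pieces the
closer one runs to full:

# Toy model, NOT how XFS allocates: a disk as a block bitmap with
# first-fit allocation, random-sized "objects" created and deleted
# around a target utilization.  All numbers are invented.
import random

BLOCKS = 10_000                  # size of the toy disk, in blocks

def first_fit(free, nblocks):
    """Return the start of the first run of >= nblocks free blocks."""
    start, run = None, 0
    for b in range(BLOCKS):
        if free[b]:
            if start is None:
                start = b
            run += 1
            if run == nblocks:
                return start
        else:
            start, run = None, 0
    return None

def free_extents(free):
    """Lengths of all contiguous free extents."""
    lengths, run = [], 0
    for f in free:
        if f:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths

def churn(target_used, steps=2500):
    random.seed(0)
    free = [True] * BLOCKS
    objects, used = [], 0
    for _ in range(steps):
        if used / BLOCKS < target_used:      # grow towards the target
            size = random.randint(4, 64)
            start = first_fit(free, size)
            if start is None:
                continue
            for b in range(start, start + size):
                free[b] = False
            objects.append((start, size))
            used += size
        else:                                # steady state: delete something
            start, size = objects.pop(random.randrange(len(objects)))
            for b in range(start, start + size):
                free[b] = True
            used -= size
    return free_extents(free)

for target in (0.5, 0.8, 0.95):
    ext = churn(target)
    print(f"~{target:.0%} full: {len(ext)} free extents, "
          f"largest {max(ext)} blocks, average {sum(ext)/len(ext):.1f}")

The largest remaining free extent shrinks sharply as the target
utilization rises, which is the "essentially random allocation"
regime described above.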
Keeping that free-space margin increases locality, but only at
the single-stream level; therefore it usually does not help
widely shared distributed servers that much, and in particular
it does not apply that much to object stores, because they
usually obscure which data object is related to which.

The above issues are pretty much "network and distributed
filesystems for beginners" notes, but in significant part they
also apply to widely shared non-network, non-distributed servers
on which XFS is often used, so they may be usefully mentioned on
this list.
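PS: a tiny, purely illustrative Python sketch of the earlier
point that per-client sequential streams become random IO at the
server. The numbers (50 clients, 1000-block files, a 10M-block
device, round-robin service) are invented, and real servers queue
and reorder requests, but the effect on seek distances is the
point:

# Each client reads its own file strictly sequentially; the server
# merely interleaves the requests.  Made-up numbers throughout.
import random

random.seed(0)
DEVICE_BLOCKS = 10_000_000
FILE_BLOCKS   = 1_000
NCLIENTS      = 50

# Each client's file starts at a random place on the device.
starts = [random.randrange(DEVICE_BLOCKS - FILE_BLOCKS)
          for _ in range(NCLIENTS)]

def avg_seek(request_stream):
    """Average distance (in blocks) between consecutive requests."""
    jumps = [abs(b - a) for a, b in zip(request_stream, request_stream[1:])]
    return sum(jumps) / len(jumps)

# One client alone: the server sees its blocks in order.
single = [starts[0] + i for i in range(FILE_BLOCKS)]

# Fifty clients interleaved round-robin: each stream is still
# sequential, but consecutive requests at the server jump between
# widely scattered files.
mixed = []
for i in range(FILE_BLOCKS):
    for c in range(NCLIENTS):
        mixed.append(starts[c] + i)

print(f"1 client : avg seek distance {avg_seek(single):.0f} blocks")
print(f"{NCLIENTS} clients: avg seek distance {avg_seek(mixed):,.0f} blocks")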