Re: Problem about very high Average Read/Write Request Time

[ ... ]

>>>> [ ... ] if the device name "/data/fhgfs/fhgfs_storage" is
>>>> descriptive, this "brave" RAID5 set is supposed to hold
>>>> the object storage layer of a BeeGFS highly parallel
>>>> filesystem, and therefore will likely have mostly-random
>>>> accesses.
[ ... ]
>>> Where do you get the assumption from that FhGFS/BeeGFS is
>>> going to do random reads/writes or the application of top of
>>> it is going to do that?

>> It is not a mere assumption in the general case either; it is
>> both commonly observed and a simple deduction, because of the
>> nature of distributed filesystems and in particular parallel
>> HPC ones like Lustre or BeeGFS, but also AFS and even NFS ones.
[ ... ]

>> * Clients have caches.

> Correct is: Client *might* have caches. Besides of application
> directio, for BeeGFS the cache type is a configuration option.

Perhaps you have missed the explicit qualification «in the
general case» of «distributed filesystems and in particular
parallel HPC ones», or perhaps you are unfamiliar with «Lustre
or BeeGFS, but also AFS and even NFS ones», most of which have
client caches that are usually enabled; that might explain your
inability to consider «the general case».

>> Therefore most of the locality in the (read) access patterns
>> will hopefully be filtered out by the client cache. This
>> applies (ideally) to any distributed filesystem.

> You cannot filter out everything, e.g. random reads of a large
> file.

It is good, if somewhat pointless, that you can understand the
meaning of «most of the locality in the (read) access patterns
will hopefully be filtered out by the client cache», agree with
it, and even supply an example; but unfortunately you seem to
have the naive expectation that:

>> Local or remote file system does not matter here.

It can matter as:

* In the local case there is a single cache for all concurrent
  applications, while in the distributed case there is hopefully
  a separate cache per node, which segments the references (as
  well as hopefully providing a lot more cache space).
* In the purely local case there is usually just one level of
  caching, while in the distributed case there are usually two
  levels, often resulting in rather different access patterns
  to the object stores in the server.

So the degree of filtering can be, and often is, quite
different, which usually matters a lot because network
transfers add cost.
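The effect of those per-node cache levels can be sketched with a
toy simulation (all sizes, client counts, and access patterns
below are invented for illustration, not measured from any real
deployment): each client's cache absorbs most of the locality in
its own accesses, so the object server only sees the residue.

```python
# Toy model: per-client LRU caches filter locality before the
# object server is reached. All sizes here are invented.
from collections import OrderedDict
import random

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def access(self, key):
        """Return True on a hit; on a miss, insert and evict the LRU entry."""
        if key in self.store:
            self.store.move_to_end(key)
            return True
        self.store[key] = None
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)
        return False

random.seed(0)
# Each client repeatedly re-reads a small "hot" set of its own blocks
# (high locality); the client cache holds most but not all of it.
clients = [LRUCache(64) for _ in range(8)]
server_requests = []
for _ in range(10_000):
    c = random.randrange(8)
    block = (c, random.randrange(100))   # 100 hot blocks, 64 cache slots
    if not clients[c].access(block):     # only misses cross the network
        server_requests.append(block)

# The hits never leave the clients; the server sees only the misses,
# a much smaller and much less local request stream.
print(len(server_requests), "of 10000 requests reached the server")
```

In the purely local case there would be one shared cache instead
of eight segmented ones, and no second level behind it, so the
backing store would see a quite different miss stream.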

As to these three comments, I am perplexed:

>> Therefore it is likely that many clients will access with
>> concurrent read and write many different files on the same
>> server resulting in many random "hotspots" in each server's
>> load.

> If that would be important here there would be no difference
> between single write and parallel read/write.
[ ... ]
>> each client could be doing entirely sequential IO to each
>> file they access, but the concurrent accesses to possibly
>> widely scattered files will turn that into random IO at the
>> server level.
[ ... ]
> How does this matter if the op is comparing 1-thread write
> vs. 2-thread read/write?

>> * Eventually if the object server reaches a steady state
>> where roughly as much data is deleted and created the free
>> storage areas will become widely scattered, leading to
>> essentially random allocation, the more random the more
>> capacity used.

> All of that is irrelevant if a single write is fast and a
> parallel read/write is slow.

You seem rather confused here: my explanation was the answer to
this question you asked:

  >>> Where do you get the assumption from that FhGFS/BeeGFS is
  >>> going to do random reads/writes or the application of top
  >>> of it is going to do that?

and in it you mention no special case like «1-thread write» or
«2-thread read/write».

Also, such simple special cases don't happen much in «the object
storage layer» of any realistic «highly parallel filesystem»,
which is often large, with vast and varied workloads, as I tried
to remind you:

  >> HPC/parallel servers tend to have many clients (e.g. one
  >> site could have 10,000 clients and 500 object storage
  >> servers) and hopefully each client works on a different
  >> subset of the data tree, with the distribution of data
  >> objects onto servers hopefully random.

Therefore there are likely to be many dozens or even hundreds of
threads accessing objects per object store, with every pattern
of reads and writes, and to rather unrelated objects; not just 1
or 2 threads doing a single write or a read/write.
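That interleaving effect can be sketched with a toy simulation
(the client count and the round-robin-ish scheduling below are
invented for illustration): every client streams its own file
perfectly sequentially, yet the merged stream arriving at the
object server is almost never sequential.

```python
# Toy model: per-client sequential IO becomes random IO at the
# server once many clients' streams interleave. Invented sizes.
import random

random.seed(1)
NUM_CLIENTS = 100
positions = {c: 0 for c in range(NUM_CLIENTS)}  # next block per client file

server_stream = []  # (file, block) pairs in arrival order at the server
for _ in range(10_000):
    c = random.randrange(NUM_CLIENTS)        # some ready client is serviced
    server_stream.append((c, positions[c]))  # perfectly sequential per client
    positions[c] += 1

# Count how often two back-to-back server requests are contiguous,
# i.e. same file and adjacent block.
sequential = sum(
    1
    for prev, cur in zip(server_stream, server_stream[1:])
    if cur[0] == prev[0] and cur[1] == prev[1] + 1
)
print(f"{sequential / len(server_stream):.1%} of server requests were sequential")
```

With 100 interleaved clients, only about one request in a hundred
follows its predecessor on disk, even though every single client
is doing entirely sequential IO.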

That's one reason why XFS is so often used for those object
stores: it is particularly well suited to highly multithreaded
access patterns across many files, as XFS has benefited from
quite a bit of effort on finer-grained locking, and it uses some
mostly effective heuristics to distribute files across the
storage it manages in hopefully "best" ways.
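A rough illustration of that kind of placement heuristic (this
is a toy model, not XFS's actual allocator code): with the
inode64-style allocator, new directories are rotored across
allocation groups (AGs), and files tend to land in their parent
directory's AG, so unrelated workloads spread across the device
and contend on different AG structures.

```python
# Very simplified model of directory/file placement across
# allocation groups; all names and sizes are invented.
from collections import Counter

AG_COUNT = 32  # e.g. a filesystem carved into 32 allocation groups

class TinyAllocator:
    """Toy stand-in for round-robin directory placement plus
    keep-files-near-their-parent file placement."""

    def __init__(self, ag_count):
        self.ag_count = ag_count
        self.next_ag = 0

    def place_directory(self):
        # New directories are rotored round-robin across the AGs.
        ag = self.next_ag
        self.next_ag = (self.next_ag + 1) % self.ag_count
        return ag

    def place_file(self, parent_ag):
        # Files stay close to their parent directory for locality.
        return parent_ag

alloc = TinyAllocator(AG_COUNT)
dir_ags = [alloc.place_directory() for _ in range(64)]
spread = Counter(dir_ags)
# 64 directories land evenly, two per AG, so concurrent writers
# into different directories work in different regions of the disk.
print(sorted(spread.items())[:4])
```

The point of the sketch: with many writers in many directories,
each allocation group serves only a slice of the load, which is
exactly what helps under the highly concurrent object-store
workloads described above.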

>> The above issues are pretty much "network and distributed
>> filesystems for beginners" notes,

> It is lots of text

In my original reply I was terse and did not explain every
reason why «the object storage layer of a BeeGFS highly parallel
filesystem» «will likely have mostly-random accesses», because I
assumed that is common knowledge among somewhat skilled readers;
but up to a point I am also patient with beginners, even those
who seem to become confused about which question they themselves
asked.

Also, I am trying to quote context because you seem confused as
to what even your own questions contained.

> and does not help the op at all.

That seems unfortunately right, as to me you still seem very
confused as to the workloads likely experienced by object stores
for highly parallel filesystems, despite my efforts to answer in
detail the question you asked:

  >>> Where do you get the assumption from that FhGFS/BeeGFS is
  >>> going to do random reads/writes or the application of top
  >>> of it is going to do that?

In any case, as I already pointed out, my answer to your
question is at least somewhat topical for the XFS list, for
example in hinting at less "brave" configurations than 32-disk
RAID5 sets.

> And the claim/speculation that the parallel file system would
> introduce random access is also wrong.

As far as I can see it was only you who mentioned that, because
I discussed just the consequences of the likely access patterns
of the «application of top of it» part of your question.

It seemed strange to me that you would ask why «FhGFS/BeeGFS is
going to do random reads/writes» because filesystems typically
don't do «read/writes» except as a consequence of application
requests, so I ignored that other part of your question.

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs




