Jeff Darcy wrote:
On 01/31/2010 09:06 AM, Ran wrote:
You guys are talking about network IO; I'm talking about the gluster server disk IO.
The idea to shape the traffic does make sense, since the virtual machine
servers do use the network to get to the disks (gluster),
but what if there are, say, 5 KVM servers (with VPSes) all on
gluster? What do you do then? It's not quite a fair share, since every
server has its own fair share and doesn't see the others.
Also, there are other applications that use gluster, like mail etc.,
and I see that gluster IO is very high very often, causing the whole
storage not to work.
It's very disturbing.

You bring up a good set of points. Some of these problems can be
addressed at the hypervisor (i.e. GlusterFS client) level, some can be
addressed by GlusterFS itself, and some can be addressed only at the
level of the local-filesystem or block-device level on the GlusterFS
servers.

That sentence doesn't really parse for me. A part of the problem is that
Ran didn't really specify what his storage setup is (DAS in the host or
SAN), and whether the "uses up all disk I/O" is referring to it using up
all the available disk I/O on just the local virtualization host (DAS)
or whether the access pattern from one server is eating all the disk I/O
for all the other servers connected to the SAN. Obviously, one is more
pathological than the other, but without knowing the details it is
impossible to point the finger at gluster when the problem could be more
deeply rooted (e.g. a mis-optimization of the RAID array). Optimizing
file systems is a relatively complex thing and a lot of the conventional
wisdom is just plain wrong at times.
Here's an article I wrote on the subject a while back:
http://www.altechnative.net/e107_plugins/content/content.php?content.11
I'm not sure how much of this is applicable to the specific case being
discussed but I cannot help but wonder just how many (if any at all)
"enterprise grade" storage solutions take all of what is mentioned there
into account. In my experience the difference in I/O throughput can be
quite staggering, especially for random I/O.

Unfortunately, I/O traffic shaping is still in its infancy
compared to what's available for networking - or perhaps even "infancy"
is too generous. As far as the I/O stack is concerned, all of the
traffic is coming from the glusterfsd process(es) without
differentiation, so even if the functionality to apportion I/O amongst
tasks existed it wouldn't be usable without more information. Maybe
some day...

I don't think this would even be useful. It sounds like seeking more
finely grained (sub-process level!) control over disk I/O prioritisation
without there even being a clearly presented case about the current
functionality (ionice) not being sufficient.
If you are running a glfs server in a guest VM, and that VM is consuming
all of the disk I/O available to the host, then the guest VM container
process (qemu for qemu or KVM, vmx for vmware, etc.) can be ionice-d to
lower its priority and give the other VMs more share of the disk I/O. I
haven't heard an argument yet explaining why that is not sufficient in
this case.
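
To make the ionice suggestion concrete, here is a minimal sketch in Python
that finds the VM container processes and lowers their I/O priority. It is
an illustration under assumptions, not a tested recipe: the "qemu" name
fragment, the helper names, and the choice of best-effort class at level 7
are mine, and as far as I know I/O priorities are only honoured by I/O
schedulers that implement them (such as CFQ).

```python
import os
import subprocess


def find_pids(name_fragment):
    """Return PIDs whose /proc/<pid>/comm contains name_fragment (Linux only)."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as fh:
                if name_fragment in fh.read():
                    pids.append(int(entry))
        except OSError:
            # The process exited or is not readable by us; skip it.
            continue
    return pids


def ionice_command(pid, io_class=2, level=7):
    """Build an ionice call: class 2 is best-effort, level 7 its lowest priority."""
    return ["ionice", "-c", str(io_class), "-n", str(level), "-p", str(pid)]


def throttle(name_fragment="qemu", dry_run=True):
    """Lower the I/O priority of matching processes; with dry_run, only print."""
    for pid in find_pids(name_fragment):
        cmd = ionice_command(pid)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling throttle() prints the commands it would run; throttle(dry_run=False)
runs them, which normally needs root or ownership of the target processes.
The name fragment is a guess and would need adjusting to whatever the
hypervisor process is called on the host in question.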

What you can do now at the GlusterFS level, though, is make sure that
traffic is distributed across many servers and possibly across many
volumes per server to take advantage of multiple physical disks and/or
interconnects for one server. That way, a single VM will only use a
small subset of the servers/volumes and will not starve other clients
that are using different servers/volumes (except for network bottlenecks
which are a separate issue). That's what the "distribute" translator is
for, and it can be combined with replicate or stripe to provide those
functions as well. Perhaps it would be useful to create and publish
some up-to-date recipes for these sorts of combinations.
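
The effect of spreading files across servers and volumes can be sketched
with a toy placement function in Python. This is only an illustration of
the idea: it is not the hash or layout logic that the distribute translator
actually uses, and the brick names and image file names are made up.

```python
import hashlib
from collections import Counter


def pick_brick(filename, bricks):
    """Map a file name to one brick by hashing it (toy stand-in for real placement)."""
    digest = hashlib.md5(filename.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(bricks)
    return bricks[index]


# Hypothetical layout: two servers with two exported volumes each.
bricks = [
    "server1:/export/a",
    "server1:/export/b",
    "server2:/export/a",
    "server2:/export/b",
]

# Hypothetical VM image files, one per guest.
images = [f"vm{n:03d}.img" for n in range(100)]

# Count how many images land on each brick.
placement = Counter(pick_brick(name, bricks) for name in images)

for brick in bricks:
    print(brick, placement[brick])
```

Each image lands on exactly one brick, so a guest hammering its own image
loads only the disk behind that brick, while guests whose images live on
other bricks keep their own spindles to themselves.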

Hold on, you seem to be talking about something else here. You're
talking about clients not distributing their requests evenly across
servers. Is that really what the original problem was about? My
understanding of the original post was that a glfs server VM (KVM) was
consuming more than its fair share of disk I/O capability, and that
there was a need to throttle it - which can be done by applying ionice
to the qemu container process.
Given that this has been pretty much ignored, I'm guessing that I'm
missing the point and that my understanding of the problem being
experienced is in some way incorrect. So can we have some clarification
on it, with the explanation of why ionice-ing the qemu process isn't
applicable? What other feature is required and why exactly would it be
useful?
Gordan