On Thu, Feb 28, 2013 at 4:54 PM, Brian Foster <bfoster at redhat.com> wrote:
...
>
> From skimming through the code and watching a writepage tracepoint, I
> think the high-level situation is as follows:
>
> - write()'s to the loop (block) device hit page cache as buffers. This
>   data is subject to similar caching/writeback behavior as a local
>   filesystem (e.g., write returns when the data is cached; if the write
>   is sync, wait on a flush before returning).
> - Flushing eventually kicks in, which is page based and results in a
>   bunch of writepage requests. The block/buffer handling code converts
>   these writepage requests into 4k I/O (bio) requests.
> - These 4k I/O requests hit loop. In the file-backed case, it issues
>   write() requests to the underlying file.
> - In the local filesystem case, I believe this would result in further
>   caching in the local filesystem mapping. In the case of fuse, requests
>   to userspace are submitted immediately, thus gluster is now receiving
>   4k write requests rather than 128k requests when writing to the file
>   directly via dd with a 1MB buffer size.
>
> Given that you can reproduce the sync write variance without Xen, I
> would rule that out for the time being and suggest the following:
>
> - Give the profile tool a try to compare the local loop case when
>   throughput is higher vs. lower. It would be interesting to see if
>   anything jumps out that could help explain what is happening
>   differently between the runs.
> - See what caching/performance translators are enabled in your gluster
>   client graph (the volname-fuse.vol volfile) and see about disabling
>   some of those one at a time, e.g.:
>
>   gluster volume set myvol io-cache disable
>   (repeat for write-behind, read-ahead, quick-read, etc.)
>
>   ... and see if you get any more consistent results (good or bad).
> - Out of curiosity (and if you're running a recent enough gluster), try
>   the fopen-keep-cache mount option on your gluster mount and see if it
>   changes any behavior, particularly with a cleanly mapped loop dev.
>
> Brian

Thanks for taking the time to investigate and explain all this, I'm very
grateful for that.

I had two clients set up to do a 2GB sync write, then sleep 3600 seconds
(rough sketch below). The writes went to loop devices backed by a file on
Gluster. The writes were staggered to avoid overlap, and I created the
loop devices a few hours apart.

Client 1:
00:50:49 - 2.1 GB copied, 53.2285 s, 39.4 MB/s
01:51:43 - 2.1 GB copied, 53.943 s, 38.9 MB/s
02:52:37 - 2.1 GB copied, 619.662 s, 3.4 MB/s
04:02:56 - 2.1 GB copied, 675.043 s, 3.1 MB/s

Client 2:
00:12:27 - 2.1 GB copied, 51.0465 s, 41.1 MB/s
01:13:18 - 2.1 GB copied, 59.8881 s, 35.0 MB/s
02:14:18 - 2.1 GB copied, 817.714 s, 2.6 MB/s
03:27:55 - 2.1 GB copied, 609.257 s, 3.4 MB/s

I have more data points, but the shift is clearly there: somewhere
between 01:51 and 02:14, the transfer rate drops from ~40 MB/s to
~3 MB/s. This is consistent with the findings from a couple of days ago,
where things had slowed down overnight.

Removing and re-adding the loop device (see below) gets us back to
~40 MB/s, without me noticing any error messages or latency due to
flushing. I didn't measure whether any flushing was going on, but
"losetup -d" returned pretty much immediately. All writes are done with
sync, so I don't quite understand where cache flushing comes in.

On one client I had a slow loop device from yesterday and a fresh, new
loop device going at the same time. The old one managed ~3.5 MB/s, while
the new one got ~40 MB/s.
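In case the exact invocation matters, the per-client test loop was
essentially the following (the loop device path is just an example, and
I'm taking oflag=sync as the thing that makes the writes synchronous):

while true; do
    date
    # 2048 x 1MB = the "2.1 GB copied" dd reports above
    dd if=/dev/zero of=/dev/loop0 bs=1M count=2048 oflag=sync
    sleep 3600
done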
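The "remove and re-add" step is nothing more exotic than detaching and
re-attaching the device (the backing file path is an example):

$ losetup -d /dev/loop0                      # detach; returns almost immediately
$ losetup /dev/loop0 /mnt/gluster/disk.img   # re-attach the same backing file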
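To see what the bricks actually receive, I set up a script that cleared
the Gluster profiling stats, did a 60-second dd write, ran "sync", and
then dumped the profiling stats; Gluster was idle except for this
script. Roughly like this (the volume name "myvol" and the timeout trick
are assumptions on my part; profiling must already be enabled with
"gluster volume profile myvol start"):

# reading "info" also resets the interval stats, clearing the counters
gluster volume profile myvol info > /dev/null
# bound the write to roughly 60 seconds
timeout 60 dd if=/dev/zero of=/dev/loop0 bs=1M oflag=sync
sync
gluster volume profile myvol info > profile-after-dd.txt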
Looking at those results, we can clearly see a difference in behavior.

$ ./parse-gluster-info.py slow-and-fast/auto-slow
Brick: storage01.gluster.trollweb.net:/srv/gluster/brick1
Block size   reads   writes
4096b+       0       49152
131072b+     7       0

Brick: storage02.gluster.trollweb.net:/srv/gluster/brick1
Block size   reads   writes
4096b+       0       49152

$ ./parse-gluster-info.py slow-and-fast/auto-fast
Brick: storage01.gluster.trollweb.net:/srv/gluster/brick0
Block size   reads   writes
4096b+       0       106
8192b+       0       68
16384b+      0       305
32768b+      0       2261
65536b+      0       21285
131072b+     0       466

Brick: storage02.gluster.trollweb.net:/srv/gluster/brick0
Block size   reads   writes
4096b+       0       106
8192b+       0       68
16384b+      0       305
32768b+      0       2261
65536b+      0       21285
131072b+     22      466

Note that the slow and the fast run hit different bricks. I made new
loop devices until one was backed by brick1; its transfer rate was as
expected:

$ ./parse-gluster-info.py slow-and-fast/auto-fast-loop3
Brick: storage01.gluster.trollweb.net:/srv/gluster/brick1
Block size   reads   writes
4096b+       0       165
8192b+       0       99
16384b+      0       70
32768b+      0       805
65536b+      0       19086
131072b+     0       851

Brick: storage02.gluster.trollweb.net:/srv/gluster/brick1
Block size   reads   writes
4096b+       0       165
8192b+       0       99
16384b+      0       70
32768b+      0       805
65536b+      0       19086
131072b+     22      851

The script and the gluster profile output can be downloaded at
http://torbjorn-dev.trollweb.net/gluster/.

To me it seems that a fresh loop device does mostly 64KB writes, and at
some point during a 24-hour window it changes to doing 4KB writes?

--
Best regards
Torbjørn Thorsen
Developer / operations technician
Trollweb Solutions AS - Professional Magento Partner
www.trollweb.no
Daytime phone: +47 51215300
Evening/weekend phone: for customers with a service agreement
Visiting address: Luramyrveien 40, 4313 Sandnes
Postal address: Maurholen 57, 4316 Sandnes
Remember that all our standard terms always apply.