On Thu, Feb 28, 2013 at 4:54 PM, Brian Foster <bfoster at redhat.com> wrote:
...
>
> From skimming through the code and watching a writepage tracepoint, I
> think the high-level situation is as follows:
>
> - write()'s to the loop (block) device hit page cache as buffers. This
>   data is subject to similar caching/writeback behavior as a local
>   filesystem (e.g., write returns when the data is cached; if the write
>   is sync, wait on a flush before returning).
> - Flushing eventually kicks in, which is page based and results in a
>   bunch of writepage requests. The block/buffer handling code converts
>   these writepage requests into 4k I/O (bio) requests.
> - These 4k I/O requests hit loop. In the file-backed case, it issues
>   write() requests to the underlying file.
> - In the local filesystem case, I believe this would result in further
>   caching in the local filesystem mapping. In the case of fuse, requests
>   to userspace are submitted immediately, thus gluster is now receiving
>   4k write requests rather than 128k requests when writing to the file
>   directly via dd with a 1MB buffer size.
>
> Given that you can reproduce the sync write variance without Xen, I
> would rule that out for the time being and suggest the following:
>
> - Give the profile tool a try to compare the local loop case when
>   throughput is higher vs. lower. It would be interesting to see if
>   anything jumps out that could help explain what is happening
>   differently between the runs.
> - See what caching/performance translators are enabled in your gluster
>   client graph (the volname-fuse.vol volfile) and see about disabling
>   some of those one at a time, e.g.:
>
>   gluster volume set myvol io-cache disable
>   (repeat for write-behind, read-ahead, quick-read, etc.)
>
>   ... and see if you get any more consistent results (good or bad).
> - Out of curiosity (and if you're running a recent enough gluster), try
>   the fopen-keep-cache mount option on your gluster mount and see if it
>   changes any behavior, particularly with a cleanly mapped loop dev.
>
> Brian

Thanks for taking the time to investigate and explain all this, I'm very
grateful for that.

I had two clients set up to do a 2GB sync write, then sleep 3600 seconds
(rough sketch below). The writes went to loop devices backed by a file on
Gluster. The writes were staggered to avoid overlap, and I created the
loop devices a few hours apart.

Client 1:
00:50:49 - 2.1 GB copied, 53.2285 s, 39.4 MB/s
01:51:43 - 2.1 GB copied, 53.943 s, 38.9 MB/s
02:52:37 - 2.1 GB copied, 619.662 s, 3.4 MB/s
04:02:56 - 2.1 GB copied, 675.043 s, 3.1 MB/s

Client 2:
00:12:27 - 2.1 GB copied, 51.0465 s, 41.1 MB/s
01:13:18 - 2.1 GB copied, 59.8881 s, 35.0 MB/s
02:14:18 - 2.1 GB copied, 817.714 s, 2.6 MB/s
03:27:55 - 2.1 GB copied, 609.257 s, 3.4 MB/s

I have more data points, but the shift is clearly there: somewhere
between 01:51 and 02:14, the transfer rate drops from ~40 MB/s to
~3 MB/s. This is consistent with the findings from a couple of days ago,
where things had slowed down overnight.

Removing and re-adding the loop device (see below) gets us back to
~40 MB/s, without me noticing any error messages or latency due to
flushing. I didn't measure whether any flushing was going on, but
"losetup -d" returned pretty much immediately. All writes are done with
sync, so I don't quite understand where cache flushing comes in.

On one client I had a slow loop device from yesterday and a fresh, new
loop device going at the same time. The old one managed ~3.5 MB/s, while
the new one got ~40 MB/s.
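In case the exact invocation matters, the per-client test loop was
essentially the following (the loop device path is just an example, and
I'm taking oflag=sync as the thing that makes the writes synchronous):

while true; do
    date
    # 2048 x 1MB = the "2.1 GB copied" dd reports above
    dd if=/dev/zero of=/dev/loop0 bs=1M count=2048 oflag=sync
    sleep 3600
done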
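The "remove and re-add" step is nothing more exotic than detaching and
re-attaching the device (the backing file path is an example):

$ losetup -d /dev/loop0                      # detach; returns almost immediately
$ losetup /dev/loop0 /mnt/gluster/disk.img   # re-attach the same backing file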
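To see what the bricks actually receive, I set up a script that cleared
the Gluster profiling stats, did a 60-second dd write, ran "sync", and
then dumped the profiling stats; Gluster was idle except for this
script. Roughly like this (the volume name "myvol" and the timeout trick
are assumptions on my part; profiling must already be enabled with
"gluster volume profile myvol start"):

# reading "info" also resets the interval stats, clearing the counters
gluster volume profile myvol info > /dev/null
# bound the write to roughly 60 seconds
timeout 60 dd if=/dev/zero of=/dev/loop0 bs=1M oflag=sync
sync
gluster volume profile myvol info > profile-after-dd.txt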
Looking at those results, we can clearly see a difference in behavior.

$ ./parse-gluster-info.py slow-and-fast/auto-slow
Brick: storage01.gluster.trollweb.net:/srv/gluster/brick1
Block size   reads   writes
4096b+       0       49152
131072b+     7       0

Brick: storage02.gluster.trollweb.net:/srv/gluster/brick1
Block size   reads   writes
4096b+       0       49152

$ ./parse-gluster-info.py slow-and-fast/auto-fast
Brick: storage01.gluster.trollweb.net:/srv/gluster/brick0
Block size   reads   writes
4096b+       0       106
8192b+       0       68
16384b+      0       305
32768b+      0       2261
65536b+      0       21285
131072b+     0       466

Brick: storage02.gluster.trollweb.net:/srv/gluster/brick0
Block size   reads   writes
4096b+       0       106
8192b+       0       68
16384b+      0       305
32768b+      0       2261
65536b+      0       21285
131072b+     22      466

Note that the slow and the fast run hit different bricks. I made new
loop devices until one was backed by brick1; its transfer rate was as
expected:

$ ./parse-gluster-info.py slow-and-fast/auto-fast-loop3
Brick: storage01.gluster.trollweb.net:/srv/gluster/brick1
Block size   reads   writes
4096b+       0       165
8192b+       0       99
16384b+      0       70
32768b+      0       805
65536b+      0       19086
131072b+     0       851

Brick: storage02.gluster.trollweb.net:/srv/gluster/brick1
Block size   reads   writes
4096b+       0       165
8192b+       0       99
16384b+      0       70
32768b+      0       805
65536b+      0       19086
131072b+     22      851

The script and the gluster profile output can be downloaded at
http://torbjorn-dev.trollweb.net/gluster/.

To me it seems that a fresh loop device does mostly 64KB writes, and at
some point during a 24-hour window it changes to doing 4KB writes?

--
Best regards
Torbjørn Thorsen
Developer / operations technician
Trollweb Solutions AS - Professional Magento Partner
www.trollweb.no
Daytime phone: +47 51215300
Evening/weekend phone: for customers with a service agreement
Visiting address: Luramyrveien 40, 4313 Sandnes
Postal address: Maurholen 57, 4316 Sandnes
Remember that all our standard terms always apply.