With FIO, the raw write speed to the EBS volume looks like this:

test: (g=0): rw=write, bs=128K-128K/128K-128K, ioengine=sync, iodepth=8
Starting 1 process
Jobs: 1 (f=1): [W] [100.0% done] [0K/43124K /s] [0 /329 iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=6406
  write: io=1024.0MB, bw=37118KB/s, iops=289 , runt= 28250msec
    clat (usec): min=58 , max=2222 , avg=78.20, stdev=25.17
     lat (usec): min=59 , max=2223 , avg=78.89, stdev=25.19
    bw (KB/s) : min= 7828, max=60416, per=104.72%, avg=38870.65, stdev=10659.43

That is an average bandwidth of about 38.8 MB/s and an average completion latency of about 78 microseconds per IO request.
Since the FUSE module uses the same block size (128 KB) that I configured in FIO, I would expect the bandwidth of the 2x2 replica to be around 15 MB/s or more when one client writes a 1 GB file.
Currently, Gluster can go up to 22 MB/s without replication, with one client. But with the distributed replica 2x2 on 4 machines, the number for one client writing a 1 GB file goes down to 6.5 MB/s - that is the part I don't understand.
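For reference, a fio job along these lines reproduces the parameters shown in that output; the size and directory are only placeholders (a 1 GB file written to wherever the EBS volume is mounted):

; sketch of the fio job matching the output above
[test]
rw=write
bs=128k
ioengine=sync
iodepth=8
; 1 GB file, matching io=1024.0MB in the output
size=1g
; placeholder: the EBS mount point
directory=/mnt/ebs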
> I also suggest calculating network latency.

I measured the individual latencies to the server machines here:

dfs01: 402 microseconds
dfs02: 322 microseconds
dfs03: 445 microseconds
dfs04: 378 microseconds

I guess you mean something else - the cumulative latency of a set of nodes? In that case, how do I calculate it?
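My own naive attempt would be a per-block estimate, assuming each 128 KB block has to be sent to both replicas over the client's link (roughly 50 MB/s in the measurements below) before the next block goes out:

  transfer of 128 KB to 2 replicas at ~50 MB/s:  2 x ~2.5 ms = ~5 ms
  network latency + EBS completion latency:      ~0.4 ms + ~0.08 ms
  => ~5.5 ms per 128 KB block => ~23 MB/s upper bound

Even that rough upper bound is well above the 6.5 MB/s I actually measure. Is this the kind of calculation you had in mind?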
Karol

On Wed, Mar 23, 2011 at 5:56 PM, Mohit Anchlia <mohitanchlia at gmail.com> wrote:
> What were you really expecting the numbers to be? What numbers do you get
> when you write directly to the ext3 file system, bypassing GFS?
>
> I also suggest calculating network latency.
>
> On Wed, Mar 23, 2011 at 4:17 AM, karol skocik <karol.skocik at gmail.com> wrote:
>> I see my email to the list was truncated - sending it again.
>>
>> Hi,
>>  here are the measurements - the client machine is KS, and the server
>> machines are DFS0[1-4].
>> First, the setup now is:
>>
>> Volume Name: EBSOne
>> Type: Distribute
>> Status: Started
>> Number of Bricks: 1
>> Transport-type: tcp
>> Bricks:
>> Brick1: dfs01:/mnt/ebs
>>
>> With just one client machine writing a 1 GB file to EBSOne, averaged over 3 runs:
>>
>> Bandwidth (mean): 22441.84 KB/s
>> Bandwidth (deviation): 6059.24 KB/s
>> Completion latency (mean): 1274.47 usec
>> Completion latency (deviation): 1814.58 usec
>>
>> Now, the latencies:
>>
>> From KS (the client machine) to the DFS server machines, averages of 3 runs.
>>
>> Latencies:
>> dfs01: 402 microseconds
>> dfs02: 322 microseconds
>> dfs03: 445 microseconds
>> dfs04: 378 microseconds
>>
>> Bandwidths:
>> dfs01: 54 MB/s
>> dfs02: 62.5 MB/s
>> dfs03: 64 MB/s
>> dfs04: 91.5 MB/s
>>
>> Every server machine has just 1 EBS drive, an ext3 filesystem,
>> kernel 2.6.18-xenU-ec2-v1.0, and the CFQ IO scheduler.
>>
>> Any ideas? Given the numbers above, does it make sense to try
>> software RAID0 with mdadm, or perhaps another filesystem?
>>
>> Thank you for the help.
>> Regards, Karol
>>
>> On Wed, Mar 23, 2011 at 11:31 AM, karol skocik <karol.skocik at gmail.com> wrote:
>>> Hi,
>>>  here are the measurements - the client machine is KS, and the server
>>> machines are DFS0[1-4].
>>> First, the setup now is:
>>>
>>> Volume Name: EBSOne
>>> Type: Distribute
>>> Status: Started
>>> Number of Bricks: 1
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: dfs01:/mnt/ebs
>>>
>>> With just one client machine writing a 1 GB file to EBSOne, averaged over 3 runs:
>>>
>>> Bandwidth (mean): 22441.84 KB/s
>>> Bandwidth (deviation): 6059.24 KB/s
>>> Completion latency (mean): 1274.47 usec
>>> Completion latency (deviation): 1814.58 usec
>>>
>>> Now, the latencies:
>>>
>>> From KS (the client machine) to the DFS server machines, averages of 3 runs.
>>>
>>> Latencies:
>>> dfs01: 402 microseconds
>>> dfs02: 322 microseconds
>>> dfs03: 445 microseconds
>>> dfs04: 378 microseconds
>>>
>>> Bandwidths:
>>> dfs01: 54 MB/s
>>> dfs02: 62.5 MB/s
>>> dfs03: 64 MB/s
>>> dfs04: 91.5 MB/s
>>>
>>> Every server machine has just 1 EBS drive, an ext3 filesystem,
>>> kernel 2.6.18-xenU-ec2-v1.0, and the CFQ IO scheduler.
>>>
>>> Any ideas? Given the numbers above, does it make sense to try
>>> software RAID0 with mdadm, or perhaps another filesystem?
>>>
>>> Thank you for the help.
>>> Regards, Karol
>>>
>>> On Tue, Mar 22, 2011 at 6:08 PM, Mohit Anchlia <mohitanchlia at gmail.com> wrote:
>>>> Can you first run some tests with no replica and see what results you
>>>> get? Also, can you look at the network latency from the client to each of your
>>>> 4 servers and post the results?
>>>>
>>>> On Mon, Mar 21, 2011 at 1:27 AM, karol skocik <karol.skocik at gmail.com> wrote:
>>>>> Hi,
>>>>>  I am in the process of evaluating Gluster for a major BI company,
>>>>> but I was surprised by the very low write performance on Amazon EBS.
>>>>> Our setup is Gluster 3.1.2, a distributed replica 2x2 on 64-bit m1.large
>>>>> instances. Every server node has 1 EBS volume attached to it.
>>>>> The configuration of the distributed replica is the default one, apart from
>>>>> my small attempts to improve performance (io-threads, disabled io-stats
>>>>> and latency measurement):
>>>>>
>>>>> volume EBSVolume-posix
>>>>>     type storage/posix
>>>>>     option directory /mnt/ebs
>>>>> end-volume
>>>>>
>>>>> volume EBSVolume-access-control
>>>>>     type features/access-control
>>>>>     subvolumes EBSVolume-posix
>>>>> end-volume
>>>>>
>>>>> volume EBSVolume-locks
>>>>>     type features/locks
>>>>>     subvolumes EBSVolume-access-control
>>>>> end-volume
>>>>>
>>>>> volume EBSVolume-io-threads
>>>>>     type performance/io-threads
>>>>>     option thread-count 4
>>>>>     subvolumes EBSVolume-locks
>>>>> end-volume
>>>>>
>>>>> volume /mnt/ebs
>>>>>     type debug/io-stats
>>>>>     option log-level NONE
>>>>>     option latency-measurement off
>>>>>     subvolumes EBSVolume-io-threads
>>>>> end-volume
>>>>>
>>>>> volume EBSVolume-server
>>>>>     type protocol/server
>>>>>     option transport-type tcp
>>>>>     option auth.addr./mnt/ebs.allow *
>>>>>     subvolumes /mnt/ebs
>>>>> end-volume
>>>>>
>>>>> In our test, all clients start writing to different 1 GB files at the same time.
>>>>> The measured write bandwidth, with 2x2 servers:
>>>>>
>>>>> 1 client: 6.5 MB/s
>>>>> 2 clients: 4.1 MB/s
>>>>> 3 clients: 2.4 MB/s
>>>>> 4 clients: 4.3 MB/s
>>>>>
>>>>> This is not acceptable for our needs. With PVFS2 (I know it uses striping,
>>>>> which is very different from replication) we can get up to 35 MB/s.
>>>>> 2-3 times slower than that would be understandable, but 5-15 times
>>>>> slower is not, and I would like to know whether there is something we
>>>>> could try out.
>>>>>
>>>>> Could anybody publish their write speeds on a similar setup, and tips on
>>>>> how to achieve better performance?
>>>>>
>>>>> Thank you,
>>>>>  Karol
>>>>> _______________________________________________
>>>>> Gluster-users mailing list
>>>>> Gluster-users at gluster.org
>>>>> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
>>>>>
>>>>
>>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
>>
>