glusterfs, striped volume x8, poor sequential read performance, good write performance

Hi,

I have an HPC installation with 8 nodes. Each node has a software RAID1 built from two NL-SAS disks, and the disks from all 8 nodes are combined into a large shared striped 20 TB glusterfs volume, which shows abnormally slow sequential read performance while write performance is good.

Basically, what I see is that the write performance is quite decent, ~500 MB/s (tested using dd):

[root@XXXX bigstor]# dd if=/dev/zero of=test2 bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 186.393 s, 563 MB/s

And all of this is not just sitting in the cache of each node, as I can see the data being flushed to the disks at approximately the right speed.
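
To double-check that the write number is not just page cache, I could also rerun the write test forcing a final flush, along these lines (I have not posted that run here):

dd if=/dev/zero of=test2 bs=1M count=100000 conv=fdatasync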

At the same time, the read performance (tested using dd after dropping the caches beforehand) is really bad:

[root@XXXX bigstor]# dd if=/data/bigstor/test of=/dev/null bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 309.821 s, 33.8 MB/s
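
For reference, the cache drop I do before the read test is along these lines, on the client and on every server node:

sync; echo 3 > /proc/sys/vm/drop_caches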

While this is running, the glusterfs processes take at most ~10-15% of the CPU, so it isn't CPU starvation.

The underlying devices do not seem to be loaded at all:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00   73.00    0.00  9344.00     0.00   256.00     0.11    1.48   1.47  10.70
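
(That line is a snapshot from something like "iostat -x 1" on one of the nodes while the read test was running.)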

To check that the disks are not the problem, I did a separate test of the read speed of the RAID devices on all machines; they read at ~180 MB/s (uncached), so they aren't the problem.
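
That per-node check was roughly of this form, run against the MD device directly on each node:

dd if=/dev/md126 of=/dev/null bs=1M count=10000 iflag=direct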

I also tried to increase the readahead on the RAID devices:
echo 2048 > /sys/block/md126/queue/read_ahead_kb
but that doesn't seem to help at all.
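
The only other knobs I can think of are the gluster-side ones (read-ahead page count and io-cache size); if they are valid for 3.4.x, I assume they would be set with something like

gluster volume set glvol performance.read-ahead-page-count 16
gluster volume set glvol performance.cache-size 256MB

but I have not experimented with those yet.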

Does anyone have any advice on what to do here? What knobs should I adjust? To be honest, it looks like a bug to me, but I would be happy to learn there is a magic switch I forgot to turn on :)

Here are more details about my system:

OS: CentOS 6.5
glusterfs: 3.4.4
Kernel: 2.6.32-431.20.3.el6.x86_64
Mount options and df output:

[root@XXXX bigstor]# cat /etc/mtab

/dev/md126p4 /data/glvol/brick1 xfs rw 0 0
node1:/glvol /data/bigstor fuse.glusterfs  rw,default_permissions,allow_other,max_read=131072 0 0

[root@XXXX bigstor]# df
Filesystem       1K-blocks        Used  Available Use% Mounted on
/dev/md126p4    2516284988  2356820844  159464144  94% /data/glvol/brick1
node1:/glvol   20130279808 18824658688 1305621120  94% /data/bigstor
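
I mount with the default options; if it is worth trying, I could also remount with FUSE direct I/O explicitly disabled, something like

mount -t glusterfs -o direct-io-mode=disable node1:/glvol /data/bigstor

assuming that option is still honoured in 3.4.x.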

brick info:
xfs_info  /data/glvol/brick1
meta-data=/dev/md126p4 isize=512 agcount=4, agsize=157344640 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=629378560, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=307313, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


Here is the gluster info:
[root@XXXXXX bigstor]# gluster
gluster> volume info glvol

Volume Name: glvol
Type: Stripe
Volume ID: 53b2f6ad-46a6-4359-acad-dc5b6687d535
Status: Started
Number of Bricks: 1 x 8 = 8
Transport-type: tcp
Bricks:
Brick1: node1:/data/glvol/brick1/brick
Brick2: node2:/data/glvol/brick1/brick
Brick3: node3:/data/glvol/brick1/brick
Brick4: node4:/data/glvol/brick1/brick
Brick5: node5:/data/glvol/brick1/brick
Brick6: node6:/data/glvol/brick1/brick
Brick7: node7:/data/glvol/brick1/brick
Brick8: node8:/data/glvol/brick1/brick
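
The stripe block size is whatever the default is (128KB, I believe). If that is a plausible culprit, I assume it would be changed with something like

gluster volume set glvol cluster.stripe-block-size 2MB

but I have not touched it so far.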

The network I use is IP over InfiniBand, with very high throughput.
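
I have not included raw network numbers here; if useful, I can measure node-to-node throughput with something like iperf (iperf -s on node1, iperf -c node1 on another node).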

I also saw a discussion of a similar issue here:
http://supercolony.gluster.org/pipermail/gluster-users/2013-February/035560.html
but there it was blamed on ext4.

Thanks in advance,
	Sergey

PS: I also looked at the contents of /var/lib/glusterd/vols/glvol/glvol-fuse.vol and saw the following; I don't know whether it's relevant or not:
volume glvol-client-0
    type protocol/client
    option transport-type tcp
    option remote-subvolume /data/glvol/brick1/brick
    option remote-host node1
end-volume
......

volume glvol-stripe-0
    type cluster/stripe
    subvolumes glvol-client-0 glvol-client-1 glvol-client-2 glvol-client-3 glvol-client-4 glvol-client-5 glvol-client-6 glvol-client-7
end-volume

volume glvol-dht
    type cluster/distribute
    subvolumes glvol-stripe-0
end-volume

volume glvol-write-behind
    type performance/write-behind
    subvolumes glvol-dht
end-volume

volume glvol-read-ahead
    type performance/read-ahead
    subvolumes glvol-write-behind
end-volume

volume glvol-io-cache
    type performance/io-cache
    subvolumes glvol-read-ahead
end-volume

volume glvol-quick-read
    type performance/quick-read
    subvolumes glvol-io-cache
end-volume

volume glvol-open-behind
    type performance/open-behind
    subvolumes glvol-quick-read
end-volume

volume glvol-md-cache
    type performance/md-cache
    subvolumes glvol-open-behind
end-volume

volume glvol
    type debug/io-stats
    option count-fop-hits off
    option latency-measurement off
    subvolumes glvol-md-cache
end-volume
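
If it helps the diagnosis, I can also rerun the read test with the client-side performance translators switched off one at a time, e.g.

gluster volume set glvol performance.quick-read off
gluster volume set glvol performance.io-cache off
gluster volume set glvol performance.read-ahead off

assuming those are the right option names for 3.4.x.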


*****************************************************
Sergey E. Koposov, PhD, Senior Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/



