I had not, though I had searched for something like this for a good bit
yesterday (?!). Back to Google class for me... Thanks very much!

hjm

On Thu, Jul 26, 2012 at 8:07 AM, John Mark Walker <johnmark at redhat.com> wrote:

> Harry,
>
> Have you seen this post?
>
> http://community.gluster.org/a/linux-kernel-tuning-for-glusterfs/
>
> Be sure to read all the comments, as Ben England chimes in there, and
> he's one of the performance engineers at Red Hat.
>
> -JM
>
> ----- Harry Mangalam <hjmangalam at gmail.com> wrote:
>
>> This is a continuation of my previous posts about improving write
>> performance when trapping millions of small writes to a gluster
>> filesystem. I was able to improve write performance by ~30x by
>> running STDOUT through gzip to consolidate and reduce the output
>> stream.
>>
>> Today, another similar problem, having to do with yet another
>> bioinformatics program (these days such programs typically handle
>> the 'short reads' that come out of the majority of sequencing
>> hardware, each read being 30-150 characters, with some metadata,
>> in an ASCII file containing millions of such entries). Reading
>> them doesn't seem to be a problem (at least on our systems), but
>> writing them is quite awful.
>>
>> The program is called 'art_illumina', from the Broad Institute's
>> 'ALLPATHS' suite, and it generates an artificial Illumina data set
>> from an input genome; in this case, about 5GB of the type of data
>> described above. Like before, the gluster process goes to >100%
>> CPU and the program itself slows to ~20-30% of a CPU. In this
>> case, the app's output cannot be externally trapped by redirecting
>> through gzip, since the output flag specifies the base filename
>> for 2 files that are created internally and then written directly.
>> This prevents even setting up a named pipe to trap and process the
>> output.
>>
>> Since this gluster storage was set up specifically for
>> bioinformatics, this is a recurring problem, and while some of the
>> issues can be dealt with by trapping and converting output, it
>> would be VERY NICE if we could deal with it at the OS level.
>>
>> The gluster volume is running over IPoIB on QDR IB and looks like
>> this:
>>
>> Volume Name: gl
>> Type: Distribute
>> Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
>> Status: Started
>> Number of Bricks: 8
>> Transport-type: tcp,rdma
>> Bricks:
>> Brick1: bs2:/raid1
>> Brick2: bs2:/raid2
>> Brick3: bs3:/raid1
>> Brick4: bs3:/raid2
>> Brick5: bs4:/raid1
>> Brick6: bs4:/raid2
>> Brick7: bs1:/raid1
>> Brick8: bs1:/raid2
>> Options Reconfigured:
>> performance.write-behind-window-size: 1024MB
>> performance.flush-behind: on
>> performance.cache-size: 268435456
>> nfs.disable: on
>> performance.io-cache: on
>> performance.quick-read: on
>> performance.io-thread-count: 64
>> auth.allow: 10.2.*.*,10.1.*.*
>>
>> I've tried to increase every caching option that might improve
>> this kind of performance, but it doesn't seem to help. At this
>> point, I'm wondering whether changing the client (or server)
>> kernel parameters will help.
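>>
>> For concreteness, the sort of client-side writeback tuning I have
>> in mind is something like the sketch below; the values are
>> illustrative guesses for this workload, not tested
>> recommendations:
>>
>> # Flush dirty pages in smaller, steadier batches rather than
>> # letting them pile up; with ~512GB of RAM on the client, the
>> # default percentages allow an enormous dirty set to accumulate.
>> sysctl -w vm.dirty_background_ratio=1
>> sysctl -w vm.dirty_ratio=5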
>>
>> The client's meminfo is:
>>
>> $ cat /proc/meminfo
>> MemTotal: 529425924 kB
>> MemFree: 241833188 kB
>> Buffers: 355248 kB
>> Cached: 279699444 kB
>> SwapCached: 0 kB
>> Active: 2241580 kB
>> Inactive: 278287248 kB
>> Active(anon): 190988 kB
>> Inactive(anon): 287952 kB
>> Active(file): 2050592 kB
>> Inactive(file): 277999296 kB
>> Unevictable: 16856 kB
>> Mlocked: 16856 kB
>> SwapTotal: 563198732 kB
>> SwapFree: 563198732 kB
>> Dirty: 1656 kB
>> Writeback: 0 kB
>> AnonPages: 486876 kB
>> Mapped: 19808 kB
>> Shmem: 164 kB
>> Slab: 1475476 kB
>> SReclaimable: 1205944 kB
>> SUnreclaim: 269532 kB
>> KernelStack: 5928 kB
>> PageTables: 27312 kB
>> NFS_Unstable: 0 kB
>> Bounce: 0 kB
>> WritebackTmp: 0 kB
>> CommitLimit: 827911692 kB
>> Committed_AS: 536852 kB
>> VmallocTotal: 34359738367 kB
>> VmallocUsed: 1227732 kB
>> VmallocChunk: 33888774404 kB
>> HardwareCorrupted: 0 kB
>> AnonHugePages: 376832 kB
>> HugePages_Total: 0
>> HugePages_Free: 0
>> HugePages_Rsvd: 0
>> HugePages_Surp: 0
>> Hugepagesize: 2048 kB
>> DirectMap4k: 201088 kB
>> DirectMap2M: 15509504 kB
>> DirectMap1G: 521142272 kB
>>
>> and the server's meminfo is:
>>
>> $ cat /proc/meminfo
>> MemTotal: 32861400 kB
>> MemFree: 1232172 kB
>> Buffers: 29116 kB
>> Cached: 30017272 kB
>> SwapCached: 44 kB
>> Active: 18840852 kB
>> Inactive: 11772428 kB
>> Active(anon): 492928 kB
>> Inactive(anon): 75264 kB
>> Active(file): 18347924 kB
>> Inactive(file): 11697164 kB
>> Unevictable: 0 kB
>> Mlocked: 0 kB
>> SwapTotal: 16382900 kB
>> SwapFree: 16382680 kB
>> Dirty: 8 kB
>> Writeback: 0 kB
>> AnonPages: 566876 kB
>> Mapped: 14212 kB
>> Shmem: 1276 kB
>> Slab: 429164 kB
>> SReclaimable: 324752 kB
>> SUnreclaim: 104412 kB
>> KernelStack: 3528 kB
>> PageTables: 16956 kB
>> NFS_Unstable: 0 kB
>> Bounce: 0 kB
>> WritebackTmp: 0 kB
>> CommitLimit: 32813600 kB
>> Committed_AS: 3053096 kB
>> VmallocTotal: 34359738367 kB
>> VmallocUsed: 340196 kB
>> VmallocChunk: 34342345980 kB
>> HardwareCorrupted: 0 kB
>> AnonHugePages: 200704 kB
>> HugePages_Total: 0
>> HugePages_Free: 0
>> HugePages_Rsvd: 0
>> HugePages_Surp: 0
>> Hugepagesize: 2048 kB
>> DirectMap4k: 6656 kB
>> DirectMap2M: 2072576 kB
>> DirectMap1G: 31457280 kB
>>
>> Does this suggest any approach? Is there a doc that suggests
>> optimal kernel parameters for gluster?
>>
>> I guess the only other option is to use the glusterfs as an NFS
>> mount and use the NFS client's caching...? That would help a
>> single process but decrease the overall cluster bandwidth
>> considerably.
>>
>> --
>> Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
>> [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
>> 415 South Circle View Dr, Irvine, CA, 92697 [shipping]
>> MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users

--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
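P.S. For reference, a rough sketch of the NFS-mount fallback mentioned
above. Gluster's built-in NFS server speaks NFSv3 over TCP, and since
nfs.disable is currently 'on' for the gl volume, it would have to be
re-enabled first; the mount point below is illustrative:

  # re-enable the volume's built-in NFS server, then mount via NFSv3
  gluster volume set gl nfs.disable off
  mount -t nfs -o vers=3,proto=tcp bs1:/gl /mnt/gl-nfs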