This is really good to know, as we've started to receive interest from a
fair number of scientists in your world. Could you do me a favor and
write this up in a Q&A format at http://community.gluster.org/ ? -JM

----- Harry Mangalam <hjmangalam at gmail.com> wrote:

> The problem described in the subject appears NOT to be the case. It's
> not that simultaneous reads and writes dramatically decrease
> performance, but that the type of /writes/ done by this app (bedtools)
> kills performance. If this were a self-written or infrequently used
> app, I wouldn't bother writing this up, but bedtools is a fairly
> popular genomics app, and since many installations use gluster to host
> Next-Gen sequencing data and analysis, I thought I'd follow up on my
> own post.
>
> The short version:
> =============
> Insert gzip to compress and stream the data before sending it to the
> gluster fs. The improvement in IO (and application) performance is
> dramatic.
>
> i.e. (all files on a gluster fs):
>
> genomeCoverageBed -ibam RS_11261.bam -g \
>   ref/dmel-all-chromosome-r5.1.fasta -d | gzip > output.cov.gz
>
> Inserting the '| gzip' increased the app speed by more than 30X
> relative to not using it on a gluster fs (it even improved the wall
> clock speed of the app by about 1/3 relative to running on a local
> filesystem), decreased the gluster CPU utilization by ~99%, and
> reduced the output size by 80%. So, wins all round.
>
>
> The long version:
> ============
> The type of writes that bedtools does is also fairly common: lots of
> writes of tiny amounts of data.
>
> As I understand it (which may be wrong; please correct me), the
> gluster native client (which we're using) does not buffer IO as well
> as the NFS client, which is why we frequently see complaints about
> gluster vs NFS performance.
> The apparent problem for bedtools is that these zillions of tiny
> writes are being handled separately, or at least not cached well
> enough to be consolidated into large writes.
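The effect described above can be sketched with a generic pipeline. This is a minimal illustration, not the original benchmark: 'seq' is a hypothetical stand-in for genomeCoverageBed (any program that emits many tiny lines on stdout behaves the same way), and the output file name is illustrative.

```shell
# Sketch of the workaround, assuming a producer that makes zillions of
# tiny writes to stdout ('seq' stands in for genomeCoverageBed here).
# gzip buffers stdin and emits large compressed blocks, so the
# filesystem underneath sees a few big writes instead of a flood of
# tiny ones -- and the output shrinks as a bonus.

seq 1 100000 | gzip > output.cov.gz

# Verify the stream round-trips intact:
gzip -dc output.cov.gz | tail -n 1
```

The same structure applies to any pipeline on a gluster mount: anything that reads STDIN, buffers, and writes STDOUT (even 'cat') interposes a consolidation stage between the producer's small writes and the filesystem.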
> To present the data to gluster as a continuous stream instead of
> these tiny writes, they have to be 'converted' to such a stream. gzip
> is a nice solution because it compresses as it converts. Apparently
> anything that takes STDIN, buffers it appropriately, and then spits
> it out on STDOUT will work. Even piping the data through 'cat' allows
> bedtools to continue to run at 100%, though it increases the gluster
> CPU utilization to >90%. 'cat' of course uses less CPU (~14%) while
> gzip uses more (~60%), though it decreases gluster's use enormously.
>
> I did try the performance options I mentioned earlier:
>
> performance.write-behind-window-size: 1024MB
> performance.flush-behind: on
>
> They did not seem to help at all, and I'd still like an explanation
> of what they're supposed to do.
>
> The upshot is that this seems like, if not a bug, then at least an
> opportunity to improve gluster performance considerably.
>
> --
> Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
> [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
> 415 South Circle View Dr, Irvine, CA, 92697 [shipping]
> MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
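For readers wanting to try the two volume options quoted in the thread, they are set per volume with the gluster CLI. A sketch, assuming a volume named 'gv0' (the volume name is a placeholder; the option names come from the post above):

```shell
# Apply the write-behind options discussed above to a volume.
# 'gv0' is a hypothetical volume name -- substitute your own.
gluster volume set gv0 performance.write-behind-window-size 1024MB
gluster volume set gv0 performance.flush-behind on

# Inspect the volume's current settings:
gluster volume info gv0
```

These commands require a running gluster cluster, and as the post reports, the options did not visibly help this tiny-write workload.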