----- Original Message -----
> Hello List,
>
> So I've been frustrated by intermittent performance problems throughout
> January. The problem occurs on a two node setup running 3.4.5, with 16
> gigs of RAM and a bunch of local disk. Sometimes for an hour, sometimes
> for weeks at a time (I have extensive graphs in OpenNMS), our Gluster
> boxes will get their CPUs pegged, and in vmstat they'll show extremely
> high numbers of context switches and interrupts. Eventually things calm
> down. During this time, memory usage actually drops: overall usage on
> the box goes from between 6-10 gigs to right around 4 gigs, and stays
> there. That's what really puzzles me.
>
> When performance is problematic, sar shows one device, the device
> corresponding to the glusterfsd process, using all the CPU doing lots of
> little reads: sometimes 70k/second, with a very small average request
> size, say 10-12. Afraid I don't have any saved output handy, but I can
> try to capture some the next time it happens. Frankly I have tons of
> information, but am trying to keep this reasonably brief.
>
> There are more than a dozen volumes on this two node setup. The CPU
> usage is pretty much entirely contained to one volume, a 1.5 TB volume
> that is just shy of 70% full. It stores uploaded files for a web app.
> What I hate about this app, and why I'm always suspicious of it, is that
> it stores a directory for every user at a single level, so under the
> /data directory in the volume there are 450,000 subdirectories at this
> point.
>
> The only real mitigation step that's been taken so far was to turn off
> the self-heal daemon on the volume, as I thought maybe crawling that
> large directory was getting expensive. This doesn't seem to have done
> anything, as the problem still occurs.
>
> At this point I figure, broadly, one of two sorts of things is
> happening: one, we're running into some sort of bug or performance
> problem with Gluster that we should fix, perhaps by upgrading or by
> tuning around it; or two, some process we're running but aren't aware of
> is hammering the file system and causing problems.
>
> If it's the latter, can anyone give me any tips on figuring out what
> might be hammering the system? I can use volume top to see what a brick
> is doing, but I can't figure out how to tell which clients are doing
> what.
>
> Apologies for the somewhat broad nature of the question; any input or
> thoughts would be much appreciated. I can certainly provide more info
> about some things if it would help, but I've tried not to write a novel
> here.

Out of curiosity, are you able to test using GlusterFS 3.6.2?

We've had a bunch of pretty in-depth upstream testing at decent scale
(100+ nodes) from 3.5.x onwards, with lots of performance issues
identified and fixed along the way. So, I'm kinda hopeful the problem
you're describing is fixed in newer releases. :D

Regards and best wishes,

Justin Clift

--
GlusterFS - http://www.gluster.org

An open source, distributed file system scaling to several petabytes,
and handling thousands of clients.

My personal twitter: twitter.com/realjustinclift
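
On the point above about capturing output the next time it happens, a minimal
sketch of what tends to be useful, assuming sysstat is installed on the brick
nodes (the output filenames are just placeholders):

    # one-second samples for a minute while the CPUs are pegged
    sar -d -p 1 60 > sar-devices.txt    # per-device tps and average request size
    vmstat 1 60 > vmstat.txt            # context switches and interrupts over time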
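
For reference, the self-heal mitigation described above is normally a single
volume option on replicated volumes; a sketch, assuming the standard 3.4.x
option name and with VOLNAME as a placeholder for the affected volume:

    # stop the self-heal daemon from crawling this volume
    gluster volume set VOLNAME cluster.self-heal-daemon off

    # confirm the option took effect
    gluster volume info VOLNAME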
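
On the "which clients are doing what" question, one possible starting point
using only the tooling that ships with 3.4.x is sketched below; VOLNAME is a
placeholder for the busy volume, and the exact output columns vary between
releases:

    # connected clients per brick, with bytes read/written per connection
    gluster volume status VOLNAME clients

    # per-brick FOP counts and latencies, sampled while the problem is occurring
    gluster volume profile VOLNAME start
    gluster volume profile VOLNAME info
    gluster volume profile VOLNAME stop

    # most-read files on the bricks (top also has open/write/readdir variants)
    gluster volume top VOLNAME read list-cnt 10

Comparing the client list against the web app hosts may help rule the app in
or out as the source of the small-read load.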