Hello List,
When performance is problematic, sar shows one device (the one backing the problem glusterfsd) using all the CPU and doing lots of small reads, sometimes 70k/second, with a very small average request size, say 10-12 sectors. Afraid I don't have any saved output handy, but I can try to capture some next time it happens. I have tons of information, frankly, but am trying to keep this reasonably brief.
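For what it's worth, next time it happens the capture I plan to run is just plain per-device sampling, something like (assuming our sysstat is new enough to support -p for pretty device names):

    # report block device activity once a second for 30 seconds
    sar -d -p 1 30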
There are more than a dozen volumes on this two-node setup. The CPU usage is pretty much entirely contained to one volume, a 1.5 TB volume that is just shy of 70% full. It stores uploaded files for a web app. What I hate about this app, and why I'm always suspicious of it, is that it stores a directory per user at a single level: under the /data directory in the volume there are 450,000 subdirectories at this point.
The only real mitigation step taken so far was to turn off the self-heal daemon on the volume, since I thought crawling that huge directory might be getting expensive. This doesn't seem to have helped; the problem still occurs.
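For reference, the change was just the standard volume option, i.e. something along these lines (volume name hypothetical):

    gluster volume set uploads cluster.self-heal-daemon off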
At this point I figure, broadly, one of two things is happening: either (1) we're running into some sort of gluster bug or performance problem that we should fix, perhaps by upgrading or tuning around it, or (2) some process we're running but aren't aware of is hammering the file system and causing the problem.
If it's the latter, can anyone give me tips on figuring out what might be hammering the system? I can use volume top to see what a brick is doing, but I can't figure out how to tell which clients are doing what.
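What I've been running so far is along these lines (volume and brick names hypothetical):

    # top files by read count on one brick
    gluster volume top uploads read brick server1:/export/brick1 list-cnt 10

    # list connected clients, which at least narrows down where to look
    gluster volume status uploads clients

That gets me a per-brick view and a client list, but nothing that ties specific operations back to specific clients, which is the part I'm missing.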
Apologies for the somewhat broad nature of the question; any thoughts or input would be much appreciated. I can certainly provide more info if it would help, but I've tried not to write a novel here.
Thanks,
-Matt