On 12/13/2013 02:00 PM, Alex Chekholko wrote:
My best guess is that you overloaded your interconnect. Do you have
metrics for if/when your network was saturated? That would cause
Gluster clients to time out.
My best guess is that you went into the "E" state of your "USE
(Utilization, Saturation, Error)" spectrum.
IME, that is a common pattern for our Lustre/GPFS clients: you get all
kinds of weird error states if you manage to saturate your I/O for an
extended period of time and fill all of the buffers everywhere.
When we tried to roll out GlusterFS for a production environment a few
years ago, we ran into exactly this problem. Our scenario was a
multi-master cluster, and the worst part appeared to be log files. Any
time a host wrote to a log file, it had to synchronize that file. And
since there were multiple masters, this very quickly clogged our
interconnect and ended things.
We ended up rolling back GlusterFS for this purpose and moved to a
distributed, asynchronous logging system, built in-house, that used Linux
kernel message queues, with the understanding that replicated log files
would see a small amount of jitter and out-of-order appearance between
hosts. While this may sound irreverent, all log entries have a
timestamp anyway, so it's all good and has worked well for us.
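To illustrate why timestamps make out-of-order arrival tolerable, here is
a minimal Python sketch. It is not the in-house system described above:
the line format ("<ISO-8601 timestamp> <host> <message>") and the names
LogEntry, parse_line and merge_by_timestamp are assumptions made up for
the example, and it does not touch kernel message queues at all. It only
shows that entries received in arbitrary order can be put back in order
after the fact.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple


class LogEntry(NamedTuple):
    """One replicated log record (hypothetical layout)."""
    timestamp: datetime
    host: str
    message: str


def parse_line(line: str) -> LogEntry:
    # Assumed format: "<ISO-8601 timestamp> <host> <message>"
    stamp, host, message = line.rstrip("\n").split(" ", 2)
    return LogEntry(datetime.fromisoformat(stamp), host, message)


def merge_by_timestamp(lines: Iterable[str]) -> List[LogEntry]:
    # Sort only on the timestamp. sorted() is stable, so entries that
    # share a timestamp keep the order in which they arrived.
    return sorted((parse_line(line) for line in lines),
                  key=lambda entry: entry.timestamp)


if __name__ == "__main__":
    arrived = [
        "2013-12-13T14:00:02 hostB disk check finished",
        "2013-12-13T14:00:01 hostA disk check started",
    ]
    for entry in merge_by_timestamp(arrived):
        print(entry.timestamp.isoformat(), entry.host, entry.message)
```

This only recovers a sensible order if the hosts' clocks are reasonably
well synchronized; any clock skew shows up as residual misordering.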
It may be that this has been fixed recently, but it's a use case I
thought might warrant consideration.
-Ben