This is a long-standing problem for me, and I’m wondering how to insulate myself from it…pardon the long-windedness in advance.
I use gluster internationally for regional file repositories, and they are being rsync’d to pretty much constantly (i.e., written to solely by rsync, optimized with --inplace or similar).
These regional repositories are also being read from, each to the tune of 10-50MB/s. Each gluster pool has anywhere from 4 to 16 servers, each with one RAID6 brick, and all pools are in a distributed-only config. I’m not currently using distributed-replicated, but even that configuration is not immune to my problem.
So, here’s the problem:
If one disk in one gluster brick experiences timeouts, all the gluster clients block. This is likely because the rate at which the disks are being exercised by rsyncs (writes and stats) plus reads (client file access) creates an overwhelming backlog of gluster ops; perhaps something is bottlenecked and locking up. Either way, the volume is fairly useless to me in that state: even running a ‘df’ hangs completely.
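For what it's worth, because a plain ‘df’ hangs along with everything else, any check for this state must avoid blocking on the mount itself. Below is a rough Python sketch of one way to do that: it runs the statvfs call in a child process and gives up after a timeout. The function name mount_responds and the 5-second default are made up for illustration, and a Linux client with Python 3 is assumed.

```python
import subprocess
import sys


def mount_responds(path, timeout=5.0):
    """Return True if a statvfs() on `path` completes within `timeout` seconds.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    # Do the statvfs in a child process, so a hung mount blocks the child
    # rather than the monitoring script itself.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.statvfs(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit code 0 means statvfs returned normally.
        return child.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Deliberately no wait() after kill(): a child stuck in
        # uninterruptible I/O may not exit until the mount recovers.
        child.kill()
        return False
```

Calling mount_responds("/mnt/gluster") from cron or a monitoring agent (the mount path here is only an example) would report the hang without hanging the checker. It only detects the condition; it does nothing to shorten it.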
This has been an issue for me for years. My usual procedure is to manually fail the disk that’s experiencing timeouts, if it hasn’t been ejected already by the raid controller, and remove the load from the gluster file system. It only takes a fraction of a minute before the gluster volume recovers and I can add the load back. Rebuilding parity on the brick’s raid is not the problem; it’s the moments before the disk ultimately fails, when requests back up, that really cause problems.
I’m looking for advice as to how to insulate myself from this problem better. My RAID cards don’t support setting very short disk timeouts. I can see disk timeout messages from the raid card, and could write an rsyslog omprog script to fail the disk, but that’s kinda brutal. Maybe I could get a different raid card that supports shorter timeouts or fast disk failures, but if anyone has experience with, say, md raid1 not having this problem, or something similar, it might be worth the expense to go that route.
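To make the omprog idea a bit more concrete, here is a minimal Python sketch, not a tested production script. rsyslog's omprog module feeds matching log messages to a program's stdin, one per line; this script counts timeout messages per disk and only acts once a disk has produced several within a short window, which is slightly less brutal than failing on the first message. The log pattern, the thresholds, and the fail_disk action are all assumptions: real controller messages and the command to fail a disk depend on the RAID card and its CLI.

```python
import re
import sys
import time
from collections import defaultdict, deque

# Hypothetical log format; adjust to whatever the RAID controller actually logs.
TIMEOUT_RE = re.compile(r"command timeout on PD (\S+)", re.IGNORECASE)


def parse_disk(line):
    """Return the disk identifier from a timeout message, or None."""
    match = TIMEOUT_RE.search(line)
    return match.group(1) if match else None


class TimeoutTracker:
    """Sliding-window counter of timeout events per disk."""

    def __init__(self, threshold=3, window=60.0):
        self.threshold = threshold  # events needed before acting (assumed value)
        self.window = window        # seconds of history to keep (assumed value)
        self.events = defaultdict(deque)

    def record(self, disk, now):
        """Record one timeout; return True once the disk reaches the threshold."""
        recent = self.events[disk]
        recent.append(now)
        # Drop events that have aged out of the window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        return len(recent) >= self.threshold


def fail_disk(disk):
    """Placeholder action: only logs. Replace with the controller's own CLI."""
    print(f"would fail disk {disk}", file=sys.stderr)


def main(stream=sys.stdin):
    tracker = TimeoutTracker()
    already_failed = set()
    for line in stream:  # omprog delivers one message per line
        disk = parse_disk(line)
        if disk is None or disk in already_failed:
            continue
        if tracker.record(disk, time.monotonic()):
            fail_disk(disk)
            already_failed.add(disk)
```

To use it as the omprog binary, the script would end with a call to main(). Whether three timeouts in sixty seconds is the right trigger is a guess; the point is only that the trigger can be made tunable rather than all-or-nothing.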
If my memory is correct, gluster still has this problem with a distributed-replicated configuration, because writes need to succeed on both replicas before an operation is considered complete, so a timeout on one node is still detrimental.
Insight, experience designing around this, tunables I haven’t considered: I’ll take anything. I really like gluster, and I’ll keep using it, but this is its Achilles’ heel for me. Is there a magic bullet? Or do I just need to fail faster?