On Mon, Jun 25, 2018 at 10:01 AM Anh Vo <vtqanh@xxxxxxxxx> wrote:
Is anyone able to help us troubleshoot this issue? It is getting worse. We are back to our 3-replica setup but the problem still happens. What we have found is that performance is very fast if I take just one brick out of each replica set offline. For example, with replica sets (0 1 2) (3 4 5) (6 7 8) (9 10 11), taking bricks 0 3 6 9 offline, or bricks 1 4 7 10 offline, makes everything super fast. The moment all bricks are online again, things become very slow. It seems like gluster is hitting some sort of lock contention between its members. During the period of slowness, gluster profile shows excessive time spent in LOOKUP and FINODELK.
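(For anyone trying to reproduce that test: one rough way to take a single brick out of each replica set offline is to stop that brick's process, then bring the bricks back with a forced volume start. The volume name "gvol" and the PID below are illustrative placeholders, not from the original report.)

    # list bricks and their process IDs for the volume
    gluster volume status gvol

    # on the server hosting the chosen brick, stop its brick process
    kill <brick-pid>

    # later, restart any offline bricks
    gluster volume start gvol force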
Have you checked whether a self-heal is in progress, resyncing data after the bricks come back online? Healing can impact the performance of user applications owing to contention; once the system reaches a steady state, performance should improve.
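(A quick way to see whether heals are still pending, with an illustrative volume name "gvol":)

    # per-brick count of entries that still need healing
    gluster volume heal gvol statistics heal-count

    # detailed list of entries pending heal
    gluster volume heal gvol info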
 %-latency   Avg-latency   Min-Latency     Max-Latency   No. of calls        Fop
     11.60     752.64 us      10.00 us    2647757.00 us      272476323     LOOKUP
     15.83    6884.12 us      29.00 us    2190470.00 us       40626259      WRITE
     27.84   80480.22 us      40.00 us   11731910.00 us        6114072   FXATTROP
     37.83  105125.18 us      12.00 us  276088722.00 us        6359515   FINODELK

We have about one or two months before we need to decide whether to keep Gluster, and so far it has been a lot of headache.
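(For reference, output in the format of the table above comes from gluster's volume profiling; a minimal sequence to collect it, again with the illustrative volume name "gvol", is:)

    # start gathering per-FOP latency statistics for the volume
    gluster volume profile gvol start

    # print the cumulative and interval statistics
    gluster volume profile gvol info

    # stop profiling when done to avoid the extra overhead
    gluster volume profile gvol stop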
Detailed bug reports, RFEs on GitHub, and/or patches that can help Gluster work better for your use case are welcome!
Thanks,
Vijay
On Thu, Jun 14, 2018 at 10:18 AM, Anh Vo <vtqanh@xxxxxxxxx> wrote:

Our gluster keeps getting into a state where it becomes painfully slow and many of our applications time out on read/write calls. When this happens, a simple ls at the top-level directory from the mount takes somewhere between 8-25s (normally it is very fast, at most 1-2s). The top-level directory only has about 10 folders.

The two methods that mitigate this problem have been 1) restarting all GFS servers or 2) stopping and starting the volume. With 2) it takes somewhere between half an hour and an hour for gluster to get back to its desired performance.

So far the logs don't show anything unusual, but perhaps I don't know what I should be looking for in them. Even when gluster is fully functional we see lots of log messages, and it is hard to tell which errors are harmless and which are not.

This issue does not seem to happen with our 3-replica glusters, only with 2-replica-1-arbiter and 2-replica. However, our 3-replica glusters are only 30% full while the 2-replica ones are about 80% full.

We're running 3.12.9 on the servers. The clients are 3.8.15, but we notice the slowness of operations on 3.12.9 clients as well.

Configuration: 12 GFS servers, one brick per server, replica 2, 80T per brick. We used to have arbiters but thought the arbiters were causing the slowdown, so we took them out. Apparently that was not the case.
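(For completeness, mitigation 2 above expressed as commands, assuming a volume named "gvol"; note that the stop is disruptive to mounted clients:)

    # stop and start the volume
    gluster volume stop gvol
    gluster volume start gvol

    # confirm that all bricks and self-heal daemons are back online
    gluster volume status gvol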
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users