Re: File operation failure on simple distributed volume

Hi Rafi

Sorry for the late reply. Although I ultimately could not reproduce the
problem outside of the production environment, I will be able to run the
debug build as part of production as long as it does not cause a
performance issue. Could you give me a guide to running the debug build?
Before that, since updating glusterfs from 3.8.5 to 3.8.9 may itself
help, I am going to do that first.
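
For reference, on our CentOS 6 clients I expect the update to be roughly
the following (assuming the CentOS-Gluster-3.8 yum repo already in use,
as noted in the original report; existing mounts keep running the old
client code until remounted):

    yum clean metadata
    yum update glusterfs glusterfs-fuse
    # remount the volume afterwards so the new client code is used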

Regards

Yonex

2017-02-17 15:03 GMT+09:00 Mohammed Rafi K C <rkavunga@xxxxxxxxxx>:
> Hi Yonex
>
> Poornima recently fixed a corruption issue with upcall, which seems
> unlikely to be the cause of your issue given that you are running fuse
> clients. Even so, I would like to give you a debug build that includes
> the fix [1] and adds additional logging.
>
> Will you be able to run the debug build?
>
>
> [1] : https://review.gluster.org/#/c/16613/
>
> Regards
>
> Rafi KC
>
>
> On 02/16/2017 09:13 PM, yonex wrote:
>> Hi Rafi,
>>
>> I'm still working on this issue, but I have not yet managed to
>> reproduce it outside of production. In the production environment, I
>> have stopped the applications from writing data to the glusterfs
>> volume; only read operations are running now.
>>
>> P.S. It seems that I have broken the threading of this email thread... ;-(
>> http://lists.gluster.org/pipermail/gluster-users/2017-January/029679.html
>>
>> 2017-02-14 17:19 GMT+09:00 Mohammed Rafi K C <rkavunga@xxxxxxxxxx>:
>>> Hi Yonex,
>>>
>>> Are you still hitting this issue?
>>>
>>>
>>> Regards
>>>
>>> Rafi KC
>>>
>>>
>>> On 01/16/2017 10:36 AM, yonex wrote:
>>>
>>> Hi
>>>
>>> I noticed a severe throughput degradation while the gdb script is
>>> attached to a glusterfs client process: write speed drops to 2% of
>>> normal or less. It cannot be kept running in production.
>>>
>>> Could you provide the custom build that you mentioned before? I am going to
>>> keep trying to reproduce the problem outside of the production environment.
>>>
>>> Regards
>>>
>>> 2017-01-08 21:54, Mohammed Rafi K C <rkavunga@xxxxxxxxxx>:
>>>
>>> Is there any update on this?
>>>
>>>
>>> Regards
>>>
>>> Rafi KC
>>>
>>> On 12/24/2016 03:53 PM, yonex wrote:
>>>
>>> Rafi,
>>>
>>>
>>> Thanks again. I will try that and get back to you.
>>>
>>>
>>> Regards.
>>>
>>>
>>>
>>> 2016-12-23 18:03 GMT+09:00 Mohammed Rafi K C <rkavunga@xxxxxxxxxx>:
>>>
>>> Hi Yonex,
>>>
>>> As we discussed in IRC #gluster-devel, I have attached the gdb script
>>> to this mail.
>>>
>>> Procedure to run the gdb script:
>>>
>>> 1) Install gdb.
>>>
>>> 2) Download and install the gluster debuginfo packages for your
>>> machine. Package location: https://cbs.centos.org/koji/buildinfo?buildID=12757
>>>
>>> 3) Find the process ID and attach gdb to the process using the command
>>>    gdb attach <pid> -x <path_to_script>
>>>
>>> 4) Keep the script running until you hit the problem.
>>>
>>> 5) Stop gdb.
>>>
>>> 6) You will see a file called mylog.txt in the directory where you ran
>>> gdb.
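>>>
>>> For reference, a script of this general shape could do the job. This
>>> is only a sketch, not the attached script itself; the breakpoint
>>> symbol rpc_transport_disconnect is an assumed name for a relevant
>>> glusterfs function, so substitute whatever the debug build actually
>>> targets:
>>>
>>>     # sketch.gdb - hypothetical example; run via: gdb attach <pid> -x sketch.gdb
>>>     set pagination off
>>>     set logging file mylog.txt
>>>     set logging on
>>>     # assumed symbol: fires whenever a transport is torn down
>>>     break rpc_transport_disconnect
>>>     commands
>>>       backtrace    # record who initiated the disconnect
>>>       continue     # resume the client so production I/O keeps flowing
>>>     end
>>>     continue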
>>>
>>> Please keep an eye on the attached process. If you have any
>>> questions, please feel free to get back to me.
>>>
>>>
>>> Regards
>>>
>>>
>>> Rafi KC
>>>
>>>
>>>
>>> On 12/19/2016 05:33 PM, Mohammed Rafi K C wrote:
>>>
>>> On 12/19/2016 05:32 PM, Mohammed Rafi K C wrote:
>>>
>>> Client 0-glusterfs01-client-2 has disconnected from its brick around
>>> 2016-12-15 11:21:17.854249. Can you look at and/or paste the brick
>>> logs from around that time?
>>>
>>> You can find the brick name and hostname for 0-glusterfs01-client-2 in
>>> the client graph.
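>>>
>>> For example (assuming the volume is named glusterfs01, as in your
>>> pasted volume info), the brick order in the output below matches the
>>> client-N numbering, so 0-glusterfs01-client-2 is the third brick
>>> listed:
>>>
>>>     gluster volume info glusterfs01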
>>>
>>>
>>> Rafi
>>>
>>>
>>> Are you in any of the gluster IRC channels? If so, do you have a
>>> nickname that I can search for?
>>>
>>>
>>> Regards
>>>
>>> Rafi KC
>>>
>>>
>>> On 12/19/2016 04:28 PM, yonex wrote:
>>>
>>> Rafi,
>>>
>>>
>>> OK. Thanks for your guidance. I found the debug log and pasted the
>>> lines around that point:
>>>
>>> http://pastebin.com/vhHR6PQN
>>>
>>>
>>> Regards
>>>
>>>
>>>
>>> 2016-12-19 14:58 GMT+09:00 Mohammed Rafi K C <rkavunga@xxxxxxxxxx>:
>>>
>>> On 12/16/2016 09:10 PM, yonex wrote:
>>>
>>> Rafi,
>>>
>>>
>>> Thanks, the .meta feature, which I didn't know about, is very nice. I
>>> have finally captured debug logs from a client and the bricks.
>>>
>>>
>>> A mount log:
>>>
>>> - http://pastebin.com/Tjy7wGGj
>>>
>>>
>>> FYI rickdom126 is my client's hostname.
>>>
>>>
>>> Brick logs around that time:
>>>
>>> - Brick1: http://pastebin.com/qzbVRSF3
>>>
>>> - Brick2: http://pastebin.com/j3yMNhP3
>>>
>>> - Brick3: http://pastebin.com/m81mVj6L
>>>
>>> - Brick4: http://pastebin.com/JDAbChf6
>>>
>>> - Brick5: http://pastebin.com/7saP6rsm
>>>
>>>
>>> However, I could not find any message like "EOF on socket". I hope
>>> there is some helpful information in the logs above.
>>>
>>> Indeed. I understand that the connections are in a disconnected state,
>>> but what I am particularly looking for is the cause of the disconnect.
>>> Can you paste the debug logs from when the disconnect starts, and from
>>> around that point? You may see a debug log that says "disconnecting now".
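>>>
>>> One way to pull those lines (a sketch; the log file name is an
>>> assumption based on the default /var/log/glusterfs/<mount-point>.log
>>> naming for fuse mounts, with slashes replaced by dashes):
>>>
>>>     grep -B 5 -A 20 'disconnecting now' /var/log/glusterfs/glusterfs01-db1.log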
>>>
>>>
>>>
>>> Regards
>>>
>>> Rafi KC
>>>
>>>
>>>
>>> Regards.
>>>
>>>
>>>
>>> 2016-12-14 15:20 GMT+09:00 Mohammed Rafi K C <rkavunga@xxxxxxxxxx>:
>>>
>>> On 12/13/2016 09:56 PM, yonex wrote:
>>>
>>> Hi Rafi,
>>>
>>>
>>> Thanks for your response. OK, I think it is possible to capture debug
>>> logs, since the error seems to reproduce a few times per day. I will
>>> try that. However, since I want to avoid redundant debug output where
>>> possible, is there a way to enable debug logging only on specific
>>> client nodes?
>>>
>>> If you are using a fuse mount, there is a proc-like feature called
>>> .meta. You can set the log level for a particular client through it
>>> [1]. But I also want logs from the bricks, because I suspect the brick
>>> processes of initiating the disconnects.
>>>
>>> [1] e.g.: echo 8 > /mnt/glusterfs/.meta/logging/loglevel
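>>>
>>> A slightly fuller sketch of the same idea (assuming the mount point
>>> /mnt/glusterfs; in the loglevel numbering, 8 is DEBUG and 7 is the
>>> default INFO):
>>>
>>>     cat /mnt/glusterfs/.meta/logging/loglevel       # check the current level
>>>     echo 8 > /mnt/glusterfs/.meta/logging/loglevel  # DEBUG, this client only
>>>     # ... reproduce the problem, then restore the default:
>>>     echo 7 > /mnt/glusterfs/.meta/logging/loglevel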
>>>
>>>
>>> Regards
>>>
>>>
>>> Yonex
>>>
>>>
>>> 2016-12-13 23:33 GMT+09:00 Mohammed Rafi K C <rkavunga@xxxxxxxxxx>:
>>>
>>> Hi Yonex,
>>>
>>>
>>> Is this consistently reproducible? If so, can you enable debug logging
>>> [1] and check for any message similar to [2]? Basically, you can even
>>> search for "EOF on socket".
>>>
>>> You can set your log level back to the default (INFO) after capturing
>>> for some time.
>>>
>>>
>>>
>>> [1] : gluster volume set <volname> diagnostics.brick-log-level DEBUG
>>>       gluster volume set <volname> diagnostics.client-log-level DEBUG
>>>
>>>
>>> [2] : http://pastebin.com/xn8QHXWa
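>>>
>>> To drop both options back to their defaults afterwards, a volume reset
>>> should work (a sketch; reset reverts an option to its default value):
>>>
>>>     gluster volume reset <volname> diagnostics.brick-log-level
>>>     gluster volume reset <volname> diagnostics.client-log-level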
>>>
>>>
>>>
>>> Regards
>>>
>>>
>>> Rafi KC
>>>
>>>
>>> On 12/12/2016 09:35 PM, yonex wrote:
>>>
>>> Hi,
>>>
>>>
>>> When my application moves a file from its local disk to a FUSE-mounted
>>> GlusterFS volume, the client outputs many warnings and errors, not
>>> always but occasionally. The volume is a simple distributed volume.
>>>
>>> A sample of the logs is pasted here: http://pastebin.com/axkTCRJX
>>>
>>> At first glance it looks like a network disconnection ("Transport
>>> endpoint is not connected"), but other networking applications on the
>>> same machines don't observe anything similar, so I suspect a problem
>>> somewhere in the GlusterFS stack.
>>>
>>>
>>> It ended in a failure to rename a file, logging PHP warnings like
>>> those below:
>>>
>>> PHP Warning: rename(/glusterfs01/db1/stack/f0/13a9a2f0): failed
>>> to open stream: Input/output error in [snipped].php on line 278
>>> PHP Warning:
>>> rename(/var/stack/13a9a2f0,/glusterfs01/db1/stack/f0/13a9a2f0):
>>> Input/output error in [snipped].php on line 278
>>>
>>>
>>> Conditions:
>>>
>>> - GlusterFS 3.8.5 installed via yum from CentOS-Gluster-3.8.repo
>>> - Volume info and status pasted: http://pastebin.com/JPt2KeD8
>>> - Client machines' OS: Scientific Linux 6 or CentOS 6
>>> - Server machines' OS: CentOS 6
>>> - Kernel version is 2.6.32-642.6.2.el6.x86_64 on all machines
>>> - The number of connected FUSE clients is 260
>>> - No firewall between the connected machines
>>> - Neither remounting the volume nor rebooting the client machines helps
>>> - It is triggered not only by rename() but also by copy() and
>>>   filesize() operations
>>> - There is no output in the brick logs when it happens
>>>
>>>
>>> Any ideas? I'd appreciate any help.
>>>
>>>
>>> Regards.
>>>
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users



