Is there any update on this?

Regards
Rafi KC

On 12/24/2016 03:53 PM, yonex wrote:
> Rafi,
>
> Thanks again. I will try that and get back to you.
>
> Regards.
>
>
> 2016-12-23 18:03 GMT+09:00 Mohammed Rafi K C <rkavunga@xxxxxxxxxx>:
>> Hi Yonex,
>>
>> As we discussed on IRC in #gluster-devel, I have attached the gdb script
>> along with this mail.
>>
>> Procedure to run the gdb script:
>>
>> 1) Install gdb.
>>
>> 2) Download and install the gluster debuginfo packages for your machine.
>> Package location ---> https://cbs.centos.org/koji/buildinfo?buildID=12757
>>
>> 3) Find the process id and attach gdb to the process using the command
>> gdb attach <pid> -x <path_to_script>
>>
>> 4) Continue running the script until you hit the problem.
>>
>> 5) Stop gdb.
>>
>> 6) You will see a file called mylog.txt in the location where you ran
>> gdb.
>>
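>> Just to illustrate, once gdb and the debuginfo packages from the link
>> above are installed, the session might look roughly like this (<pid> and
>> <path_to_script> are placeholders for your process id and the attached
>> script):
>>
>>     gdb attach <pid> -x <path_to_script>
>>     (gdb) continue        # 4) let it run until the problem shows up
>>     (gdb) quit            # 5) stop gdb (Ctrl-C first if it is still running)
>>     cat mylog.txt         # 6) written to the directory gdb was started from
>>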
>> Please keep an eye on the attached process. If you have any doubt, please
>> feel free to get back to me.
>>
>> Regards
>>
>> Rafi KC
>>
>>
>> On 12/19/2016 05:33 PM, Mohammed Rafi K C wrote:
>>> On 12/19/2016 05:32 PM, Mohammed Rafi K C wrote:
>>>> Client 0-glusterfs01-client-2 has disconnected from the bricks around
>>>> 2016-12-15 11:21:17.854249. Can you look at and/or paste the brick logs
>>>> from around that time?
>>> You can find the brick name and hostname for 0-glusterfs01-client-2 from
>>> the client graph.
>>>
>>> Rafi
>>>
>>>> Are you on any of the Gluster IRC channels? If so, is there a nickname
>>>> I can search for?
>>>>
>>>> Regards
>>>> Rafi KC
>>>>
>>>> On 12/19/2016 04:28 PM, yonex wrote:
>>>>> Rafi,
>>>>>
>>>>> OK. Thanks for your guidance. I found the debug log and pasted the
>>>>> lines around that: http://pastebin.com/vhHR6PQN
>>>>>
>>>>> Regards
>>>>>
>>>>>
>>>>> 2016-12-19 14:58 GMT+09:00 Mohammed Rafi K C <rkavunga@xxxxxxxxxx>:
>>>>>> On 12/16/2016 09:10 PM, yonex wrote:
>>>>>>> Rafi,
>>>>>>>
>>>>>>> Thanks, the .meta feature, which I didn't know about, is very nice.
>>>>>>> I have finally captured debug logs from a client and the bricks.
>>>>>>>
>>>>>>> A mount log:
>>>>>>> - http://pastebin.com/Tjy7wGGj
>>>>>>>
>>>>>>> FYI, rickdom126 is my client's hostname.
>>>>>>>
>>>>>>> Brick logs around that time:
>>>>>>> - Brick1: http://pastebin.com/qzbVRSF3
>>>>>>> - Brick2: http://pastebin.com/j3yMNhP3
>>>>>>> - Brick3: http://pastebin.com/m81mVj6L
>>>>>>> - Brick4: http://pastebin.com/JDAbChf6
>>>>>>> - Brick5: http://pastebin.com/7saP6rsm
>>>>>>>
>>>>>>> However, I could not find any message like "EOF on socket". I hope
>>>>>>> there is some helpful information in the logs above.
>>>>>> Indeed. I understand that the connections are in a disconnected state,
>>>>>> but what I'm particularly looking for is the cause of the disconnect.
>>>>>> Can you paste the debug logs from when it starts disconnecting, and
>>>>>> around that? You may see a debug log that says "disconnecting now".
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Rafi KC
>>>>>>
>>>>>>
>>>>>>> Regards.
>>>>>>>
>>>>>>>
>>>>>>> 2016-12-14 15:20 GMT+09:00 Mohammed Rafi K C <rkavunga@xxxxxxxxxx>:
>>>>>>>> On 12/13/2016 09:56 PM, yonex wrote:
>>>>>>>>> Hi Rafi,
>>>>>>>>>
>>>>>>>>> Thanks for your response. OK, I think it is possible to capture debug
>>>>>>>>> logs, since the error seems to be reproduced a few times per day. I
>>>>>>>>> will try that. However, since I want to avoid redundant debug output
>>>>>>>>> if possible, is there a way to enable debug logging only on specific
>>>>>>>>> client nodes?
>>>>>>>> If you are using a fuse mount, there is a proc-like feature called .meta.
>>>>>>>> You can set the log level through that for a particular client [1]. But I
>>>>>>>> also want logs from the bricks, because I suspect the brick processes of
>>>>>>>> initiating the disconnects.
>>>>>>>>
>>>>>>>>
>>>>>>>> [1] e.g.: echo 8 > /mnt/glusterfs/.meta/logging/loglevel
>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> Yonex
>>>>>>>>>
>>>>>>>>> 2016-12-13 23:33 GMT+09:00 Mohammed Rafi K C <rkavunga@xxxxxxxxxx>:
>>>>>>>>>> Hi Yonex,
>>>>>>>>>>
>>>>>>>>>> Is this consistently reproducible? If so, can you enable the debug
>>>>>>>>>> log [1] and check for any message similar to [2]? Basically, you can
>>>>>>>>>> even search for "EOF on socket".
>>>>>>>>>>
>>>>>>>>>> You can set your log level back to the default (INFO) after capturing
>>>>>>>>>> for some time.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [1] : gluster volume set <volname> diagnostics.brick-log-level DEBUG and
>>>>>>>>>> gluster volume set <volname> diagnostics.client-log-level DEBUG
>>>>>>>>>>
>>>>>>>>>> [2] : http://pastebin.com/xn8QHXWa
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>>
>>>>>>>>>> Rafi KC
>>>>>>>>>>
>>>>>>>>>> On 12/12/2016 09:35 PM, yonex wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> When my application moves a file from its local disk to a FUSE-mounted
>>>>>>>>>>> GlusterFS volume, the client outputs many warnings and errors, not
>>>>>>>>>>> always but occasionally. The volume is a simple distributed volume.
>>>>>>>>>>>
>>>>>>>>>>> A sample of the logs is pasted here: http://pastebin.com/axkTCRJX
>>>>>>>>>>>
>>>>>>>>>>> At a glance it looks like a network disconnection ("Transport endpoint
>>>>>>>>>>> is not connected"), but other networking applications on the same
>>>>>>>>>>> machine don't observe any such thing, so I guess there may be a
>>>>>>>>>>> problem somewhere in the GlusterFS stack.
>>>>>>>>>>>
>>>>>>>>>>> It ends in failing to rename a file, logging PHP warnings like those below:
>>>>>>>>>>>
>>>>>>>>>>> PHP Warning: rename(/glusterfs01/db1/stack/f0/13a9a2f0): failed
>>>>>>>>>>> to open stream: Input/output error in [snipped].php on line 278
>>>>>>>>>>> PHP Warning:
>>>>>>>>>>> rename(/var/stack/13a9a2f0,/glusterfs01/db1/stack/f0/13a9a2f0):
>>>>>>>>>>> Input/output error in [snipped].php on line 278
>>>>>>>>>>>
>>>>>>>>>>> Conditions:
>>>>>>>>>>>
>>>>>>>>>>> - GlusterFS 3.8.5 installed via yum from CentOS-Gluster-3.8.repo
>>>>>>>>>>> - Volume info and status pasted here: http://pastebin.com/JPt2KeD8
>>>>>>>>>>> - Client machines' OS: Scientific Linux 6 or CentOS 6.
>>>>>>>>>>> - Server machines' OS: CentOS 6.
>>>>>>>>>>> - Kernel version is 2.6.32-642.6.2.el6.x86_64 on all machines.
>>>>>>>>>>> - The number of connected FUSE clients is 260.
>>>>>>>>>>> - No firewall between the connected machines.
>>>>>>>>>>> - Neither remounting volumes nor rebooting client machines has any effect.
>>>>>>>>>>> - It is triggered not only by rename() but also by copy() and filesize() operations.
>>>>>>>>>>> - No output in the brick logs when it happens.
>>>>>>>>>>>
>>>>>>>>>>> Any ideas? I'd appreciate any help.
>>>>>>>>>>>
>>>>>>>>>>> Regards.
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Gluster-users mailing list
>>>>>>>>>>> Gluster-users@xxxxxxxxxxx
>>>>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-users
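>>>>>>>>>>
>>>>>>>>>> For example, the whole capture could look roughly like this
>>>>>>>>>> (<volname> is a placeholder for your volume, and the grep assumes
>>>>>>>>>> the default log directory; run it on the servers for brick logs and
>>>>>>>>>> on the clients for mount logs):
>>>>>>>>>>
>>>>>>>>>>     gluster volume set <volname> diagnostics.brick-log-level DEBUG
>>>>>>>>>>     gluster volume set <volname> diagnostics.client-log-level DEBUG
>>>>>>>>>>     # ... wait until the error reproduces, then search the logs:
>>>>>>>>>>     grep -R "EOF on socket" /var/log/glusterfs/
>>>>>>>>>>     # and set the levels back to the default afterwards:
>>>>>>>>>>     gluster volume set <volname> diagnostics.brick-log-level INFO
>>>>>>>>>>     gluster volume set <volname> diagnostics.client-log-level INFO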