On 08.07.2014 13:24, Pranith Kumar Karampuri wrote:
>
> On 07/08/2014 04:49 PM, Norman Mähler wrote:
>
>> On 08.07.2014 13:02, Pranith Kumar Karampuri wrote:
>>>> On 07/08/2014 04:23 PM, Norman Mähler wrote:
>>>> Of course:
>>>>
>>>> The configuration is:
>>>>
>>>> Volume Name: gluster_dateisystem
>>>> Type: Replicate
>>>> Volume ID: 2766695c-b8aa-46fd-b84d-4793b7ce847a
>>>> Status: Started
>>>> Number of Bricks: 1 x 2 = 2
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: filecluster1:/mnt/raid
>>>> Brick2: filecluster2:/mnt/raid
>>>> Options Reconfigured:
>>>> nfs.enable-ino32: on
>>>> performance.cache-size: 512MB
>>>> diagnostics.brick-log-level: WARNING
>>>> diagnostics.client-log-level: WARNING
>>>> nfs.addr-namelookup: off
>>>> performance.cache-refresh-timeout: 60
>>>> performance.cache-max-file-size: 100MB
>>>> performance.write-behind-window-size: 10MB
>>>> performance.io-thread-count: 18
>>>> performance.stat-prefetch: off
>>>>
>>>> The file count in xattrop is
>>>>> Do "gluster volume set gluster_dateisystem cluster.self-heal-daemon off".
>>>>> This should stop all the entry self-heals and should also bring the CPU
>>>>> usage down. When you don't have a lot of activity you can enable it
>>>>> again with "gluster volume set gluster_dateisystem cluster.self-heal-daemon on".
>>>>> If that doesn't bring the CPU down, execute "gluster volume set
>>>>> gluster_dateisystem cluster.entry-self-heal off". Let me know how it goes.
>>>>> Pranith
> Thanks for your help so far, but stopping the self-heal daemon and the
> self-heal mechanism itself did not improve the situation.
>
> Do you have further suggestions? Is it simply the load on the system?
> NFS could handle it easily before...
>> Is it at least a little better, or no improvement at all?
>
>> Pranith
There is a very small improvement of about 1 point in the 15-minute load.
The 15-minute load is at about 20 to 22 at the moment.

Norman
>
> Norman
>
>>>> Brick 1: 2706
>>>> Brick 2: 2687
>>>>
>>>> Norman
>>>>
>>>> On 08.07.2014 12:28, Pranith Kumar Karampuri wrote:
>>>>>>> It seems like entry self-heal is happening. What is the volume
>>>>>>> configuration? Could you give the "ls <brick-path>/.glusterfs/indices/xattrop | wc -l"
>>>>>>> count for all the bricks?
>>>>>>>
>>>>>>> Pranith
>>>>>>> On 07/08/2014 03:36 PM, Norman Mähler wrote:
>>>>>>>> Hello Pranith,
>>>>>>>>
>>>>>>>> here are the logs. I am only giving you the last 3000 lines,
>>>>>>>> because the nfs.log from today is already 550 MB.
>>>>>>>>
>>>>>>>> These are the standard files from a user home on the gluster
>>>>>>>> system. Everything you normally find in a user home: config files,
>>>>>>>> Firefox and Thunderbird files, etc.
>>>>>>>>
>>>>>>>> Thanks in advance
>>>>>>>> Norman
>>>>>>>>
>>>>>>>> On 08.07.2014 11:46, Pranith Kumar Karampuri wrote:
>>>>>>>>> On 07/08/2014 02:46 PM, Norman Mähler wrote:
>>>>>>>>> Hello again,
>>>>>>>>>
>>>>>>>>> I could resolve the self-heal problems with the missing gfid files
>>>>>>>>> on one of the servers by deleting the gfid files on the other server.
>>>>>>>>>
>>>>>>>>> They had a link count of 1, which means that the file the gfid
>>>>>>>>> pointed to was already deleted.
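>>>>>>>>>
>>>>>>>>> In case it helps anyone else, I located them with something along
>>>>>>>>> these lines (rough sketch, assuming the brick root is /mnt/raid as
>>>>>>>>> in our volume info; review the output before deleting anything,
>>>>>>>>> since other housekeeping files under .glusterfs may also match):
>>>>>>>>>
>>>>>>>>>   # regular files under .glusterfs with a link count of 1, i.e. gfid
>>>>>>>>>   # hard links whose real file on the brick is already gone;
>>>>>>>>>   # skip the indices/ directory used by self-heal
>>>>>>>>>   find /mnt/raid/.glusterfs -path '*/indices' -prune -o -type f -links 1 -print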
>>>>>>>>>
>>>>>>>>> We still have these errors,
>>>>>>>>>
>>>>>>>>> [2014-07-08 09:09:43.564488] W [client-rpc-fops.c:2469:client3_3_link_cbk]
>>>>>>>>> 0-gluster_dateisystem-client-0: remote operation failed: File exists
>>>>>>>>> (00000000-0000-0000-0000-000000000000 ->
>>>>>>>>> <gfid:b338b09e-2577-45b3-82bd-032f954dd083>/lock)
>>>>>>>>>
>>>>>>>>> which appear in the glusterfshd.log, and these
>>>>>>>>>
>>>>>>>>> [2014-07-08 09:13:31.198462] E [client-rpc-fops.c:5179:client3_3_inodelk]
>>>>>>>>> (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/cluster/replicate.so(+0x466b8)
>>>>>>>>> [0x7f5d29d4e6b8]
>>>>>>>>> (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/cluster/replicate.so(afr_lock_blocking+0x844)
>>>>>>>>> [0x7f5d29d4e2e4]
>>>>>>>>> (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/protocol/client.so(client_inodelk+0x99)
>>>>>>>>> [0x7f5d29f8b3c9]))) 0-: Assertion failed: 0
>>>>>>>>>
>>>>>>>>> from the nfs.log.
>>>>>>>>>> Could you attach mount (nfs.log) and brick logs please. Do you
>>>>>>>>>> have files with lots of hard links?
>>>>>>>>>> Pranith
>>>>>>>>> I think the error messages belong together, but I don't have any
>>>>>>>>> idea how to solve them.
>>>>>>>>>
>>>>>>>>> We still have a very bad performance problem as well. The system
>>>>>>>>> load on the servers is above 20, and hardly anyone is able to work
>>>>>>>>> on a client...
>>>>>>>>>
>>>>>>>>> Hoping for help
>>>>>>>>> Norman
>>>>>>>>>
>>>>>>>>> On 07.07.2014 15:39, Pranith Kumar Karampuri wrote:
>>>>>>>>>>>> On 07/07/2014 06:58 PM, Norman Mähler wrote:
>>>>>>>>>>>> Dear community,
>>>>>>>>>>>>
>>>>>>>>>>>> we have some serious problems with our Gluster installation.
>>>>>>>>>>>>
>>>>>>>>>>>> Here is the setting:
>>>>>>>>>>>>
>>>>>>>>>>>> We have 2 bricks (version 3.4.4) on Debian 7.5, one of them with
>>>>>>>>>>>> an NFS export. There are about 120 clients connecting to the
>>>>>>>>>>>> exported NFS. These clients are thin clients reading and writing
>>>>>>>>>>>> their Linux home directories from the exported NFS.
>>>>>>>>>>>>
>>>>>>>>>>>> We want to change the access of these clients one by one to
>>>>>>>>>>>> access via the gluster client.
>>>>>>>>>>>>> I did not understand what you meant by this. Are you moving to
>>>>>>>>>>>>> glusterfs-fuse based mounts?
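>>>>>>>>>>>>> If so, for reference, a native mount would look roughly like
>>>>>>>>>>>>> this (just a sketch; /mnt/home is a hypothetical mount point,
>>>>>>>>>>>>> and either of your servers can serve the volfile):
>>>>>>>>>>>>>
>>>>>>>>>>>>>   # mount the replicated volume over FUSE instead of NFS
>>>>>>>>>>>>>   mount -t glusterfs filecluster1:/gluster_dateisystem /mnt/home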
>>>>>>>>>>>> Here are our problems:
>>>>>>>>>>>>
>>>>>>>>>>>> At the moment we have two types of error messages which come in
>>>>>>>>>>>> bursts to our glusterfshd.log:
>>>>>>>>>>>>
>>>>>>>>>>>> [2014-07-07 13:10:21.572487] W [client-rpc-fops.c:1538:client3_3_inodelk_cbk]
>>>>>>>>>>>> 0-gluster_dateisystem-client-1: remote operation failed: No such file or directory
>>>>>>>>>>>> [2014-07-07 13:10:21.573448] W [client-rpc-fops.c:471:client3_3_open_cbk]
>>>>>>>>>>>> 0-gluster_dateisystem-client-1: remote operation failed: No such file or directory.
>>>>>>>>>>>> Path: <gfid:b0c4f78a-249f-4db7-9d5b-0902c7d8f6cc>
>>>>>>>>>>>> (00000000-0000-0000-0000-000000000000)
>>>>>>>>>>>> [2014-07-07 13:10:21.573468] E [afr-self-heal-data.c:1270:afr_sh_data_open_cbk]
>>>>>>>>>>>> 0-gluster_dateisystem-replicate-0: open of
>>>>>>>>>>>> <gfid:b0c4f78a-249f-4db7-9d5b-0902c7d8f6cc> failed on child
>>>>>>>>>>>> gluster_dateisystem-client-1 (No such file or directory)
>>>>>>>>>>>>
>>>>>>>>>>>> This looks like a missing gfid file on one of the bricks. I
>>>>>>>>>>>> looked it up, and yes, the file is missing on the second brick.
>>>>>>>>>>>>
>>>>>>>>>>>> We get these messages the other way round, too (missing on
>>>>>>>>>>>> client-0 and the first brick).
>>>>>>>>>>>>
>>>>>>>>>>>> Is it possible to repair this by copying the gfid file to the
>>>>>>>>>>>> brick where it is missing? Or is there another way to repair it?
>>>>>>>>>>>>
>>>>>>>>>>>> The second message is
>>>>>>>>>>>>
>>>>>>>>>>>> [2014-07-07 13:06:35.948738] W [client-rpc-fops.c:2469:client3_3_link_cbk]
>>>>>>>>>>>> 0-gluster_dateisystem-client-1: remote operation failed: File exists
>>>>>>>>>>>> (00000000-0000-0000-0000-000000000000 ->
>>>>>>>>>>>> <gfid:aae47250-8f69-480c-ac75-2da2f4d21d7a>/lock)
>>>>>>>>>>>>
>>>>>>>>>>>> and I really do not know what to do with this one...
>>>>>>>>>>>>> Did any of the bricks go offline and come back online?
>>>>>>>>>>>>> Pranith
>>>>>>>>>>>> I am really looking forward to your help, because this is an
>>>>>>>>>>>> active system and the system load on the NFS brick is about 25 (!!).
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks in advance!
>>>>>>>>>>>> Norman Maehler

--
Kind regards,

Norman Mähler

Bereichsleiter IT-Hochschulservice
uni-assist e. V.
Geneststr. 5, Aufgang H, 3. Etage
10829 Berlin

Tel.: 030-66644382
n.maehler@xxxxxxxxxxxxx

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users