Re: Self-heal Problems with gluster and nfs

On 08.07.2014 16:26, Pranith Kumar Karampuri wrote:
> 
> On 07/08/2014 06:14 PM, Norman Mähler wrote:
> 
> 
> On 08.07.2014 14:34, Pranith Kumar Karampuri wrote:
>>>> On 07/08/2014 05:23 PM, Norman Mähler wrote:
>>>> 
>>>> 
>>>> On 08.07.2014 13:24, Pranith Kumar Karampuri wrote:
>>>>>>> On 07/08/2014 04:49 PM, Norman Mähler wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> On 08.07.2014 13:02, Pranith Kumar Karampuri wrote:
>>>>>>>>>> On 07/08/2014 04:23 PM, Norman Mähler wrote:
>>>>>>>>>> Of course:
>>>>>>>>>> 
>>>>>>>>>> The configuration is:
>>>>>>>>>> 
>>>>>>>>>> Volume Name: gluster_dateisystem
>>>>>>>>>> Type: Replicate
>>>>>>>>>> Volume ID: 2766695c-b8aa-46fd-b84d-4793b7ce847a
>>>>>>>>>> Status: Started
>>>>>>>>>> Number of Bricks: 1 x 2 = 2
>>>>>>>>>> Transport-type: tcp
>>>>>>>>>> Bricks:
>>>>>>>>>> Brick1: filecluster1:/mnt/raid
>>>>>>>>>> Brick2: filecluster2:/mnt/raid
>>>>>>>>>> Options Reconfigured:
>>>>>>>>>> nfs.enable-ino32: on
>>>>>>>>>> performance.cache-size: 512MB
>>>>>>>>>> diagnostics.brick-log-level: WARNING
>>>>>>>>>> diagnostics.client-log-level: WARNING
>>>>>>>>>> nfs.addr-namelookup: off
>>>>>>>>>> performance.cache-refresh-timeout: 60
>>>>>>>>>> performance.cache-max-file-size: 100MB
>>>>>>>>>> performance.write-behind-window-size: 10MB
>>>>>>>>>> performance.io-thread-count: 18
>>>>>>>>>> performance.stat-prefetch: off
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> The file count in xattrop is
>>>>>>>>>>> Do "gluster volume set gluster_dateisystem 
>>>>>>>>>>> cluster.self-heal-daemon off" This should stop
>>>>>>>>>>> all the entry self-heals and should also get
>>>>>>>>>>> the CPU usage low. When you don't have a lot of
>>>>>>>>>>> activity you can enable it again using "gluster
>>>>>>>>>>> volume set gluster_dateisystem
>>>>>>>>>>> cluster.self-heal-daemon on" If it doesn't get
>>>>>>>>>>> the CPU down execute "gluster volume set
>>>>>>>>>>> gluster_dateisystem cluster.entry-self-heal
>>>>>>>>>>> off". Let me know how it goes. Pranith
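
(For reference, the commands suggested above, collected in one place. This is only a sketch: the volume name is the one from the configuration above, and the options are meant to be switched back on during a quiet window so healing can finish.)

    # temporarily stop the self-heal daemon (proactive heals)
    gluster volume set gluster_dateisystem cluster.self-heal-daemon off
    # if the CPU load still does not drop, also disable entry self-heal
    gluster volume set gluster_dateisystem cluster.entry-self-heal off
    # re-enable both once there is little activity on the volume
    gluster volume set gluster_dateisystem cluster.self-heal-daemon on
    gluster volume set gluster_dateisystem cluster.entry-self-heal on
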
>>>>>>> Thanks for your help so far, but stopping the self-heal
>>>>>>> daemon and the self-heal mechanism itself did not
>>>>>>> improve the situation.
>>>>>>> 
>>>>>>> Do you have further suggestions? Is it simply the load
>>>>>>> on the system? NFS could handle it easily before...
>>>>>>>> Is it at least a little better or no improvement at
>>>>>>>> all?
>>>> After waiting half an hour more the system load is falling 
>>>> steadily. At the moment it is around 10 which is not good but
>>>> a lot better than before. There are no messages in the
>>>> nfs.log and the glusterfshd.log anymore. In the brick log
>>>> there are still "inode not found - anonymous fd creation
>>>> failed" messages.
>>>>> They should go away once the heal is complete and the
>>>>> system is back to normal. I believe you have directories
>>>>> with lots of files? When can you start the healing process
>>>>> again (i.e. window where there won't be a lot of activity
>>>>> and you can afford the high CPU usage) so that things will
>>>>> be back to normal?
> We have got a window at night, but by now our admin has decided to
> copy the files back to an NFS system, because even with disabled
> self-heal our colleagues cannot do their work with such a slow
> system.
>> This performance problem is addressed in 3.6 with a design change
>> in the replication module in glusterfs.

Ok, this sounds good.
> 
> After that we may be able to start again with a new system. We are
> considering another network cluster system, but we are not
> quite sure what to do.
> 
>> Things should be smooth again after the self-heals are complete,
>> IMO. What is the size of the volume? How many files, approximately?
>> It would be nice if you could provide the complete logs, at least
>> later, to help with the analysis.
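
(A quick way to answer both questions directly on a brick, as a rough sketch; the paths are the brick paths from the volume info above, and the heal-info command shows what is still pending.)

    du -sh /mnt/raid                              # approximate data size on this brick
    find /mnt/raid -path '*/.glusterfs' -prune -o -type f -print | wc -l   # rough file count
    gluster volume heal gluster_dateisystem info  # entries still waiting for self-heal
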

There are about 250 GB in approximately 650000 files on the volumes.
I will send you an additional mail with links to the complete logs later.

Norman

> 
>> Pranith
> 
> 
> There are a lot of small files and lock files in these
> directories.
> 
> Norman
> 
> 
>>>>> Pranith
>>>> 
>>>> 
>>>> Norman
>>>> 
>>>>>>>> Pranith
>>>>>>> Norman
>>>>>>> 
>>>>>>>>>> Brick 1: 2706 Brick 2: 2687
>>>>>>>>>> 
>>>>>>>>>> Norman
>>>>>>>>>> 
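
(Those counts come from the pending-heal index directory that Pranith asks about below; a minimal sketch of how to collect them, assuming the brick path /mnt/raid from the volume info above and running it on each brick server:)

    ls /mnt/raid/.glusterfs/indices/xattrop | wc -l
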
>>>>>>>>>> On 08.07.2014 12:28, Pranith Kumar Karampuri wrote:
>>>>>>>>>>>>> It seems like entry self-heal is happening.
>>>>>>>>>>>>> What is the volume configuration? Could you give
>>>>>>>>>>>>> ls <brick-path>/.glusterfs/indices/xattrop | wc -l
>>>>>>>>>>>>> counts for all the bricks?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Pranith
>>>>>>>>>>>>> On 07/08/2014 03:36 PM, Norman Mähler wrote:
>>>>>>>>>>>>>> Hello Pranith,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> here are the logs. I only give you the
>>>>>>>>>>>>>> last 3000 lines, because the nfs.log from
>>>>>>>>>>>>>> today is already 550 MB.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> These are the standard files from a user
>>>>>>>>>>>>>> home on the gluster system, everything you
>>>>>>>>>>>>>> normally find in a user home: config
>>>>>>>>>>>>>> files, Firefox and Thunderbird files,
>>>>>>>>>>>>>> etc.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks in advance, Norman
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 08.07.2014 11:46, Pranith Kumar Karampuri wrote:
>>>>>>>>>>>>>>> On 07/08/2014 02:46 PM, Norman Mähler wrote:
>>>>>>>>>>>>>>> Hello again,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I could resolve the self-heal problems
>>>>>>>>>>>>>>> with the missing gfid files on one of
>>>>>>>>>>>>>>> the servers by deleting the gfid files
>>>>>>>>>>>>>>> on the other server.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> They had a link count of 1, which means
>>>>>>>>>>>>>>> that the file the gfid pointed to was
>>>>>>>>>>>>>>> already deleted.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
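
(Background on the link-count check described above: for a regular file, the gfid entry under .glusterfs on a brick is a hard link to the named file, so a link count of 1 means the named file is already gone. A minimal sketch of the check, assuming the usual .glusterfs/<xx>/<yy>/<gfid> layout and using one of the gfids from the logs purely as an example:)

    # link count and inode number of the gfid file on the brick
    stat -c '%h %i' /mnt/raid/.glusterfs/b3/38/b338b09e-2577-45b3-82bd-032f954dd083
    # if the count is greater than 1, find the named file sharing that inode
    find /mnt/raid -samefile /mnt/raid/.glusterfs/b3/38/b338b09e-2577-45b3-82bd-032f954dd083 \
         -not -path '*/.glusterfs/*'
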
>>>>>>>>>>>>>>> We still have these errors,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> [2014-07-08 09:09:43.564488] W [client-rpc-fops.c:2469:client3_3_link_cbk] 0-gluster_dateisystem-client-0: remote operation failed: File exists (00000000-0000-0000-0000-000000000000 -> <gfid:b338b09e-2577-45b3-82bd-032f954dd083>/lock)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> which appear in the glusterfshd.log, and these,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> [2014-07-08 09:13:31.198462] E [client-rpc-fops.c:5179:client3_3_inodelk] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/cluster/replicate.so(+0x466b8) [0x7f5d29d4e6b8] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/cluster/replicate.so(afr_lock_blocking+0x844) [0x7f5d29d4e2e4] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/protocol/client.so(client_inodelk+0x99) [0x7f5d29f8b3c9]))) 0-: Assertion failed: 0
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> from the nfs.log.
>>>>>>>>>>>>>>>> Could you attach mount (nfs.log) and
>>>>>>>>>>>>>>>> brick logs please. Do you have files
>>>>>>>>>>>>>>>> with lots of hard-links? Pranith
>>>>>>>>>>>>>>> I think the error messages belong
>>>>>>>>>>>>>>> together but I don't have any idea how
>>>>>>>>>>>>>>> to solve them.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We still have a very bad performance
>>>>>>>>>>>>>>> issue. The system load on the servers
>>>>>>>>>>>>>>> is above 20, and nearly no one is able
>>>>>>>>>>>>>>> to work on a client here...
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hoping for help, Norman
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 07.07.2014 15:39, Pranith Kumar Karampuri wrote:
>>>>>>>>>>>>>>>>>> On 07/07/2014 06:58 PM, Norman Mähler wrote:
>>>>>>>>>>>>>>>>>> Dear community,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> we have got some serious problems
>>>>>>>>>>>>>>>>>> with our Gluster installation.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Here is the setting:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> We have got 2 bricks (version
>>>>>>>>>>>>>>>>>> 3.4.4) on Debian 7.5, one of
>>>>>>>>>>>>>>>>>> them with an NFS export. There
>>>>>>>>>>>>>>>>>> are about 120 clients connecting
>>>>>>>>>>>>>>>>>> to the exported NFS. These
>>>>>>>>>>>>>>>>>> clients are thin clients reading
>>>>>>>>>>>>>>>>>> and writing their Linux home
>>>>>>>>>>>>>>>>>> directories from the exported
>>>>>>>>>>>>>>>>>> NFS.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> We want to change the access of
>>>>>>>>>>>>>>>>>> these clients, one by one, to
>>>>>>>>>>>>>>>>>> access via the gluster client.
>>>>>>>>>>>>>>>>>>> I did not understand what you
>>>>>>>>>>>>>>>>>>> meant by this. Are you moving
>>>>>>>>>>>>>>>>>>> to glusterfs-fuse based
>>>>>>>>>>>>>>>>>>> mounts?
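
(For context, the two access paths being discussed look roughly like this; server name and mount point are placeholders, and the built-in Gluster NFS server only speaks NFSv3.)

    # current style: NFS mount against the Gluster NFS server on one brick node
    mount -t nfs -o vers=3,tcp filecluster1:/gluster_dateisystem /home
    # native (FUSE) client mount, which talks to both replicas directly
    mount -t glusterfs filecluster1:/gluster_dateisystem /home
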
>>>>>>>>>>>>>>>>>> Here are our problems:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> At the moment we have got two
>>>>>>>>>>>>>>>>>> types of error messages which
>>>>>>>>>>>>>>>>>> come in bursts to our
>>>>>>>>>>>>>>>>>> glusterfshd.log:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> [2014-07-07 13:10:21.572487] W [client-rpc-fops.c:1538:client3_3_inodelk_cbk] 0-gluster_dateisystem-client-1: remote operation failed: No such file or directory
>>>>>>>>>>>>>>>>>> [2014-07-07 13:10:21.573448] W [client-rpc-fops.c:471:client3_3_open_cbk] 0-gluster_dateisystem-client-1: remote operation failed: No such file or directory. Path: <gfid:b0c4f78a-249f-4db7-9d5b-0902c7d8f6cc> (00000000-0000-0000-0000-000000000000)
>>>>>>>>>>>>>>>>>> [2014-07-07 13:10:21.573468] E [afr-self-heal-data.c:1270:afr_sh_data_open_cbk] 0-gluster_dateisystem-replicate-0: open of <gfid:b0c4f78a-249f-4db7-9d5b-0902c7d8f6cc> failed on child gluster_dateisystem-client-1 (No such file or directory)
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> This looks like a missing gfid
>>>>>>>>>>>>>>>>>> file on one of the bricks. I
>>>>>>>>>>>>>>>>>> looked it up and yes the file is
>>>>>>>>>>>>>>>>>> missing on the second brick.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> We got these messages the other
>>>>>>>>>>>>>>>>>> way round, too (missing on
>>>>>>>>>>>>>>>>>> client-0 and the first brick).
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Is it possible to repair this one
>>>>>>>>>>>>>>>>>> by copying the gfid file to the
>>>>>>>>>>>>>>>>>> brick where it was missing? Or is
>>>>>>>>>>>>>>>>>> there another way to repair it?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The second message is
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> [2014-07-07 13:06:35.948738] W [client-rpc-fops.c:2469:client3_3_link_cbk] 0-gluster_dateisystem-client-1: remote operation failed: File exists (00000000-0000-0000-0000-000000000000 -> <gfid:aae47250-8f69-480c-ac75-2da2f4d21d7a>/lock)
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> and I really do not know what to do with this one...
>>>>>>>>>>>>>>>>>>> Did any of the bricks go
>>>>>>>>>>>>>>>>>>> offline and come back online?
>>>>>>>>>>>>>>>>>>> Pranith
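
(One way to check whether a brick dropped out, sketched with the standard CLI; the brick and client logs would also show disconnect messages.)

    gluster volume status gluster_dateisystem   # are both brick processes online right now?
    gluster peer status                         # is the other node still connected?
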
>>>>>>>>>>>>>>>>>> I am really looking forward to
>>>>>>>>>>>>>>>>>> your help because this is an
>>>>>>>>>>>>>>>>>> active system and the system load
>>>>>>>>>>>>>>>>>> on the NFS brick is about 25
>>>>>>>>>>>>>>>>>> (!!)
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks in advance! Norman
>>>>>>>>>>>>>>>>>> Maehler
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 

-- 
Kind regards,

Norman Mähler

Head of the IT-Hochschulservice division
uni-assist e. V.
Geneststr. 5
Aufgang H, 3. Etage
10829 Berlin

Tel.: 030-66644382
n.maehler@xxxxxxxxxxxxx
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users




