Re: Self-heal Problems with gluster and nfs


 




On 07/08/2014 06:14 PM, Norman Mähler wrote:



On 08.07.2014 14:34, Pranith Kumar Karampuri wrote:
On 07/08/2014 05:23 PM, Norman Mähler wrote:


On 08.07.2014 13:24, Pranith Kumar Karampuri wrote:
On 07/08/2014 04:49 PM, Norman Mähler wrote:


On 08.07.2014 13:02, Pranith Kumar Karampuri wrote:
On 07/08/2014 04:23 PM, Norman Mähler wrote:
Of course:

The configuration is:

Volume Name: gluster_dateisystem
Type: Replicate
Volume ID: 2766695c-b8aa-46fd-b84d-4793b7ce847a
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: filecluster1:/mnt/raid
Brick2: filecluster2:/mnt/raid
Options Reconfigured:
nfs.enable-ino32: on
performance.cache-size: 512MB
diagnostics.brick-log-level: WARNING
diagnostics.client-log-level: WARNING
nfs.addr-namelookup: off
performance.cache-refresh-timeout: 60
performance.cache-max-file-size: 100MB
performance.write-behind-window-size: 10MB
performance.io-thread-count: 18
performance.stat-prefetch: off
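
(For the record, this matches the output of "gluster volume info", so the same listing can be refreshed at any time with:)

    gluster volume info gluster_dateisystem   # prints the Volume Name / Bricks / Options listing above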


The file count in xattrop is
Do "gluster volume set gluster_dateisystem
cluster.self-heal-daemon off" This should stop all
the entry self-heals and should also get the CPU
usage low. When you don't have a lot of activity you
can enable it again using "gluster volume set
gluster_dateisystem cluster.self-heal-daemon on" If
it doesn't get the CPU down execute "gluster volume
set gluster_dateisystem cluster.entry-self-heal off".
Let me know how it goes. Pranith
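
(Put together as a sequence, the advice above amounts to roughly the following; the re-enable step during a quiet window is an assumption drawn from the same advice:)

    # stop the self-heal daemon so entry self-heals no longer eat CPU
    gluster volume set gluster_dateisystem cluster.self-heal-daemon off
    # if the CPU usage stays high, also turn off client-side entry self-heal
    gluster volume set gluster_dateisystem cluster.entry-self-heal off
    # later, in a window with little activity, turn both back on so pending heals can finish
    gluster volume set gluster_dateisystem cluster.entry-self-heal on
    gluster volume set gluster_dateisystem cluster.self-heal-daemon on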
Thanks for your help so far, but stopping the self-heal daemon and the self-heal mechanism itself did not improve the situation.

Do you have further suggestions? Is it simply the load on
the system? NFS could handle it easily before...
Is it at least a little better or no improvement at all?
After waiting half an hour more, the system load is falling steadily. At the moment it is around 10, which is not good but a lot better than before. There are no messages in the nfs.log and the glusterfshd.log anymore. In the brick log there are still "inode not found - anonymous fd creation failed" messages.
They should go away once the heal is complete and the system is back to normal. I believe you have directories with lots of files? When can you start the healing process again (i.e., a window where there won't be a lot of activity and you can afford the high CPU usage) so that things will be back to normal?
We have got a window at night, but by now our admin has decided to copy the files back to an NFS system, because even with disabled self-heal our colleagues cannot do their work with such a slow system.
This performance problem is addressed in 3.6 with a design change in the replication module of glusterfs.

After that we may be able to start again with a new system.
We are considering another clustered network system, but we are not quite sure what to do.

Things should be smooth again after the self-heals are complete, IMO. What is the size of the volume? How many files, approximately? It would be nice if you could give the complete logs, at least later, to help with the analysis.

Pranith
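
(One rough way to gather those numbers on a brick; this is only a sketch, the commands were not actually asked for in this form:)

    # approximate volume size (replica 2, so one brick holds a full copy)
    df -h /mnt/raid
    # approximate file count, skipping the .glusterfs gfid links
    find /mnt/raid -xdev -path '*/.glusterfs' -prune -o -type f -print | wc -l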


There are a lot of small files and lock files in these directories.

Norman


Pranith


Norman

Pranith
Norman

Brick 1: 2706
Brick 2: 2687

Norman

On 08.07.2014 12:28, Pranith Kumar Karampuri wrote:
It seems like entry self-heal is happening. What is the volume configuration? Could you give the count from "ls <brick-path>/.glusterfs/indices/xattrop | wc -l" for all the bricks?
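
(With the brick paths from the volume info quoted above, that would be, run on each server:)

    # on filecluster1, and again on filecluster2
    ls /mnt/raid/.glusterfs/indices/xattrop | wc -l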

Pranith
On 07/08/2014 03:36 PM, Norman Mähler wrote:
Hello Pranith,

here are the logs. I am only giving you the last 3000 lines, because the nfs.log from today is already 550 MB.

These are the standard files from a user home on the gluster system: everything you normally find in a user home, such as config files, Firefox and Thunderbird files, etc.

Thanks in advance,
Norman

On 08.07.2014 11:46, Pranith Kumar Karampuri wrote:
On 07/08/2014 02:46 PM, Norman Mähler wrote:
Hello again,

I could resolve the self-heal problems with the missing gfid files on one of the servers by deleting the gfid files on the other server.

They had a link count of 1, which means that the file the gfid pointed to had already been deleted.
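
(For reference, that check looks roughly like this on a brick; the .glusterfs path layout, first two bytes of the gfid as subdirectories, is the standard one, and the gfid here is just taken from the log further down:)

    # a hard-link count of 1 means the named file this gfid belonged to is gone
    stat -c '%h %n' /mnt/raid/.glusterfs/b0/c4/b0c4f78a-249f-4db7-9d5b-0902c7d8f6cc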


We still have these errors

[2014-07-08 09:09:43.564488] W [client-rpc-fops.c:2469:client3_3_link_cbk] 0-gluster_dateisystem-client-0: remote operation failed: File exists (00000000-0000-0000-0000-000000000000 -> <gfid:b338b09e-2577-45b3-82bd-032f954dd083>/lock)

which appear in the glusterfshd.log, and these

[2014-07-08 09:13:31.198462] E [client-rpc-fops.c:5179:client3_3_inodelk] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/cluster/replicate.so(+0x466b8) [0x7f5d29d4e6b8] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/cluster/replicate.so(afr_lock_blocking+0x844) [0x7f5d29d4e2e4] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/protocol/client.so(client_inodelk+0x99) [0x7f5d29f8b3c9]))) 0-: Assertion failed: 0

from the nfs.log.
Could you attach the mount (nfs.log) and brick logs, please? Do you have files with lots of hard links?
Pranith
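
(If it helps, one way to look for such files on a brick; note that every file on a brick already carries one extra hard link under .glusterfs, so genuinely hard-linked files show counts of 3 or more:)

    # list regular files with more than 2 links, i.e. at least two user-visible names
    find /mnt/raid -xdev -path '*/.glusterfs' -prune -o -type f -links +2 -printf '%n %p\n'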
I think the error messages belong together
but I don't have any idea how to solve them.

We still have a very bad performance issue. The system load on the servers is above 20, and hardly anyone is able to work on a client here...

Hoping for help,
Norman


On 07.07.2014 15:39, Pranith Kumar Karampuri wrote:
On 07/07/2014 06:58 PM, Norman Mähler wrote:
Dear community,

we have got some serious problems with
our Gluster installation.

Here is the setting:

We have got 2 bricks (version 3.4.4) on Debian 7.5, one of them with an NFS export. There are about 120 clients connecting to the exported NFS. These clients are thin clients reading and writing their Linux home directories from the exported NFS.

We want to change the access of these clients, one by one, to access via the gluster client.
I did not understand what you meant
by this. Are you moving to
glusterfs-fuse based mounts?
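
(For what it's worth, the two access paths would look roughly like this on a client; the mount point /home and the vers=3,nolock options are assumptions, since Gluster's built-in NFS server only speaks NFSv3:)

    # current setup: NFSv3 against the Gluster NFS server on filecluster1
    mount -t nfs -o vers=3,nolock filecluster1:/gluster_dateisystem /home
    # planned setup: native glusterfs (FUSE) mount, talking to both bricks directly
    mount -t glusterfs filecluster1:/gluster_dateisystem /home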
Here are our problems:

At the moment we have got two types of error messages which come in bursts to our glusterfshd.log

[2014-07-07 13:10:21.572487] W [client-rpc-fops.c:1538:client3_3_inodelk_cbk] 0-gluster_dateisystem-client-1: remote operation failed: No such file or directory
[2014-07-07 13:10:21.573448] W [client-rpc-fops.c:471:client3_3_open_cbk] 0-gluster_dateisystem-client-1: remote operation failed: No such file or directory. Path: <gfid:b0c4f78a-249f-4db7-9d5b-0902c7d8f6cc> (00000000-0000-0000-0000-000000000000)
[2014-07-07 13:10:21.573468] E [afr-self-heal-data.c:1270:afr_sh_data_open_cbk] 0-gluster_dateisystem-replicate-0: open of <gfid:b0c4f78a-249f-4db7-9d5b-0902c7d8f6cc> failed on child gluster_dateisystem-client-1 (No such file or directory)


This looks like a missing gfid file on one of the bricks. I looked it up, and yes, the file is missing on the second brick.

We got these messages the other way
round, too (missing on client-0 and the
first brick).

Is it possible to repair this one by copying the gfid file to the brick where it was missing? Or is there another way to repair it?
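
(Rather than copying gfid files around by hand, the usual route would be to let the self-heal machinery recreate the missing copy; this is only a sketch, not a confirmed fix for this particular case:)

    gluster volume heal gluster_dateisystem        # trigger healing of files that need it
    gluster volume heal gluster_dateisystem info   # list entries still pending heal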


The second message is

[2014-07-07 13:06:35.948738] W [client-rpc-fops.c:2469:client3_3_link_cbk] 0-gluster_dateisystem-client-1: remote operation failed: File exists (00000000-0000-0000-0000-000000000000 -> <gfid:aae47250-8f69-480c-ac75-2da2f4d21d7a>/lock)

and I really do not know what to do with this one...
Did any of the bricks go offline and come back online?
Pranith
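
(A quick way to check that, alongside the brick logs; this just shows whether the bricks, the NFS server and the self-heal daemon are currently online:)

    gluster volume status gluster_dateisystem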
I am really looking forward to your help, because this is an active system and the system load on the NFS brick is about 25 (!!)

Thanks in advance! Norman Maehler




--
Kind regards,

Norman Mähler

Head of Division IT-Hochschulservice
uni-assist e. V.
Geneststr. 5
Aufgang H, 3. Etage
10829 Berlin

Tel.: 030-66644382
n.maehler@xxxxxxxxxxxxx

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users




