Re: Replacing a node in a 4x2 distributed/replicated setup

Atin Mukherjee <atin.mukherjee83@xxxxxxxxx> · Fri, 30 Oct 2015 22:26:36 +0530

This could very well be related to op-version. Could you look at the glusterd log on the faulty node and check the error entries? That would give us the exact reason for the failure.
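A minimal sketch of how that check might be done, assuming a default package install: the paths /var/lib/glusterd/glusterd.info and /var/log/glusterfs/etc-glusterfs-glusterd.vol.log are the usual locations on 3.x but may differ on your systems, and the two helper names are made up for illustration.

```shell
#!/bin/sh
# Sketch only. Both helpers take a file path so they can be pointed at
# whatever location your distribution actually uses.

# Print the op-version recorded in a glusterd.info file
# (assumed default: /var/lib/glusterd/glusterd.info).
glusterd_op_version() {
    sed -n 's/^operating-version=//p' "$1"
}

# Print only the error-level entries of a glusterd log, i.e. lines of the
# form "[timestamp] E [source] ..."
# (assumed default: /var/log/glusterfs/etc-glusterfs-glusterd.vol.log).
glusterd_errors() {
    grep -E '^\[[^]]*\] E \[' "$1"
}

# Usage, run on each node and compare the results:
#   glusterd_op_version /var/lib/glusterd/glusterd.info
#   glusterd_errors /var/log/glusterfs/etc-glusterfs-glusterd.vol.log | tail -n 20
```

If the op-version printed on the rejected node differs from the one on the healthy nodes, that would be consistent with the suspicion above; the error entries should say more.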
-Atin

On Oct 30, 2015 5:35 PM, "Thomas Bätzler" <t.baetzler@xxxxxxxxxx> wrote:
Hi,

Can somebody please help me fix our 8-node Gluster cluster?

Setup is as follows:

root@glucfshead2:~# gluster volume info

Volume Name: archive
Type: Distributed-Replicate
Volume ID: d888b302-2a35-4559-9bb0-4e182f49f9c6
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: glucfshead1:/data/glusterfs/archive/brick1
Brick2: glucfshead5:/data/glusterfs/archive/brick1
Brick3: glucfshead2:/data/glusterfs/archive/brick1
Brick4: glucfshead6:/data/glusterfs/archive/brick1
Brick5: glucfshead3:/data/glusterfs/archive/brick1
Brick6: glucfshead7:/data/glusterfs/archive/brick1
Brick7: glucfshead4:/data/glusterfs/archive/brick1
Brick8: glucfshead8:/data/glusterfs/archive/brick1
Options Reconfigured:
cluster.data-self-heal: off
cluster.entry-self-heal: off
cluster.metadata-self-heal: off
features.lock-heal: on
cluster.readdir-optimize: on
performance.flush-behind: off
performance.io-thread-count: 16
features.quota: off
performance.quick-read: on
performance.stat-prefetch: off
performance.io-cache: on
performance.cache-refresh-timeout: 1
nfs.disable: on
performance.cache-max-file-size: 200kb
performance.cache-size: 2GB
performance.write-behind-window-size: 4MB
performance.read-ahead: off
storage.linux-aio: off
diagnostics.brick-sys-log-level: WARNING
cluster.self-heal-daemon: off

Volume Name: archive2
Type: Distributed-Replicate
Volume ID: 0fe86e42-e67f-46d8-8ed0-d0e34f539d69
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: glucfshead1:/data/glusterfs/archive2/brick1
Brick2: glucfshead5:/data/glusterfs/archive2/brick1
Brick3: glucfshead2:/data/glusterfs/archive2/brick1
Brick4: glucfshead6:/data/glusterfs/archive2/brick1
Brick5: glucfshead3:/data/glusterfs/archive2/brick1
Brick6: glucfshead7:/data/glusterfs/archive2/brick1
Brick7: glucfshead4:/data/glusterfs/archive2/brick1
Brick8: glucfshead8:/data/glusterfs/archive2/brick1
Options Reconfigured:
cluster.metadata-self-heal: off
cluster.entry-self-heal: off
cluster.data-self-heal: off
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
features.lock-heal: on
diagnostics.brick-sys-log-level: WARNING
storage.linux-aio: off
performance.read-ahead: off
performance.write-behind-window-size: 4MB
performance.cache-size: 2GB
performance.cache-max-file-size: 200kb
nfs.disable: on
performance.cache-refresh-timeout: 1
performance.io-cache: on
performance.stat-prefetch: off
performance.quick-read: on
features.quota: off
performance.io-thread-count: 16
performance.flush-behind: off
auth.allow: 172.16.15.*
cluster.readdir-optimize: on
cluster.self-heal-daemon: off
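
For orientation: as far as I know, in a Distributed-Replicate volume the bricks are grouped into replica sets in the order they are listed, so with "4 x 2" Brick1 and Brick2 (glucfshead1 and glucfshead5) should mirror each other. A small sketch that prints this grouping from the listing above; the function name replica_sets is hypothetical and the replica count has to be passed in by hand.

```shell
#!/bin/sh
# Group the "BrickN:" lines of `gluster volume info` output into replica sets.
# $1 = replica count (2 for a "4 x 2" volume); the output is read from stdin.
# Assumes bricks are listed in replica-set order, as described above.
replica_sets() {
    awk -v r="$1" '
        /^Volume Name:/ { vol = $3; n = 0 }      # new volume: restart counting
        /^Brick[0-9]+:/ {
            grp = int(n / r) + 1                 # every r bricks form one set
            printf "%s set %d: %s\n", vol, grp, $2
            n++
        }'
}

# Usage:
#   gluster volume info | replica_sets 2
```

Bricks printed under the same set number are the ones expected to hold copies of the same data.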

Some time ago, node glucfshead1 broke down. After some fiddling it was
decided not to deal with that immediately, because the gluster was in
production and a rebuild on 3.4 would basically have rendered it unusable.

Recently it was felt that we needed to deal with the situation, and we
hired some experts to help with the problem. So we reinstalled the
broken node, gave it a new name/IP, and upgraded all systems to 3.6.4.

The plan was to probe the "new" node into the gluster and then do a
replace-brick onto it. However, that did not go as expected.

The node that we removed is now listed as "Peer Rejected":

root@glucfshead2:~# gluster peer status
Number of Peers: 7

Hostname: glucfshead1
Uuid: 09ed9a29-c923-4dc5-957a-e0d3e8032daf
State: Peer Rejected (Disconnected)

Hostname: glucfshead3
Uuid: a17ae95d-4598-4cd7-9ae7-808af10fedb5
State: Peer in Cluster (Connected)

Hostname: glucfshead4
Uuid: 8547dadd-96bf-45fe-b49d-bab8f995c928
State: Peer in Cluster (Connected)

Hostname: glucfshead5
Uuid: 249da8ea-fda6-47ff-98e0-dbff99dcb3f2
State: Peer in Cluster (Connected)

Hostname: glucfshead6
Uuid: a0229511-978c-4904-87ae-7e1b32ac2c72
State: Peer in Cluster (Connected)

Hostname: glucfshead7
Uuid: 548ec75a-0131-4c92-aaa9-7c6ee7b47a63
State: Peer in Cluster (Connected)

Hostname: glucfshead8
Uuid: 5e54cbc1-482c-460b-ac38-00c4b71c50b9
State: Peer in Cluster (Connected)
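
Since each node keeps its own view of its peers, it may help to run the same check on every node and compare. A rough sketch that filters the output above down to the peers that are not in the normal state; the function name unhealthy_peers is made up, and the field layout is assumed to match the listing shown here.

```shell
#!/bin/sh
# Read `gluster peer status` output on stdin and print every peer whose
# state is anything other than "Peer in Cluster (Connected)".
unhealthy_peers() {
    awk '
        /^Hostname:/ { host = $2 }               # remember the current peer
        /^State:/ {
            state = $0
            sub(/^State:[ \t]*/, "", state)      # keep only the state text
            if (state != "Peer in Cluster (Connected)")
                printf "%s: %s\n", host, state
        }'
}

# Usage:
#   gluster peer status | unhealthy_peers
```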

If I probe the replacement node (glucfshead9), it only ever shows up on
one of my running nodes, and there it is in state "Peer Rejected (Connected)".

How can we fix this - preferably without losing data?

TIA,

Thomas


_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users