Re: Dispersed volumes won't heal on ARM


 



Filed a bug report. I was not able to reproduce the issue on x86 hardware.

https://bugzilla.redhat.com/show_bug.cgi?id=1811373
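
In case it helps triage, a rough sketch of how the crash details could be captured on an affected ARM node and attached to the bug. The brick log path follows the usual /var/log/glusterfs/bricks/ naming for this brick, and the coredumpctl step assumes systemd-coredump is installed; both are assumptions, not taken from the logs in this thread.

    # look for the crash banner glusterfsd writes to the brick log when it dies
    grep -B2 -A20 "signal received" /var/log/glusterfs/bricks/exports-sda-brick1-disp1.log
    # if systemd-coredump is available, pull a backtrace from the core
    coredumpctl info glusterfsd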



On Mon, Mar 2, 2020 at 1:58 AM Strahil Nikolov <hunter86_bg@xxxxxxxxx> wrote:
On March 2, 2020 3:29:06 AM GMT+02:00, Fox <foxxz.net@xxxxxxxxx> wrote:
>The brick is mounted. However, glusterfsd crashes shortly after startup.
>This happens on any host that needs to heal a dispersed volume.
>
>I spent today doing a clean rebuild of the cluster: a clean install of
>Ubuntu 18 and Gluster 7.2. Create a dispersed volume, then reboot one of
>the cluster members while the volume is up and online. When that cluster
>member comes back, it cannot heal.
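
As a rough sketch of that reproduction (the disperse/redundancy counts and the ssh reboot step are assumptions added for illustration; the exact creation command is in the terminal log attached earlier in this thread):

    # 12 bricks, one per node; the redundancy count here is a guess
    gluster volume create disp1 disperse 12 redundancy 4 \
        gluster{01..12}:/exports/sda/brick1/disp1
    gluster volume start disp1
    # reboot one member while the volume is online
    ssh gluster12 reboot
    # after it comes back, check whether it heals
    gluster volume heal disp1 info summary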
>
>I was able to replicate this behavior with Raspberry Pis running Raspbian
>and Gluster 5, so it looks like it's not limited to the specific hardware
>or Gluster version I'm using, but is perhaps an issue with the ARM
>architecture as a whole.
>
>Thank you for your help. Aside from not using dispersed volumes, I don't
>think there is much more I can do. Submit a bug report, I guess :)
>
>
>
>
>
>On Sun, Mar 1, 2020 at 12:02 PM Strahil Nikolov <hunter86_bg@xxxxxxxxx> wrote:
>
>> On March 1, 2020 6:22:59 PM GMT+02:00, Fox <foxxz.net@xxxxxxxxx> wrote:
>> >Yes, the brick was up and running, and I can see files on the brick
>> >created by connected clients up until the node was rebooted.
>> >
>> >This is what the volume status looks like after gluster12 was rebooted.
>> >Prior to reboot it showed as online and was otherwise operational.
>> >
>> >root@gluster01:~# gluster volume status
>> >Status of volume: disp1
>> >Gluster process                             TCP Port  RDMA Port  Online  Pid
>> >------------------------------------------------------------------------------
>> >Brick gluster01:/exports/sda/brick1/disp1   49152     0          Y       3931
>> >Brick gluster02:/exports/sda/brick1/disp1   49152     0          Y       2755
>> >Brick gluster03:/exports/sda/brick1/disp1   49152     0          Y       2787
>> >Brick gluster04:/exports/sda/brick1/disp1   49152     0          Y       2780
>> >Brick gluster05:/exports/sda/brick1/disp1   49152     0          Y       2764
>> >Brick gluster06:/exports/sda/brick1/disp1   49152     0          Y       2760
>> >Brick gluster07:/exports/sda/brick1/disp1   49152     0          Y       2740
>> >Brick gluster08:/exports/sda/brick1/disp1   49152     0          Y       2729
>> >Brick gluster09:/exports/sda/brick1/disp1   49152     0          Y       2772
>> >Brick gluster10:/exports/sda/brick1/disp1   49152     0          Y       2791
>> >Brick gluster11:/exports/sda/brick1/disp1   49152     0          Y       2026
>> >Brick gluster12:/exports/sda/brick1/disp1   N/A       N/A        N       N/A
>> >Self-heal Daemon on localhost               N/A       N/A        Y       3952
>> >Self-heal Daemon on gluster03               N/A       N/A        Y       2808
>> >Self-heal Daemon on gluster02               N/A       N/A        Y       2776
>> >Self-heal Daemon on gluster06               N/A       N/A        Y       2781
>> >Self-heal Daemon on gluster07               N/A       N/A        Y       2761
>> >Self-heal Daemon on gluster05               N/A       N/A        Y       2785
>> >Self-heal Daemon on gluster08               N/A       N/A        Y       2750
>> >Self-heal Daemon on gluster04               N/A       N/A        Y       2801
>> >Self-heal Daemon on gluster09               N/A       N/A        Y       2793
>> >Self-heal Daemon on gluster11               N/A       N/A        Y       2047
>> >Self-heal Daemon on gluster10               N/A       N/A        Y       2812
>> >Self-heal Daemon on gluster12               N/A       N/A        Y       542
>> >
>> >Task Status of Volume disp1
>> >------------------------------------------------------------------------------
>> >There are no active volume tasks
>> >
>> >On Sun, Mar 1, 2020 at 2:01 AM Strahil Nikolov <hunter86_bg@xxxxxxxxx> wrote:
>> >
>> >> On March 1, 2020 6:08:31 AM GMT+02:00, Fox <foxxz.net@xxxxxxxxx> wrote:
>> >> >I am using a dozen Odroid HC2 ARM systems, each with a single
>> >> >HD/brick, running Ubuntu 18 and GlusterFS 7.2 installed from the
>> >> >gluster PPA.
>> >> >
>> >> >I can create a dispersed volume and use it. If one of the cluster
>> >> >members ducks out, say gluster12 reboots, when it comes back online
>> >> >it shows as connected in the peer list, but using
>> >> >gluster volume heal <volname> info summary
>> >> >
>> >> >It shows up as
>> >> >Brick gluster12:/exports/sda/brick1/disp1
>> >> >Status: Transport endpoint is not connected
>> >> >Total Number of entries: -
>> >> >Number of entries in heal pending: -
>> >> >Number of entries in split-brain: -
>> >> >Number of entries possibly healing: -
>> >> >
>> >> >Trying to force a full heal doesn't fix it. The cluster member
>> >> >otherwise works and heals for other non-dispersed volumes, even
>> >> >while showing up as disconnected for the dispersed volume.
>> >> >
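For reference, "forcing a full heal" here would typically be something like the following; the exact invocation is an assumption, with disp1 being the volume name used throughout this thread.

    # trigger a full self-heal crawl, then re-check the summary
    gluster volume heal disp1 full
    gluster volume heal disp1 info summary
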
>> >> >I have attached a terminal log of the volume creation and
>> >> >diagnostic output. Could this be an ARM-specific problem?
>> >> >
>> >> >I tested a similar setup on x86 virtual machines. They were able to
>> >> >heal a dispersed volume with no problem. One thing I see in the ARM
>> >> >logs that I don't see in the x86 logs is lots of this:
>> >> >[2020-03-01 03:54:45.856769] W [MSGID: 122035] [ec-common.c:668:ec_child_select] 0-disp1-disperse-0: Executing operation with some subvolumes unavailable. (800). FOP : 'LOOKUP' failed on '(null)' with gfid 0d3c4cf3-e09c-4b9a-87d3-cdfc4f49b692
>> >> >[2020-03-01 03:54:45.910203] W [MSGID: 122035] [ec-common.c:668:ec_child_select] 0-disp1-disperse-0: Executing operation with some subvolumes unavailable. (800). FOP : 'LOOKUP' failed on '(null)' with gfid 0d806805-81e4-47ee-a331-1808b34949bf
>> >> >[2020-03-01 03:54:45.932734] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 0-disp1-client-11: changing port to 49152 (from 0)
>> >> >[2020-03-01 03:54:45.956803] W [MSGID: 122035] [ec-common.c:668:ec_child_select] 0-disp1-disperse-0: Executing operation with some subvolumes unavailable. (800). FOP : 'LOOKUP' failed on '(null)' with gfid d5768bad-7409-40f4-af98-4aef391d7ae4
>> >> >[2020-03-01 03:54:46.000102] W [MSGID: 122035] [ec-common.c:668:ec_child_select] 0-disp1-disperse-0: Executing operation with some subvolumes unavailable. (800). FOP : 'LOOKUP' failed on '(null)' with gfid 216f5583-e1b4-49cf-bef9-8cd34617beaf
>> >> >[2020-03-01 03:54:46.044184] W [MSGID: 122035] [ec-common.c:668:ec_child_select] 0-disp1-disperse-0: Executing operation with some subvolumes unavailable. (800). FOP : 'LOOKUP' failed on '(null)' with gfid 1b610b49-2d69-4ee6-a440-5d3edd6693d1
>> >>
>> >> Hi,
>> >>
>> >> Are you sure that the gluster bricks on this node are up and running?
>> >> What is the output of 'gluster volume status' on this system?
>> >>
>> >> Best Regards,
>> >> Strahil Nikolov
>> >>
>>
>> This seems like the brick is down.
>> Check with 'ps aux | grep glusterfsd | grep disp1' on 'gluster12'.
>> Most probably it is down and you need to verify that the brick is
>> properly mounted.
>>
>> Best Regards,
>> Strahil Nikolov
>>
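
A minimal sketch of those checks on gluster12; findmnt and the glusterd restart are additions for illustration, not something spelled out above.

    # is the brick process for disp1 running?
    ps aux | grep glusterfsd | grep disp1
    # is the underlying filesystem actually mounted?
    findmnt /exports/sda
    grep /exports/sda /proc/mounts
    # if the brick process is missing, restarting glusterd should respawn it
    systemctl restart glusterd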

Hi Fox,


Submit a bug and provide a link on the mailing list (add gluster-devel in CC once you register for that).
Most probably it's a small thing that can be easily fixed.

Have you tried:
gluster volume start <VOLNAME> force
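
Spelled out for the disp1 volume, with a couple of follow-up checks added for illustration (the status and heal commands are assumptions about the obvious next steps, not part of the suggestion above):

    # force-start so glusterd respawns any missing brick process
    gluster volume start disp1 force
    # confirm the gluster12 brick now shows Online = Y
    gluster volume status disp1
    # then kick off the heal again
    gluster volume heal disp1 full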

Best Regards,
Strahil Nikolov
________



Community Meeting Calendar:

Schedule -
Every Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users

