Hi,
On Sat, Aug 4, 2018 at 3:19 AM Fox <foxxz.net@xxxxxxxxx> wrote:
Replying to the last batch of questions I've received...

To reiterate, I am only having problems writing files to disperse volumes when mounting them on an armhf system. Mounting the same volume on an x86-64 system works fine. Disperse volumes running on arm cannot heal. Replica volumes mount and heal just fine. All bricks are up and running. I have ensured connectivity and that the MTU is correct and identical.

Armhf is 32-bit:

# uname -a
Linux gluster01 4.14.55-146 #1 SMP PREEMPT Wed Jul 11 22:31:01 -03 2018 armv7l armv7l armv7l GNU/Linux

# file /bin/bash
/bin/bash: ELF 32-bit LSB shared object, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux 3.2.0, BuildID[sha1]=e0a53f804173b0cd9845bb8a76fee1a1e98a9759, stripped
# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.1 LTS
Release: 18.04
Codename: bionic

# free
total used free shared buff/cache available
Mem: 2042428 83540 1671004 6052 287884 1895684
Swap: 0 0 0

8 cores total: 4x running 2GHz and 4x running 1.4GHz.

processor : 0
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 24.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xc07
CPU revision : 3

processor : 4
model name : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 72.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc0f
CPU revision : 3

There IS a 98MB /core file from the fuse mount, so that's cool.

# file /core
/core: ELF 32-bit LSB core file ARM, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs --process-name fuse --volfile-server=gluster01 --volfile-id', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/sbin/glusterfs', platform: 'v7l'
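Since the quoted thread below also asks about endianness, that can be confirmed directly from the shell on a cluster node. A minimal sketch (the exact lscpu field wording may vary between util-linux versions):

lscpu | grep "Byte Order"    # ARMv7/armhf normally reports Little Endian
getconf LONG_BIT             # prints 32 for a 32-bit (armhf) userspace
file /usr/sbin/glusterfs     # shows whether the gluster binary itself is a 32-bit ELF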
One possible cause is some 64/32-bit inconsistency. If you have also installed the debug symbols and can provide a backtrace from the core dump, it would help identify the problem.
Xavi
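A minimal sketch of how that backtrace could be collected, assuming the PPA ships a debug-symbol package (the glusterfs-dbg package name is an assumption; it may be published as a -dbgsym package instead):

apt-get install gdb glusterfs-dbg    # package name assumed; adjust to what the PPA actually provides

# Dump backtraces for all threads of the crashed fuse client into a file
# that can be attached to the bug report.
gdb -batch -ex "thread apply all bt full" /usr/sbin/glusterfs /core > glusterfs-backtrace.txt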
I will try and get a bug report with logs filed over the weekend.

This is just an experimental home cluster. I don't have anything on it yet. It's possible I could grant someone SSH access to the cluster if it helps further the gluster project. But the results should be reproducible on something like a Raspberry Pi. I was hoping to run a dispersed volume on it eventually, otherwise I would never have found this issue.

Thank you for the troubleshooting ideas.

-Fox

On Fri, Aug 3, 2018 at 3:33 AM, Milind Changire <mchangir@xxxxxxxxxx> wrote:

What is the endianness of the armhf CPU?
Are you running a 32-bit or 64-bit operating system?

On Fri, Aug 3, 2018 at 9:51 AM, Fox <foxxz.net@xxxxxxxxx> wrote:

Just wondering if anyone else is running into the same behavior with disperse volumes described below and what I might be able to do about it.

I am using Ubuntu 18.04 LTS on Odroid HC-2 hardware (armhf) and have installed gluster 4.1.2 via PPA. I have 12 member nodes, each with a single brick. I can successfully create a working volume via the command:

gluster volume create testvol1 disperse 12 redundancy 4 gluster01:/exports/sda/brick1/testvol1 gluster02:/exports/sda/brick1/testvol1 gluster03:/exports/sda/brick1/testvol1 gluster04:/exports/sda/brick1/testvol1 gluster05:/exports/sda/brick1/testvol1 gluster06:/exports/sda/brick1/testvol1 gluster07:/exports/sda/brick1/testvol1 gluster08:/exports/sda/brick1/testvol1 gluster09:/exports/sda/brick1/testvol1 gluster10:/exports/sda/brick1/testvol1 gluster11:/exports/sda/brick1/testvol1 gluster12:/exports/sda/brick1/testvol1

And start the volume:

gluster volume start testvol1

Mounting the volume on an x86-64 system, it performs as expected. Mounting the same volume on an armhf system (such as one of the cluster members), I can create directories, but trying to create a file I get an error and the file system unmounts/crashes:

root@gluster01:~# mount -t glusterfs gluster01:/testvol1 /mnt
root@gluster01:~# cd /mnt
root@gluster01:/mnt# ls
root@gluster01:/mnt# mkdir test
root@gluster01:/mnt# cd test
root@gluster01:/mnt/test# cp /root/notes.txt ./
cp: failed to close './notes.txt': Software caused connection abort
root@gluster01:/mnt/test# ls
ls: cannot open directory '.': Transport endpoint is not connected

I get many of these in the glusterfsd.log:

The message "W [MSGID: 101088] [common-utils.c:4316:gf_backtrace_save] 0-management: Failed to save the backtrace." repeated 100 times between [2018-08-03 04:06:39.904166] and [2018-08-03 04:06:57.521895]

Furthermore, if a cluster member ducks out (reboots, loses connection, etc.) and needs healing, the self-heal daemon logs messages similar to the one above and cannot heal: there is no disk activity (verified via iotop) though CPU usage is very high, and the volume heal info command indicates the volume still needs healing.

I tested all of the above in virtual environments using x86-64 VMs and could self-heal as expected.

Again, this only happens when using disperse volumes. Should I be filing a bug report instead?
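For reference when filing the bug report, the heal state described above can be captured with the standard gluster CLI (no assumptions here beyond the volume name already used in this thread):

gluster volume status testvol1       # confirms whether all 12 bricks and the self-heal daemon are online
gluster volume heal testvol1 info    # lists the entries the self-heal daemon still considers pending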
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users