On Fri, Mar 2, 2018 at 11:01 AM, Ravishankar N <ravishankar@xxxxxxxxxx> wrote:
Debugging further,
On 03/02/2018 10:11 AM, Ravishankar N wrote:
+ Anoop. I see this in a 2x1 plain distribute volume also. I see ENOTCONN for the upgraded brick on the old client:
It looks like clients on the old (3.12) nodes are not able to talk to the upgraded (4.0) node. I see messages like these on the old clients:
[2018-03-02 03:49:13.483458] W [MSGID: 114007] [client-handshake.c:1197:client_setvolume_cbk] 0-testvol-client-2: failed to find key 'clnt-lk-version' in the options
[2018-03-02 04:58:54.559446] E [MSGID: 114058] [client-handshake.c:1571:client_query_portmap_cbk] 0-testvol-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2018-03-02 04:58:54.559618] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-testvol-client-1: disconnected from testvol-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2018-03-02 04:58:56.973199] I [rpc-clnt.c:1994:rpc_clnt_reconfig] 0-testvol-client-1: changing port to 49152 (from 0)
[2018-03-02 04:58:56.975844] I [MSGID: 114057] [client-handshake.c:1484:select_server_supported_programs] 0-testvol-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2018-03-02 04:58:56.978114] W [MSGID: 114007] [client-handshake.c:1197:client_setvolume_cbk] 0-testvol-client-1: failed to find key 'clnt-lk-version' in the options
[2018-03-02 04:58:46.618036] E [MSGID: 114031] [client-rpc-fops.c:2768:client3_3_opendir_cbk] 0-testvol-client-1: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
The message "W [MSGID: 114031] [client-rpc-fops.c:2577:client3_3_readdirp_cbk] 0-testvol-client-1: remote operation failed [Transport endpoint is not connected]" repeated 3 times between [2018-03-02 04:58:46.609529] and [2018-03-02 04:58:46.618683]
Also, mkdir fails on the old mount with EIO, even though the directory is physically created on both bricks. Can the rpc folks offer a helping hand?
Sometimes glusterfs returns a wrong ia_type (IA_IFIFO, to be
precise) in the response to mkdir. This is the reason for the failure. Note that
the mkdir response from glusterfs says it was successful, but with a wrong
iatt. That's the reason we see the directories created on the bricks.
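For orientation before the gdb walk-through below, here is a paraphrased sketch of the callback being stepped through. It is reconstructed around the source lines visible in the session (677, 685, 692-696) and is not the verbatim 3.12 code; any helper or field name not present in the transcript is from memory and may differ. The point to note is that the brick's reply carries the directory's iatt as an opaque binary value in xdata, and dht_iatt_merge() folds it into local->stbuf, so a bogus ia_type from one subvolume corrupts the stat that mkdir eventually unwinds with.

/* Paraphrased sketch of dht_selfheal_dir_xattr_cbk (dht-selfheal.c); NOT the
 * verbatim source. Names not present in the gdb transcript are from memory. */
int
dht_selfheal_dir_xattr_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                            int op_ret, int op_errno, dict_t *xdata)
{
        dht_local_t *local  = frame->local;
        xlator_t    *subvol = cookie;   /* the brick that replied */
        struct iatt *stbuf  = NULL;
        int          ret    = -1;

        if (xdata)
                /* line 677: the brick returns the directory's iatt as an
                 * opaque binary blob inside xdata */
                ret = dict_get_bin (xdata, DHT_IATT_IN_XDATA_KEY,
                                    (void **) &stbuf);

        /* line 685: per-subvolume layout bookkeeping (elided here) */

        if (!ret && stbuf) {
                LOCK (&frame->lock);                            /* line 692 */
                {
                        /* line 694: fold this subvolume's iatt into the
                         * accumulated stat; below, a correct IA_IFDIR in
                         * local->stbuf gets overwritten by the IA_IFIFO
                         * received from testvol-client-1 */
                        dht_iatt_merge (this, &local->stbuf, stbuf, subvol);
                }
                UNLOCK (&frame->lock);                          /* line 696 */
        }

        /* unwind to the mkdir caller once all subvolumes reply (elided) */
        return 0;
}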
On debugging further, in dht_selfheal_dir_xattr_cbk, which gets executed as part of mkdir in dht:
(gdb)
677 ret = dict_get_bin (xdata, DHT_IATT_IN_XDATA_KEY, (void **) &stbuf);
(gdb)
692 LOCK (&frame->lock);
(gdb)
694 dht_iatt_merge (this, &local->stbuf, stbuf, subvol);
(gdb) p stbuf
$16 = (struct iatt *) 0x7f84e405aaf0
(gdb) p *stbuf
$17 = {ia_ino = 6143, ia_gfid = "\222\064\301\225~6v\242\021\b\000\000\000\000\000",
ia_dev = 0, ia_type = IA_IFIFO, ia_prot = {suid = 0 '\000', sgid = 0
'\000', sticky = 0 '\000', owner = {read = 0 '\000',
write = 0 '\000', exec = 0 '\000'}, group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}}, ia_nlink = 2, ia_uid = 0, ia_gid = 0,
ia_rdev = 0, ia_size = 1520570685, ia_blksize = 1520570529, ia_blocks = 1520570714, ia_atime = 0, ia_atime_nsec = 0, ia_mtime = 172390349, ia_mtime_nsec = 475585538, ia_ctime = 626110118, ia_ctime_nsec = 0}
(gdb) p local->stbuf
$18 = {ia_ino = 11706604198702429330, ia_gfid = "e\223\246pH\005F\226\242v6~\225\301\064\222", ia_dev = 2065, ia_type = IA_IFDIR, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {
read = 1 '\001', write = 1 '\001', exec = 1 '\001'}, group = {read = 1 '\001', write = 0 '\000', exec = 1 '\001'}, other = {read = 1 '\001', write = 0 '\000', exec = 1 '\001'}}, ia_nlink = 2, ia_uid = 0,
ia_gid = 0, ia_rdev = 0, ia_size = 4096, ia_blksize = 4096, ia_blocks = 8, ia_atime = 1520570529, ia_atime_nsec = 475585538, ia_mtime = 1520570529, ia_mtime_nsec = 475585538, ia_ctime = 1520570529,
ia_ctime_nsec = 475585538}
(gdb) n
696 UNLOCK (&frame->lock);
(gdb) p local->stbuf
$19 = {ia_ino = 6143, ia_gfid = "\222\064\301\225~6v\242\021\b\000\000\000\000\000",
ia_dev = 0, ia_type = IA_IFIFO, ia_prot = {suid = 0 '\000', sgid = 0
'\000', sticky = 0 '\000', owner = {read = 0 '\000',
write = 0 '\000', exec = 0 '\000'}, group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}}, ia_nlink = 2, ia_uid = 0, ia_gid = 0,
ia_rdev = 0, ia_size = 1520574781, ia_blksize = 1520570529, ia_blocks = 1520570722, ia_atime = 1520570529, ia_atime_nsec = 475585538, ia_mtime = 1520570529, ia_mtime_nsec = 475585538, ia_ctime = 1520570529,
ia_ctime_nsec = 475585538}
So, we got the correct iatt during mkdir, but a wrong one while trying to set the layout on the directory.
(gdb) p *stbuf
$26 = {ia_ino = 6143, ia_gfid = "L\rk\212\367\275\"\256\021\b\000\000\000\000\000",
ia_dev = 0, ia_type = IA_IFIFO, ia_prot = {suid = 0 '\000', sgid = 0
'\000', sticky = 0 '\000', owner = {read = 0 '\000',
write = 0 '\000', exec = 0 '\000'}, group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}}, ia_nlink = 2, ia_uid = 0, ia_gid = 0,
ia_rdev = 0, ia_size = 1520571192, ia_blksize = 1520571192, ia_blocks = 1520571192, ia_atime = 0, ia_atime_nsec = 0, ia_mtime = 87784021, ia_mtime_nsec = 87784021, ia_ctime = 92784143, ia_ctime_nsec = 0}
(gdb) up
#1 0x00007f84eae8ead1 in client3_3_setxattr_cbk (req=0x7f84e0008130, iov=0x7f84e0008170, count=1, myframe=0x7f84e0008d80) at client-rpc-fops.c:1013
1013 CLIENT_STACK_UNWIND (setxattr, frame, rsp.op_ret, op_errno, xdata);
(gdb) p this->name
$27 = 0x7f84e4009190 "testvol-client-1"
Breakpoint 12, dht_selfheal_dir_xattr_cbk (frame=0x7f84dc006a00, cookie=0x7f84e4007c50, this=0x7f84e400ce80, op_ret=0, op_errno=0, xdata=0x7f84e00017a0) at dht-selfheal.c:685
685 for (i = 0; i < layout->cnt; i++) {
(gdb) p *stbuf
$28 = {ia_ino = 12547800382684466508, ia_gfid = "\020{mk\200\067Kq\256\"\275\367\212k\rL", ia_dev = 2065, ia_type = IA_IFDIR, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {
read = 1 '\001', write = 1 '\001', exec = 1 '\001'}, group = {read = 1 '\001', write = 0 '\000', exec = 1 '\001'}, other = {read = 1 '\001', write = 0 '\000', exec = 1 '\001'}}, ia_nlink = 2, ia_uid = 0,
ia_gid = 0, ia_rdev = 0, ia_size = 6, ia_blksize = 4096, ia_blocks = 0, ia_atime = 1520571192, ia_atime_nsec = 90026323, ia_mtime = 1520571192, ia_mtime_nsec = 90026323, ia_ctime = 1520571192,
ia_ctime_nsec = 94026420}
(gdb) up
#1 0x00007f84eae8ead1 in client3_3_setxattr_cbk (req=0x7f84e000a5f0, iov=0x7f84e000a630, count=1, myframe=0x7f84e000aa00) at client-rpc-fops.c:1013
1013 CLIENT_STACK_UNWIND (setxattr, frame, rsp.op_ret, op_errno, xdata);
(gdb) p this->name
$29 = 0x7f84e4008810 "testvol-client-0"
As can be seen above, it is always the new brick (testvol-client-1) that returns
the wrong iatt with ia_type IA_IFIFO; the old brick (testvol-client-0) returns
the correct iatt.
We need to debug further to find out what, in client-1 (whose brick is
running 4.0), resulted in the wrong iatt. Note that the iatt is obtained
from the dictionary (xdata), so the dictionary changes in 4.0 are one
suspect.
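To make the dictionary suspicion concrete: DHT_IATT_IN_XDATA_KEY carries the iatt as a raw binary blob (the dict_get_bin call above), so if the size or layout of that blob differs between what the 4.0 brick packs and what the 3.12 side expects, fields shift and you get exactly this pattern of garbage, e.g. epoch-looking values such as 1520570685 landing in ia_size/ia_blocks. The standalone demo below is purely illustrative of that failure mode under an assumed layout change; the struct definitions are invented for the demo and are not the real 3.12 or 4.0 struct iatt.

/* Illustrative only: how an opaque struct blob in a dict can get garbled when
 * the sender's and receiver's struct layouts differ. The two structs below
 * are invented for the demo; they are NOT the real 3.12/4.0 struct iatt. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct iatt_sender {            /* hypothetical newer layout */
        uint64_t ia_ino;
        uint64_t ia_flags;      /* extra field not present on the receiver */
        uint32_t ia_type;       /* 4 == "directory" in this toy encoding */
        uint64_t ia_size;
        uint64_t ia_atime;      /* epoch seconds */
};

struct iatt_receiver {          /* hypothetical older layout */
        uint64_t ia_ino;
        uint32_t ia_type;
        uint64_t ia_size;
        uint64_t ia_atime;
};

int
main (void)
{
        struct iatt_sender   sent = { .ia_ino = 42, .ia_flags = 1,
                                      .ia_type = 4, .ia_size = 4096,
                                      .ia_atime = 1520570685 };
        struct iatt_receiver seen = { 0 };

        /* dict_get_bin() just hands back a pointer to raw bytes; if the
         * receiver interprets them with its own (older) layout, every field
         * after the first mismatch is shifted. */
        memcpy (&seen, &sent, sizeof (seen));

        printf ("sent: type=%u size=%llu atime=%llu\n",
                (unsigned) sent.ia_type, (unsigned long long) sent.ia_size,
                (unsigned long long) sent.ia_atime);
        printf ("seen: type=%u size=%llu atime=%llu  <- garbage\n",
                (unsigned) seen.ia_type, (unsigned long long) seen.ia_size,
                (unsigned long long) seen.ia_atime);
        return 0;
}

If that is indeed what is happening, diffing how the 3.12 and 4.0 sides pack and unpack the value for DHT_IATT_IN_XDATA_KEY should confirm it quickly.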
Thanks to Ravi for providing a live setup, which made my life easy :).
-Ravi
Is there something more to be done on BZ 1544366?
-Ravi
On 03/02/2018 08:44 AM, Ravishankar N wrote:
On 03/02/2018 07:26 AM, Shyam Ranganathan wrote:
Hi Pranith/Ravi,
So, to keep a long story short: after upgrading 1 node in a 3-node 3.13
cluster, self-heal is not able to catch up with the heal backlog. This is a
very simple synthetic test, but the end result is that upgrade
testing is failing.
Let me try this now and get back. I had done something similar when testing the FIPS patch and the rolling upgrade had worked.
Thanks,
Ravi
Here are the details,
- Using
https://hackmd.io/GYIwTADCDsDMCGBaArAUxAY0QFhBAbIgJwCMySIwJmAJvGMBvNEA#
I set up 3 server containers to install 3.13 first, as follows (within the
containers):
(inside the 3 server containers)
yum -y update; yum -y install centos-release-gluster313; yum install
glusterfs-server; glusterd
(inside centos-glfs-server1)
gluster peer probe centos-glfs-server2
gluster peer probe centos-glfs-server3
gluster peer status
gluster v create patchy replica 3 centos-glfs-server1:/d/brick1
centos-glfs-server2:/d/brick2 centos-glfs-server3:/d/brick3
centos-glfs-server1:/d/brick4 centos-glfs-server2:/d/brick5
centos-glfs-server3:/d/brick6 force
gluster v start patchy
gluster v status
Create a client container as per the document above, mount the above
volume, and create 1 file, 1 directory, and a file within that directory.
Now we start the upgrade process (as laid out for 3.13 here
http://docs.gluster.org/en/latest/Upgrade-Guide/upgrade_to_3.13/):
- killall glusterfs glusterfsd glusterd
- yum install
http://cbs.centos.org/kojifiles/work/tasks/1548/311548/centos-release-gluster40-0.9-1.el7.centos.x86_64.rpm
- yum upgrade --enablerepo=centos-gluster40-test glusterfs-server
< Go back to the client and edit the contents of one of the files and
change the permissions of a directory, so that there are things to heal
when we bring up the newly upgraded server>
- gluster --version
- glusterd
- gluster v status
- gluster v heal patchy
The above starts failing as follows,
[root@centos-glfs-server1 /]# gluster v heal patchy
Launching heal operation to perform index self heal on volume patchy has
been unsuccessful:
Commit failed on centos-glfs-server2.glfstest20. Please check log file
for details.
Commit failed on centos-glfs-server3. Please check log file for details.
From here, if further files or directories are created from the client,
they just get added to the heal backlog, and heal does not catch up.
As is obvious, I cannot proceed, as the upgrade procedure is broken. The
issue itself may not be the self-heal daemon but something around
connections; however, as the process fails here, I am looking to you guys to
unblock this as soon as possible, as we are already running a day's slip
in the release.
Thanks,
Shyam