Hi,
Recently there was a crash in locks translator (BZ
1103347, BZ 1097102) with the following backtrace:
(gdb) bt
#0 uuid_unpack (in=0x8 <Address 0x8 out of bounds>,
uu=0x7fffea6c6a60) at ../../contrib/uuid/unpack.c:44
#1 0x00007feeba9e19d6 in uuid_unparse_x (uu=<value
optimized out>, out=0x2350fc0
"081bbc7a-7551-44ac-85c7-aad5e2633db9",
fmt=0x7feebaa08e00
"%08x-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x") at
../../contrib/uuid/unparse.c:55
#2 0x00007feeba9be837 in uuid_utoa (uuid=0x8 <Address
0x8 out of bounds>) at common-utils.c:2138
#3 0x00007feeb06e8a58 in pl_inodelk_log_cleanup
(this=0x230d910, ctx=0x7fee700f0c60) at inodelk.c:396
#4 pl_inodelk_client_cleanup (this=0x230d910,
ctx=0x7fee700f0c60) at inodelk.c:428
#5 0x00007feeb06ddf3a in pl_client_disconnect_cbk
(this=0x230d910, client=<value optimized out>) at
posix.c:2550
#6 0x00007feeba9fa2dd in gf_client_disconnect
(client=0x27724a0) at client_t.c:368
#7 0x00007feeab77ed48 in server_connection_cleanup
(this=0x2316390, client=0x27724a0, flags=<value optimized
out>) at server-helpers.c:354
#8 0x00007feeab77ae2c in server_rpc_notify (rpc=<value
optimized out>, xl=0x2316390, event=<value optimized
out>, data="" at server.c:527
#9 0x00007feeba775155 in rpcsvc_handle_disconnect
(svc=0x2325980, trans=0x2bf51c0) at rpcsvc.c:720
#10 0x00007feeba776c30 in rpcsvc_notify (trans=0x2bf51c0,
mydata=<value optimized out>, event=<value
optimized out>, data="" at rpcsvc.c:758
#11 0x00007feeba778638 in rpc_transport_notify
(this=<value optimized out>, event=<value optimized
out>, data="" optimized out>) at
rpc-transport.c:512
#12 0x00007feeb115e971 in socket_event_poll_err
(fd=<value optimized out>, idx=<value optimized
out>, data="" poll_in=<value optimized
out>, poll_out=0,
poll_err=0) at socket.c:1071
#13 socket_event_handler (fd=<value optimized out>,
idx=<value optimized out>, data=""
poll_in=<value optimized out>, poll_out=0, poll_err=0)
at socket.c:2240
#14 0x00007feeba9fc6a7 in event_dispatch_epoll_handler
(event_pool=0x22e2d00) at event-epoll.c:384
#15 event_dispatch_epoll (event_pool=0x22e2d00) at
event-epoll.c:445
#16 0x0000000000407e93 in main (argc=19,
argv=0x7fffea6c7f88) at glusterfsd.c:2023
(gdb) f 4
#4 pl_inodelk_client_cleanup (this=0x230d910,
ctx=0x7fee700f0c60) at inodelk.c:428
428 pl_inodelk_log_cleanup (l);
(gdb) p l->pl_inode->refkeeper
$1 = (inode_t *) 0x0
(gdb)
pl_inode->refkeeper was found to be NULL even when
there were some blocked inodelks in a certain domain of the
inode,
which when dereferenced by the epoll thread in the
cleanup codepath led to a crash.
On inspecting the code (for want of a consistent
reproducer), three things were found:
1. The function where the crash happens
(pl_inodelk_log_cleanup()), makes an attempt to resolve the
inode to path as can be seen below. But the way inode_path()
itself
works is to first construct the path based on the
given inode's ancestry and place it in the buffer provided.
And if all else fails, the gfid of the inode is placed in a
certain format ("<gfid:%s>").
This eliminates the need for statements from line 4
through 7 below, thereby "preventing" dereferencing of
pl_inode->refkeeper.
Now, although this change prevents the crash
altogether, it still does not fix the race that led to
pl_inode->refkeeper becoming NULL, and comes at the cost
of
printing "(null)" in the log message on line 9 every
time pl_inode->refkeeper is found to be NULL, rendering
the logged messages somewhat useless.
<code>
0 pl_inode = lock->pl_inode;
1
2 inode_path (pl_inode->refkeeper, NULL,
&path);
3
4 if (path)
5 file = path;
6 else
7 file = uuid_utoa
(pl_inode->refkeeper->gfid);
8
9 gf_log (THIS->name,
GF_LOG_WARNING,
10 "releasing lock on %s held by
"
11 "{client=%p, pid=%"PRId64"
lk-owner=%s}",
12 file, lock->client, (uint64_t)
lock->client_pid,
13 lkowner_utoa (&lock->owner));
<\code>