Re: query about glusterfs 3.12-3 write-behind.c coredump

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On Wed, Jan 30, 2019 at 9:44 AM Raghavendra Gowdappa <rgowdapp@xxxxxxxxxx> wrote:


On Wed, Jan 30, 2019 at 7:35 AM Li, Deqian (NSB - CN/Hangzhou) <deqian.li@xxxxxxxxxxxxxxx> wrote:

Hi,

 

Could you help to check this coredump?

We are using glusterfs 3.12-3(3 replicated bricks solution ) to do stability testing under high CPU load like 80% by stress and doing I/O.

After several hours, coredump happened in glusterfs side .

 

[Current thread is 1 (Thread 0x7ffff37d2700 (LWP 3696))]

Missing separate debuginfos, use: dnf debuginfo-install rcp-pack-glusterfs-1.8.1_11_g99e9ca6-RCP2.wf28.x86_64

(gdb) bt

#0  0x00007ffff0d5c845 in wb_fulfill (wb_inode=0x7fffd406b3b0, liabilities=0x7fffdc234b50) at write-behind.c:1148

#1  0x00007ffff0d5e4d5 in wb_process_queue (wb_inode=0x7fffd406b3b0) at write-behind.c:1718

#2  0x00007ffff0d5eda7 in wb_writev (frame=0x7fffe0086290, this=0x7fffec014b00, fd=0x7fffe4034070, vector=0x7fffdc445720, count=1, offset=67108863, flags=32770, iobref=0x7fffdc00d550, xdata=0x0)

    at write-behind.c:1825

#3  0x00007ffff0b51fcb in du_writev_resume (ret=0, frame=0x7fffdc0305a0, opaque=0x7fffdc0305a0) at disk-usage.c:490

#4  0x00007ffff7b3510d in synctask_wrap () at syncop.c:377

#5  0x00007ffff60d0660 in ?? () from /lib64/libc.so.6

#6  0x0000000000000000 in ?? ()

(gdb) p wb_inode

$1 = (wb_inode_t *) 0x7fffd406b3b0

(gdb) frame 2

#2  0x00007ffff0d5eda7 in wb_writev (frame=0x7fffe0086290, this=0x7fffec014b00, fd=0x7fffe4034070, vector=0x7fffdc445720, count=1, offset=67108863, flags=32770, iobref=0x7fffdc00d550, xdata=0x0)

    at write-behind.c:1825

1825         in write-behind.c

(gdb) p *fd

$2 = {pid = 18154, flags = 32962, refcount = 0, inode_list = {next = 0x7fffe4034080, prev = 0x7fffe4034080}, inode = 0x0, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0,

        __nusers = 0, __kind = -1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>, "\377\377\377\377", '\000' <repeats 19 times>, __align = 0}},

  _ctx = 0x7fffe4022930, xl_count = 17, lk_ctx = 0x7fffe40350e0, anonymous = _gf_false}

(gdb) p fd

$3 = (fd_t *) 0x7fffe4034070

 

(gdb) p wb_inode->this

$1 = (xlator_t *) 0xffffffffffffff00

 

After adding test log I found the FOP  sequence in write-behind xlator side was mass as bellow showing.  In the FUSE side the FLUSH is after write2, but in the WB side, FLUSH is between write2 ‘wb_do_unwinds’ and ‘wb_fulfill’.

So I think this should has problem.


wb_do_unwinds the write after caching it.

* wb_do_unwinds unwind the write after caching it.

Once the write response is unwound, kernel issues a FLUSH. So, this is a valid sequence of operations and nothing wrong.

I think it’s possible that the FLUSH and later RELEASE operation will destroy the fd , it will cause ‘wb_in->this(0xffffffffffffff00)’. Do you think so?

And I think our new adding disk-usage xlator’s synctask_new will dealy the write operation, but the FLUSH operation without this delay(because not invoked the disk-usage xlator).


Flush and release wait for completion of any on-going writes. So, unless write-behind has unwound the write, it won't see a flush. By the time write is unwound, write-behind makes sure to take references on objects it uses (like fds, iobref etc). So, I don't see a problem there.

 

Do you agree with my speculation ? and how to fix?(we don’t want to move the disk-usage xlator)


I've still not found the RCA. We can discuss about the fix once RCA is found.

 

 

Problematic FOP sequence :

 

FUSE side:             WB side:

                     

Write 1                write1

Write2 do unwind

Write 2                FLUSH

                      Release(destroy fd)

FLUSH                write2  (wb_fulfill)  then coredump.

Release

 

 

int

wb_fulfill (wb_inode_t *wb_inode, list_head_t *liabilities)

{

         wb_request_t  *req     = NULL;

         wb_request_t  *head    = NULL;

         wb_request_t  *tmp     = NULL;

         wb_conf_t     *conf    = NULL;

         off_t          expected_offset = 0;

         size_t         curr_aggregate = 0;

         size_t         vector_count = 0;

        int            ret          = 0;

 

         conf = wb_inode->this->private;   à this line coredump

 

         list_for_each_entry_safe (req, tmp, liabilities, winds) {

                  list_del_init (&req->winds);

 

….

 

 

volume ccs-write-behind

68:     type performance/write-behind

69:     subvolumes ccs-dht

70: end-volume

71: 

 72: volume ccs-disk-usage                      à we add a new xlator here for write op ,just for checking if disk if full.  And synctask_new for write.

73:     type performance/disk-usage

74:     subvolumes ccs-write-behind

75: end-volume

76: 

 77: volume ccs-read-ahead

78:     type performance/read-ahead

79:     subvolumes ccs-disk-usage

80: end-volume

 

 

 

Ps. Part of  Our new translator code

 

int

du_writev (call_frame_t *frame, xlator_t *this, fd_t *fd,

            struct iovec *vector, int count, off_t off, uint32_t flags,

            struct iobref *iobref, dict_t *xdata)

{

    int           op_errno = -1;

    int           ret = -1;

    du_local_t  *local = NULL;

    loc_t          tmp_loc      = {0,};

 

    VALIDATE_OR_GOTO (frame, err);

    VALIDATE_OR_GOTO (this, err);

    VALIDATE_OR_GOTO (fd, err);

 

    tmp_loc.gfid[15] = 1;

    tmp_loc.inode = fd->inode;

    tmp_loc.parent = fd->inode;

    local = du_local_init (frame, &tmp_loc, fd, GF_FOP_WRITE);

    if (!local) {

 

            op_errno = ENOMEM;

            goto err;

    }

    local->vector = iov_dup (vector, count);

    local->offset = off;

    local->count = count;

    local->flags = flags;

    local->iobref = iobref_ref (iobref);

   

    ret = synctask_new(this->ctx->env, du_get_du_info,du_writev_resume,frame,frame);

    if(ret)

    {

            op_errno = -1;

            gf_log (this->name, GF_LOG_WARNING,"synctask_new return failure ret(%d)  ",ret);

            goto err;

    }

    return 0;

err:

    op_errno = (op_errno == -1) ? errno : op_errno;

    DU_STACK_UNWIND (writev, frame, -1, op_errno, NULL, NULL, NULL);

    return 0;

}

 

Br,

Li Deqian

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users

[Index of Archives]     [Gluster Development]     [Linux Filesytems Development]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux