Re: Brick-Xlators crashes after Set-RO and Read

David Spisla <spisla80@xxxxxxxxx> · Fri, 17 May 2019 11:57:47 +0200

Hello Niels,

On Fri, 17 May 2019 at 11:35, Niels de Vos <ndevos@xxxxxxxxxx> wrote:
On Fri, May 17, 2019 at 11:17:52AM +0200, David Spisla wrote:

> Hello Niels,

> 

> On Fri, 17 May 2019 at 10:21, Niels de Vos <ndevos@xxxxxxxxxx> wrote:

> 

> > On Fri, May 17, 2019 at 09:50:28AM +0200, David Spisla wrote:

> > > Hello Vijay,

> > > thank you for the clarification. Yes, there is an unconditional
> > > dereference of stbuf. It seems plausible that this causes the crash.
> > > I think a check like this should help:

> > >

> > > if (buf == NULL) {

> > >         goto out;

> > > }

> > > map_atime_from_server(this, buf);

> > >

> > > Is there a reason why buf can be NULL?

> >

> > It seems LOOKUP returned an error (errno=13: EACCES: Permission denied).

> > This is probably something you need to handle in worm_lookup_cbk. There

> > can be many reasons for a FOP to return an error, why it happened in

> > this case is a little difficult to say without (much) more details.

> >

> Yes, I will look for a way to handle that case.

> Is it intended that the struct stbuf is NULL when an error happens?

Yes, in most error cases it will not be possible to get a valid
stbuf.
I will add a check like this, assuming that in case of an error
op_errno != 0 and op_ret == -1:

if (buf == NULL || op_errno != 0 || op_ret == -1) {
        goto out;
}

map_atime_from_server(this, buf);

Does this fit?
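
To make the idea concrete, here is a minimal, self-contained sketch of such a guard. It is not the real worm.c code: `struct iatt_sketch`, `map_atime_from_server_sketch()` and `lookup_cbk_sketch()` are hypothetical stand-ins, and the atime arithmetic is only a placeholder. It assumes that a failed FOP is signalled by a negative op_ret and that buf may be NULL in that case, as in the backtrace (op_ret=-1, op_errno=13, buf=0x0).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for GlusterFS's struct iatt; only one field. */
struct iatt_sketch {
    long ia_atime;
};

/* Stand-in for the custom helper. Like the original, it dereferences
 * stbuf unconditionally, so callers must never pass NULL. */
static void
map_atime_from_server_sketch(struct iatt_sketch *stbuf)
{
    assert(stbuf != NULL);
    stbuf->ia_atime += 1; /* placeholder for the real atime mapping */
}

/* Guarded callback body: returns 1 if the helper ran, 0 if it was
 * skipped because the FOP failed or no stat buffer was supplied. */
static int
lookup_cbk_sketch(int op_ret, int op_errno, struct iatt_sketch *buf)
{
    (void)op_errno; /* kept only to mirror the callback's arguments */

    if (op_ret < 0 || buf == NULL) {
        return 0; /* error path: leave buf alone */
    }

    map_atime_from_server_sketch(buf);
    return 1;
}
```

Checking op_ret and buf is enough to avoid the NULL dereference; in the real callback the error path would still have to hand the failure on to the parent translator rather than drop it.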
Regards
David

Niels

> 

> Regards

> David Spisla

> 

> 

> > HTH,

> > Niels

> >

> >

> > >

> > > Regards

> > > David Spisla

> > >

> > >

> > > On Fri, 17 May 2019 at 01:51, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:

> > >

> > > > Hello David,

> > > >

> > > > From the backtrace it looks like stbuf is NULL in map_atime_from_server()
> > > > as worm_lookup_cbk has got an error (op_ret = -1, op_errno = 13). Can you
> > > > please check if there is an unconditional dereference of stbuf in
> > > > map_atime_from_server()?

> > > >

> > > > Regards,

> > > > Vijay

> > > >

> > > > On Thu, May 16, 2019 at 2:36 AM David Spisla <spisla80@xxxxxxxxx> wrote:

> > > >

> > > >> Hello Vijay,

> > > >>

> > > >> yes, we are using custom patches. It's a helper function, which is
> > > >> defined in xlator_helper.c and used in worm_lookup_cbk.
> > > >> Do you think this could be the problem? The function only manipulates
> > > >> the atime in struct iatt.

> > > >>

> > > >> Regards

> > > >> David Spisla

> > > >>

> > > >> On Thu, 16 May 2019 at 10:05, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:

> > > >>

> > > >>> Hello David,

> > > >>>

> > > >>> Do you have any custom patches in your deployment? I looked up v5.5 but
> > > >>> could not find the following functions referred to in the core:

> > > >>>

> > > >>> map_atime_from_server()

> > > >>> worm_lookup_cbk()

> > > >>>

> > > >>> Neither do I see xlator_helper.c in the codebase.

> > > >>>

> > > >>> Thanks,

> > > >>> Vijay

> > > >>>

> > > >>>

> > > >>> #0  map_atime_from_server (this=0x7fdef401af00, stbuf=0x0) at ../../../../xlators/lib/src/xlator_helper.c:21
> > > >>>         __FUNCTION__ = "map_to_atime_from_server"
> > > >>> #1  0x00007fdef39a0382 in worm_lookup_cbk (frame=frame@entry=0x7fdeac0015c8, cookie=<optimized out>, this=0x7fdef401af00, op_ret=op_ret@entry=-1, op_errno=op_errno@entry=13,
> > > >>>     inode=inode@entry=0x0, buf=0x0, xdata=0x0, postparent=0x0) at worm.c:531
> > > >>>         priv = 0x7fdef4075378
> > > >>>         ret = 0
> > > >>>         __FUNCTION__ = "worm_lookup_cbk"

> > > >>>

> > > >>> On Thu, May 16, 2019 at 12:53 AM David Spisla <spisla80@xxxxxxxxx> wrote:

> > > >>>

> > > >>>> Hello Vijay,

> > > >>>>

> > > >>>> I could reproduce the issue. After doing a simple directory listing from
> > > >>>> Win10 PowerShell, all brick processes crash. It's not the same scenario
> > > >>>> mentioned before, but the crash report in the brick logs is the same.
> > > >>>> Attached you will find the backtrace.

> > > >>>>

> > > >>>> Regards

> > > >>>> David Spisla

> > > >>>>

> > > >>>> On Tue, 7 May 2019 at 20:08, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:

> > > >>>>

> > > >>>>> Hello David,

> > > >>>>>

> > > >>>>> On Tue, May 7, 2019 at 2:16 AM David Spisla <spisla80@xxxxxxxxx>

> > > >>>>> wrote:

> > > >>>>>

> > > >>>>>> Hello Vijay,

> > > >>>>>>

> > > >>>>>> how can I create such a core file? Or will it be created

> > > >>>>>> automatically if a gluster process crashes?

> > > >>>>>> Maybe you can give me a hint and I will try to get a backtrace.

> > > >>>>>>

> > > >>>>>

> > > >>>>> Generation of a core file depends on the system configuration.
> > > >>>>> `man 5 core` contains useful information on generating a core file in a
> > > >>>>> directory. Once a core file is generated, you can use gdb to get a
> > > >>>>> backtrace of all threads (using "thread apply all bt full").

> > > >>>>>

> > > >>>>>

> > > >>>>>> Unfortunately this bug is not easy to reproduce because it appears

> > > >>>>>> only sometimes.

> > > >>>>>>

> > > >>>>>

> > > >>>>> If the bug is not easy to reproduce, having a backtrace from the

> > > >>>>> generated core would be very useful!

> > > >>>>>

> > > >>>>> Thanks,

> > > >>>>> Vijay

> > > >>>>>

> > > >>>>>

> > > >>>>>>

> > > >>>>>> Regards

> > > >>>>>> David Spisla

> > > >>>>>>

> > > >>>>>> On Mon, 6 May 2019 at 19:48, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:

> > > >>>>>>

> > > >>>>>>> Thank you for the report, David. Do you have core files available on
> > > >>>>>>> any of the servers? If yes, would it be possible for you to provide a
> > > >>>>>>> backtrace?

> > > >>>>>>>

> > > >>>>>>> Regards,

> > > >>>>>>> Vijay

> > > >>>>>>>

> > > >>>>>>> On Mon, May 6, 2019 at 3:09 AM David Spisla <spisla80@xxxxxxxxx>

> > > >>>>>>> wrote:

> > > >>>>>>>

> > > >>>>>>>> Hello folks,

> > > >>>>>>>>

> > > >>>>>>>> we have a client application (running on Win10) which does some FOPs
> > > >>>>>>>> on a gluster volume which is accessed via SMB.
> > > >>>>>>>>
> > > >>>>>>>> *Scenario 1* is a READ operation which reads all files
> > > >>>>>>>> successively and checks whether the file data was copied correctly. While
> > > >>>>>>>> doing this, all brick processes crash, and every brick log contains this
> > > >>>>>>>> crash report:

> > > >>>>>>>>

> > > >>>>>>>>> CTX_ID:a0359502-2c76-4fee-8cb9-365679dc690e-GRAPH_ID:0-PID:32934-HOST:XX-XXXXX-XX-XX-PC_NAME:shortterm-client-2-RECON_NO:-0, gfid: 00000000-0000-0000-0000-000000000001, req(uid:2000,gid:2000,perm:1,ngrps:1), ctx(uid:0,gid:0,in-groups:0,perm:700,updated-fop:LOOKUP, acl:-) [Permission denied]

> > > >>>>>>>>> pending frames:

> > > >>>>>>>>> frame : type(0) op(27)

> > > >>>>>>>>> frame : type(0) op(40)

> > > >>>>>>>>> patchset: git://git.gluster.org/glusterfs.git

> > > >>>>>>>>> signal received: 11

> > > >>>>>>>>> time of crash:

> > > >>>>>>>>> 2019-04-16 08:32:21

> > > >>>>>>>>> configuration details:

> > > >>>>>>>>> argp 1

> > > >>>>>>>>> backtrace 1

> > > >>>>>>>>> dlfcn 1

> > > >>>>>>>>> libpthread 1

> > > >>>>>>>>> llistxattr 1

> > > >>>>>>>>> setfsid 1

> > > >>>>>>>>> spinlock 1

> > > >>>>>>>>> epoll.h 1

> > > >>>>>>>>> xattr.h 1

> > > >>>>>>>>> st_atim.tv_nsec 1

> > > >>>>>>>>> package-string: glusterfs 5.5

> > > >>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f9a5bd4d64c]
> > > >>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f9a5bd57d26]
> > > >>>>>>>>> /lib64/libc.so.6(+0x361a0)[0x7f9a5af141a0]
> > > >>>>>>>>> /usr/lib64/glusterfs/5.5/xlator/features/worm.so(+0xb910)[0x7f9a4ef0e910]
> > > >>>>>>>>> /usr/lib64/glusterfs/5.5/xlator/features/worm.so(+0x8118)[0x7f9a4ef0b118]
> > > >>>>>>>>> /usr/lib64/glusterfs/5.5/xlator/features/locks.so(+0x128d6)[0x7f9a4f1278d6]
> > > >>>>>>>>> /usr/lib64/glusterfs/5.5/xlator/features/access-control.so(+0x575b)[0x7f9a4f35975b]
> > > >>>>>>>>> /usr/lib64/glusterfs/5.5/xlator/features/locks.so(+0xb3b3)[0x7f9a4f1203b3]
> > > >>>>>>>>> /usr/lib64/glusterfs/5.5/xlator/features/worm.so(+0x85b2)[0x7f9a4ef0b5b2]
> > > >>>>>>>>> /usr/lib64/libglusterfs.so.0(default_lookup+0xbc)[0x7f9a5bdd7b6c]
> > > >>>>>>>>> /usr/lib64/libglusterfs.so.0(default_lookup+0xbc)[0x7f9a5bdd7b6c]
> > > >>>>>>>>> /usr/lib64/glusterfs/5.5/xlator/features/upcall.so(+0xf548)[0x7f9a4e8cf548]
> > > >>>>>>>>> /usr/lib64/libglusterfs.so.0(default_lookup_resume+0x1e2)[0x7f9a5bdefc22]
> > > >>>>>>>>> /usr/lib64/libglusterfs.so.0(call_resume+0x75)[0x7f9a5bd733a5]
> > > >>>>>>>>> /usr/lib64/glusterfs/5.5/xlator/performance/io-threads.so(+0x6088)[0x7f9a4e6b7088]
> > > >>>>>>>>> /lib64/libpthread.so.0(+0x7569)[0x7f9a5b29f569]
> > > >>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7f9a5afd69af]

> > > >>>>>>>>>

> > > >>>>>>>> *Scenario 2*: The application just sets Read-Only on each file
> > > >>>>>>>> successively. After the 70th file was set, all the bricks crash, and again
> > > >>>>>>>> one can read this crash report in every brick log:

> > > >>>>>>>>

> > > >>>>>>>>> [2019-05-02 07:43:39.953591] I [MSGID: 139001] [posix-acl.c:263:posix_acl_log_permit_denied] 0-longterm-access-control: client: CTX_ID:21aa9c75-3a5f-41f9-925b-48e4c80bd24a-GRAPH_ID:0-PID:16325-HOST:XXX-X-X-XXX-PC_NAME:longterm-client-0-RECON_NO:-0, gfid: 00000000-0000-0000-0000-000000000001, req(uid:2000,gid:2000,perm:1,ngrps:1), ctx(uid:0,gid:0,in-groups:0,perm:700,updated-fop:LOOKUP, acl:-) [Permission denied]
> > > >>>>>>>>> pending frames:
> > > >>>>>>>>> frame : type(0) op(27)
> > > >>>>>>>>> patchset: git://git.gluster.org/glusterfs.git
> > > >>>>>>>>> signal received: 11
> > > >>>>>>>>> time of crash:
> > > >>>>>>>>> 2019-05-02 07:43:39
> > > >>>>>>>>> configuration details:
> > > >>>>>>>>> argp 1
> > > >>>>>>>>> backtrace 1
> > > >>>>>>>>> dlfcn 1
> > > >>>>>>>>> libpthread 1
> > > >>>>>>>>> llistxattr 1
> > > >>>>>>>>> setfsid 1
> > > >>>>>>>>> spinlock 1
> > > >>>>>>>>> epoll.h 1
> > > >>>>>>>>> xattr.h 1
> > > >>>>>>>>> st_atim.tv_nsec 1
> > > >>>>>>>>> package-string: glusterfs 5.5
> > > >>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fbb3f0b364c]
> > > >>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fbb3f0bdd26]
> > > >>>>>>>>> /lib64/libc.so.6(+0x361e0)[0x7fbb3e27a1e0]
> > > >>>>>>>>> /usr/lib64/glusterfs/5.5/xlator/features/worm.so(+0xb910)[0x7fbb32257910]
> > > >>>>>>>>> /usr/lib64/glusterfs/5.5/xlator/features/worm.so(+0x8118)[0x7fbb32254118]
> > > >>>>>>>>> /usr/lib64/glusterfs/5.5/xlator/features/locks.so(+0x128d6)[0x7fbb324708d6]
> > > >>>>>>>>> /usr/lib64/glusterfs/5.5/xlator/features/access-control.so(+0x575b)[0x7fbb326a275b]
> > > >>>>>>>>> /usr/lib64/glusterfs/5.5/xlator/features/locks.so(+0xb3b3)[0x7fbb324693b3]
> > > >>>>>>>>> /usr/lib64/glusterfs/5.5/xlator/features/worm.so(+0x85b2)[0x7fbb322545b2]
> > > >>>>>>>>> /usr/lib64/libglusterfs.so.0(default_lookup+0xbc)[0x7fbb3f13db6c]
> > > >>>>>>>>> /usr/lib64/libglusterfs.so.0(default_lookup+0xbc)[0x7fbb3f13db6c]
> > > >>>>>>>>> /usr/lib64/glusterfs/5.5/xlator/features/upcall.so(+0xf548)[0x7fbb31c18548]
> > > >>>>>>>>> /usr/lib64/libglusterfs.so.0(default_lookup_resume+0x1e2)[0x7fbb3f155c22]
> > > >>>>>>>>> /usr/lib64/libglusterfs.so.0(call_resume+0x75)[0x7fbb3f0d93a5]
> > > >>>>>>>>> /usr/lib64/glusterfs/5.5/xlator/performance/io-threads.so(+0x6088)[0x7fbb31a00088]
> > > >>>>>>>>> /lib64/libpthread.so.0(+0x7569)[0x7fbb3e605569]
> > > >>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fbb3e33c9ef]

> > > >>>>>>>>

> > > >>>>>>>> This happens on a 3-node Gluster v5.5 cluster on two different
> > > >>>>>>>> volumes, but both volumes have the same settings:

> > > >>>>>>>>

> > > >>>>>>>>> Volume Name: shortterm

> > > >>>>>>>>> Type: Replicate

> > > >>>>>>>>> Volume ID: 5307e5c5-e8a1-493a-a846-342fb0195dee

> > > >>>>>>>>> Status: Started

> > > >>>>>>>>> Snapshot Count: 0

> > > >>>>>>>>> Number of Bricks: 1 x 3 = 3

> > > >>>>>>>>> Transport-type: tcp

> > > >>>>>>>>> Bricks:

> > > >>>>>>>>> Brick1: fs-xxxxx-c1-n1:/gluster/brick4/glusterbrick

> > > >>>>>>>>> Brick2: fs-xxxxx-c1-n2:/gluster/brick4/glusterbrick

> > > >>>>>>>>> Brick3: fs-xxxxx-c1-n3:/gluster/brick4/glusterbrick

> > > >>>>>>>>> Options Reconfigured:

> > > >>>>>>>>> storage.reserve: 1

> > > >>>>>>>>> performance.client-io-threads: off

> > > >>>>>>>>> nfs.disable: on

> > > >>>>>>>>> transport.address-family: inet

> > > >>>>>>>>> user.smb: disable

> > > >>>>>>>>> features.read-only: off

> > > >>>>>>>>> features.worm: off

> > > >>>>>>>>> features.worm-file-level: on

> > > >>>>>>>>> features.retention-mode: enterprise

> > > >>>>>>>>> features.default-retention-period: 120

> > > >>>>>>>>> network.ping-timeout: 10

> > > >>>>>>>>> features.cache-invalidation: on

> > > >>>>>>>>> features.cache-invalidation-timeout: 600

> > > >>>>>>>>> performance.nl-cache: on

> > > >>>>>>>>> performance.nl-cache-timeout: 600

> > > >>>>>>>>> client.event-threads: 32

> > > >>>>>>>>> server.event-threads: 32

> > > >>>>>>>>> cluster.lookup-optimize: on

> > > >>>>>>>>> performance.stat-prefetch: on

> > > >>>>>>>>> performance.cache-invalidation: on

> > > >>>>>>>>> performance.md-cache-timeout: 600

> > > >>>>>>>>> performance.cache-samba-metadata: on

> > > >>>>>>>>> performance.cache-ima-xattrs: on

> > > >>>>>>>>> performance.io-thread-count: 64

> > > >>>>>>>>> cluster.use-compound-fops: on

> > > >>>>>>>>> performance.cache-size: 512MB

> > > >>>>>>>>> performance.cache-refresh-timeout: 10

> > > >>>>>>>>> performance.read-ahead: off

> > > >>>>>>>>> performance.write-behind-window-size: 4MB

> > > >>>>>>>>> performance.write-behind: on

> > > >>>>>>>>> storage.build-pgfid: on

> > > >>>>>>>>> features.utime: on

> > > >>>>>>>>> storage.ctime: on

> > > >>>>>>>>> cluster.quorum-type: fixed

> > > >>>>>>>>> cluster.quorum-count: 2

> > > >>>>>>>>> features.bitrot: on

> > > >>>>>>>>> features.scrub: Active

> > > >>>>>>>>> features.scrub-freq: daily

> > > >>>>>>>>> cluster.enable-shared-storage: enable

> > > >>>>>>>>>

> > > >>>>>>>>>

> > > >>>>>>>> Why can this happen to all brick processes? I don't understand the
> > > >>>>>>>> crash report. The FOPs are nothing special, and after restarting the brick
> > > >>>>>>>> processes everything works fine and our application succeeded.

> > > >>>>>>>>

> > > >>>>>>>> Regards

> > > >>>>>>>> David Spisla

> > > >>>>>>>>

> > > >>>>>>>>

> > > >>>>>>>>

> > > >>>>>>>> _______________________________________________

> > > >>>>>>>> Gluster-users mailing list

> > > >>>>>>>> Gluster-users@xxxxxxxxxxx

> > > >>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users

> > > >>>>>>>

> > > >>>>>>>

> >


> >

> >
