Thanks for pointing that out, I will check it.
On Thu, Apr 27, 2017 at 1:51 PM, Edvin Ekström <edvin.ekstrom@xxxxxxxxxxx> wrote:
I've encountered the same issue; however, in my case it seems to have been caused by a kernel bug that was present between 4.4.0-58 and 4.4.0-63 (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1655842). Seeing that you are running 4.4.0-62, I would suggest upgrading and checking whether the error persists.
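
A rough sketch of checking and upgrading the kernel on Ubuntu 16.04, in case it helps (the metapackage name is an assumption and may differ for your install):

    uname -r                                  # confirm the running kernel, e.g. 4.4.0-62-generic
    sudo apt-get update
    sudo apt-get install linux-image-generic  # assumed metapackage pulling in a kernel newer than 4.4.0-63
    sudo reboot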
Edvin Ekström

On 2017-04-26 09:09, Amudhan P wrote:
I did volume start force and now the self-heal daemon is up on the node that was down.
But bitrot has now triggered the crawling process on all nodes. Why is it crawling the disks again if the process was already running?
[output from bitd.log]
[2017-04-13 06:01:23.930089] I [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2017-04-26 06:51:46.998935] I [MSGID: 100030] [glusterfsd.c:2460:main] 0-/usr/local/sbin/glusterfs: Started running /usr/local/sbin/glusterfs version 3.10.1 (args: /usr/local/sbin/glusterfs -s localhost --volfile-id gluster/bitd -p /var/lib/glusterd/bitd/run/bitd.pid -l /var/log/glusterfs/bitd.log -S /var/run/gluster/02f1dd346d47b9006f9bf64e347338fd.socket --global-timer-wheel)
[2017-04-26 06:51:47.002732] I [MSGID: 101190] [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
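
To confirm whether the crawl is actually running, commands along these lines should show the bitrot daemons and scrub progress (<volname> is a placeholder):

    gluster volume status <volname>               # lists the Bitrot and Scrubber daemons per node
    gluster volume bitrot <volname> scrub status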
On Tue, Apr 25, 2017 at 11:01 PM, Amudhan P <amudhan83@xxxxxxxxx> wrote:
Yes, I have enabled the bitrot process, and the signer process is currently running on some nodes.
Disabling and enabling bitrot doesn't make a difference; it will start the crawl process again, right?
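
For reference, the toggle in question would be something like (<volname> is a placeholder):

    gluster volume bitrot <volname> disable
    gluster volume bitrot <volname> enable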
On Tuesday, April 25, 2017, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
>
>
> On Tue, Apr 25, 2017 at 9:22 PM, Amudhan P <amudhan83@xxxxxxxxx> wrote:
>>
>> Hi Pranith,
>> If I restart the glusterd service on that node alone, will it work? Because I feel that doing volume start force will trigger the bitrot process to crawl the disks on all nodes.
>
> Have you enabled bitrot? If not, then the process will not exist. As a workaround, you can always disable this option before executing volume start force. Please note that volume start force doesn't affect any running processes.
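>
> A minimal sketch of that workaround, with <volname> as a placeholder:
>
>     gluster volume bitrot <volname> disable
>     gluster volume start <volname> force
>     gluster volume bitrot <volname> enable    # re-enable only once you are ready for signing to resume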
>
>>
>> Yes, the rebalance fix-layout is in progress.
>> regards
>> Amudhan
>>
>> On Tue, Apr 25, 2017 at 9:15 PM, Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:
>>>
>>> You can restart the process using:
>>> gluster volume start <volname> force
>>>
>>> Did shd on this node heal a lot of data? Based on the kind of memory usage it showed, it seems like there is a leak.
>>>
>>>
>>> Sunil,
>>> Could you check whether there are any leaks in this particular version that we might have missed in our testing?
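>>>
>>> A statedump of the shd process would help compare its allocations over time; roughly like this (the pgrep pattern and dump location are assumptions):
>>>
>>>     kill -USR1 $(pgrep -f glustershd)   # asks the process to write a statedump
>>>     ls /var/run/gluster/*.dump.*        # default statedump directory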
>>>
>>> On Tue, Apr 25, 2017 at 8:37 PM, Amudhan P <amudhan83@xxxxxxxxx> wrote:
>>>>
>>>> Hi,
>>>> On one of my nodes the glustershd process was killed due to OOM, and this happened on only one node out of a 40-node cluster.
>>>> The node is running Ubuntu 16.04.2.
>>>> dmesg output:
>>>> [Mon Apr 24 17:21:38 2017] nrpe invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
>>>> [Mon Apr 24 17:21:38 2017] nrpe cpuset=/ mems_allowed=0
>>>> [Mon Apr 24 17:21:38 2017] CPU: 0 PID: 12626 Comm: nrpe Not tainted 4.4.0-62-generic #83-Ubuntu
>>>> [Mon Apr 24 17:21:38 2017] 0000000000000286 00000000fc26b170 ffff88048bf27af0 ffffffff813f7c63
>>>> [Mon Apr 24 17:21:38 2017] ffff88048bf27cc8 ffff88082a663c00 ffff88048bf27b60 ffffffff8120ad4e
>>>> [Mon Apr 24 17:21:38 2017] ffff88087781a870 ffff88087781a860 ffffea0011285a80 0000000100000001
>>>> [Mon Apr 24 17:21:38 2017] Call Trace:
>>>> [Mon Apr 24 17:21:38 2017] [<ffffffff813f7c63>] dump_stack+0x63/0x90
>>>> [Mon Apr 24 17:21:38 2017] [<ffffffff8120ad4e>] dump_header+0x5a/0x1c5
>>>> [Mon Apr 24 17:21:38 2017] [<ffffffff811926c2>] oom_kill_process+0x202/0x3c0
>>>> [Mon Apr 24 17:21:38 2017] [<ffffffff81192ae9>] out_of_memory+0x219/0x460
>>>> [Mon Apr 24 17:21:38 2017] [<ffffffff81198a5d>] __alloc_pages_slowpath.constprop.88+0x8fd/0xa70
>>>> [Mon Apr 24 17:21:38 2017] [<ffffffff81198e56>] __alloc_pages_nodemask+0x286/0x2a0
>>>> [Mon Apr 24 17:21:38 2017] [<ffffffff81198f0b>] alloc_kmem_pages_node+0x4b/0xc0
>>>> [Mon Apr 24 17:21:38 2017] [<ffffffff8107ea5e>] copy_process+0x1be/0x1b70
>>>> [Mon Apr 24 17:21:38 2017] [<ffffffff8122d013>] ? __fd_install+0x33/0xe0
>>>> [Mon Apr 24 17:21:38 2017] [<ffffffff81713d01>] ? release_sock+0x111/0x160
>>>> [Mon Apr 24 17:21:38 2017] [<ffffffff810805a0>] _do_fork+0x80/0x360
>>>> [Mon Apr 24 17:21:38 2017] [<ffffffff8122429c>] ? SyS_select+0xcc/0x110
>>>> [Mon Apr 24 17:21:38 2017] [<ffffffff81080929>] SyS_clone+0x19/0x20
>>>> [Mon Apr 24 17:21:38 2017] [<ffffffff818385f2>] entry_SYSCALL_64_fastpath+0x16/0x71
>>>> [Mon Apr 24 17:21:38 2017] Mem-Info:
>>>> [Mon Apr 24 17:21:38 2017] active_anon:553952 inactive_anon:206987 isolated_anon:0
>>>> active_file:3410764 inactive_file:3460179 isolated_file:0
>>>> unevictable:4914 dirty:212868 writeback:0 unstable:0
>>>> slab_reclaimable:386621 slab_unreclaimable:31829
>>>> mapped:6112 shmem:211 pagetables:6178 bounce:0
>>>> free:82623 free_pcp:213 free_cma:0
>>>> [Mon Apr 24 17:21:38 2017] Node 0 DMA free:15880kB min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15964kB managed:15880kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
>>>> [Mon Apr 24 17:21:38 2017] lowmem_reserve[]: 0 1868 31944 31944 31944
>>>> [Mon Apr 24 17:21:38 2017] Node 0 DMA32 free:133096kB min:3948kB low:4932kB high:5920kB active_anon:170764kB inactive_anon:206296kB active_file:394236kB inactive_file:525288kB unevictable:980kB isolated(anon):0kB isolated(file):0kB present:2033596kB managed:1952976kB mlocked:980kB dirty:1552kB writeback:0kB mapped:3904kB shmem:724kB slab_reclaimable:502176kB slab_unreclaimable:8916kB kernel_stack:1952kB pagetables:1408kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>>>> [Mon Apr 24 17:21:38 2017] lowmem_reserve[]: 0 0 30076 30076 30076
>>>> [Mon Apr 24 17:21:38 2017] Node 0 Normal free:181516kB min:63600kB low:79500kB high:95400kB active_anon:2045044kB inactive_anon:621652kB active_file:13248820kB inactive_file:13315428kB unevictable:18676kB isolated(anon):0kB isolated(file):0kB present:31322112kB managed:30798036kB mlocked:18676kB dirty:849920kB writeback:0kB mapped:20544kB shmem:120kB slab_reclaimable:1044308kB slab_unreclaimable:118400kB kernel_stack:33792kB pagetables:23304kB unstable:0kB bounce:0kB free_pcp:852kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>>>> [Mon Apr 24 17:21:38 2017] lowmem_reserve[]: 0 0 0 0 0
>>>> [Mon Apr 24 17:21:38 2017] Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15880kB
>>>> [Mon Apr 24 17:21:38 2017] Node 0 DMA32: 18416*4kB (UME) 7480*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 133504kB
>>>> [Mon Apr 24 17:21:38 2017] Node 0 Normal: 44972*4kB (UMEH) 13*8kB (EH) 13*16kB (H) 13*32kB (H) 8*64kB (H) 2*128kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 181384kB
>>>> [Mon Apr 24 17:21:38 2017] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
>>>> [Mon Apr 24 17:21:38 2017] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>>>> [Mon Apr 24 17:21:38 2017] 6878703 total pagecache pages
>>>> [Mon Apr 24 17:21:38 2017] 2484 pages in swap cache
>>>> [Mon Apr 24 17:21:38 2017] Swap cache stats: add 3533870, delete 3531386, find 3743168/4627884
>>>> [Mon Apr 24 17:21:38 2017] Free swap = 14976740kB
>>>> [Mon Apr 24 17:21:38 2017] Total swap = 15623164kB
>>>> [Mon Apr 24 17:21:38 2017] 8342918 pages RAM
>>>> [Mon Apr 24 17:21:38 2017] 0 pages HighMem/MovableOnly
>>>> [Mon Apr 24 17:21:38 2017] 151195 pages reserved
>>>> [Mon Apr 24 17:21:38 2017] 0 pages cma reserved
>>>> [Mon Apr 24 17:21:38 2017] 0 pages hwpoisoned
>>>> [Mon Apr 24 17:21:38 2017] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
>>>> [Mon Apr 24 17:21:38 2017] [ 566] 0 566 15064 460 33 3 1108 0 systemd-journal
>>>> [Mon Apr 24 17:21:38 2017] [ 602] 0 602 23693 182 16 3 0 0 lvmetad
>>>> [Mon Apr 24 17:21:38 2017] [ 613] 0 613 11241 589 21 3 264 -1000 systemd-udevd
>>>> [Mon Apr 24 17:21:38 2017] [ 1381] 100 1381 25081 440 19 3 25 0 systemd-timesyn
>>>> [Mon Apr 24 17:21:38 2017] [ 1447] 0 1447 1100 307 7 3 0 0 acpid
>>>> [Mon Apr 24 17:21:38 2017] [ 1449] 0 1449 7252 374 21 3 47 0 cron
>>>> [Mon Apr 24 17:21:38 2017] [ 1451] 0 1451 77253 994 19 3 10 0 lxcfs
>>>> [Mon Apr 24 17:21:38 2017] [ 1483] 0 1483 6511 413 18 3 42 0 atd
>>>> [Mon Apr 24 17:21:38 2017] [ 1505] 0 1505 7157 286 18 3 36 0 systemd-logind
>>>> [Mon Apr 24 17:21:38 2017] [ 1508] 104 1508 64099 376 27 4 712 0 rsyslogd
>>>> [Mon Apr 24 17:21:38 2017] [ 1510] 107 1510 10723 497 25 3 45 -900 dbus-daemon
>>>> [Mon Apr 24 17:21:38 2017] [ 1521] 0 1521 68970 178 38 3 170 0 accounts-daemon
>>>> [Mon Apr 24 17:21:38 2017] [ 1526] 0 1526 6548 785 16 3 63 0 smartd
>>>> [Mon Apr 24 17:21:38 2017] [ 1528] 0 1528 54412 146 31 5 1806 0 snapd
>>>> [Mon Apr 24 17:21:38 2017] [ 1578] 0 1578 3416 335 11 3 24 0 mdadm
>>>> [Mon Apr 24 17:21:38 2017] [ 1595] 0 1595 16380 470 35 3 157 -1000 sshd
>>>> [Mon Apr 24 17:21:38 2017] [ 1610] 0 1610 69295 303 40 4 57 0 polkitd
>>>> [Mon Apr 24 17:21:38 2017] [ 1618] 0 1618 1306 31 8 3 0 0 iscsid
>>>> [Mon Apr 24 17:21:38 2017] [ 1619] 0 1619 1431 877 8 3 0 -17 iscsid
>>>> [Mon Apr 24 17:21:38 2017] [ 1624] 0 1624 126363 8027 122 4 22441 0 glusterd
>>>> [Mon Apr 24 17:21:38 2017] [ 1688] 0 1688 4884 430 15 3 46 0 irqbalance
>>>> [Mon Apr 24 17:21:38 2017] [ 1699] 0 1699 3985 348 13 3 0 0 agetty
>>>> [Mon Apr 24 17:21:38 2017] [ 7001] 0 7001 500631 27874 145 5 3356 0 glusterfsd
>>>> [Mon Apr 24 17:21:38 2017] [ 8136] 0 8136 500631 28760 141 5 2390 0 glusterfsd
>>>> [Mon Apr 24 17:21:38 2017] [ 9280] 0 9280 533529 27752 135 5 3200 0 glusterfsd
>>>> [Mon Apr 24 17:21:38 2017] [12626] 111 12626 5991 420 16 3 113 0 nrpe
>>>> [Mon Apr 24 17:21:38 2017] [14342] 0 14342 533529 28377 135 5 2176 0 glusterfsd
>>>> [Mon Apr 24 17:21:38 2017] [14361] 0 14361 534063 29190 136 5 1972 0 glusterfsd
>>>> [Mon Apr 24 17:21:38 2017] [14380] 0 14380 533529 28104 136 6 2437 0 glusterfsd
>>>> [Mon Apr 24 17:21:38 2017] [14399] 0 14399 533529 27552 131 5 2808 0 glusterfsd
>>>> [Mon Apr 24 17:21:38 2017] [14418] 0 14418 533529 29588 138 5 2697 0 glusterfsd
>>>> [Mon Apr 24 17:21:38 2017] [14437] 0 14437 517080 28671 146 5 2170 0 glusterfsd
>>>> [Mon Apr 24 17:21:38 2017] [14456] 0 14456 533529 28083 139 5 3359 0 glusterfsd
>>>> [Mon Apr 24 17:21:38 2017] [14475] 0 14475 533529 28054 134 5 2954 0 glusterfsd
>>>> [Mon Apr 24 17:21:38 2017] [14494] 0 14494 533529 28594 135 5 2311 0 glusterfsd
>>>> [Mon Apr 24 17:21:38 2017] [14513] 0 14513 533529 28911 138 5 2833 0 glusterfsd
>>>> [Mon Apr 24 17:21:38 2017] [14532] 0 14532 533529 28259 134 6 3145 0 glusterfsd
>>>> [Mon Apr 24 17:21:38 2017] [14551] 0 14551 533529 27875 138 5 2267 0 glusterfsd
>>>> [Mon Apr 24 17:21:38 2017] [14570] 0 14570 484716 28247 142 5 2875 0 glusterfsd
>>>> [Mon Apr 24 17:21:38 2017] [27646] 0 27646 3697561 202086 2830 17 16528 0 glusterfs
>>>> [Mon Apr 24 17:21:38 2017] [27655] 0 27655 787371 29588 197 6 25472 0 glusterfs
>>>> [Mon Apr 24 17:21:38 2017] [27665] 0 27665 689585 605 108 6 7008 0 glusterfs
>>>> [Mon Apr 24 17:21:38 2017] [29878] 0 29878 193833 36054 241 4 41182 0 glusterfs
>>>> [Mon Apr 24 17:21:38 2017] Out of memory: Kill process 27646 (glusterfs) score 17 or sacrifice child
>>>> [Mon Apr 24 17:21:38 2017] Killed process 27646 (glusterfs) total-vm:14790244kB, anon-rss:795040kB, file-rss:13304kB
>>>> /var/log/glusterfs/glusterd.log
>>>> [2017-04-24 11:53:51.359603] I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd.
>>>> What could have gone wrong?
>>>> regards
>>>> Amudhan
>>>>
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users@xxxxxxxxxxx
>>>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>>>
>>>
>>> --
>>> Pranith
>>
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users@xxxxxxxxxxx
>> http://lists.gluster.org/mailman/listinfo/gluster-users
>
>
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users