Re: Stale locks on shards

On 29 Jan 2018 10:50 am, "Samuli Heinonen" <samppah@xxxxxxxxxxxxx> wrote:
Hi!

Yes, thank you for asking. I found this line in the production environment:
lgetxattr("/tmp/zone2-ssd1-vmstor1.s6jvPu//.shard/f349ffbd-a423-4fb2-b83c-2d1d5e78e1fb.32", "glusterfs.clrlk.tinode.kblocked", 0x7f2d7c4379f0, 4096) = -1 EPERM (Operation not permitted)

I was expecting .kall instead of .kblocked here.
Did you change the CLI command to use kind blocked?


And this one in test environment (with posix locks):
lgetxattr("/tmp/g1.gHj4Bw//file38", "glusterfs.clrlk.tposix.kblocked", "box1:/gluster/1/export/: posix blocked locks=1 granted locks=0", 4096) = 77

In the test environment I tried running the following command, which seemed to release the gluster locks:

getfattr -n glusterfs.clrlk.tposix.kblocked file38

So I think it would go like this in the production environment with locks on shards (using the aux-gfid-mount mount option):
getfattr -n glusterfs.clrlk.tinode.kall .shard/f349ffbd-a423-4fb2-b83c-2d1d5e78e1fb.32
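
Spelled out end to end, I think it would be roughly the following (the server name and mount point are only placeholders, and since the volume is configured for RDMA a transport mount option may also be needed):

# mount the volume with gfid access enabled; server/mountpoint are examples
mount -t glusterfs -o aux-gfid-mount sto1z2.xxx:/zone2-ssd1-vmstor1 /mnt/vmstor1-aux

# release all inode locks on the shard by reading the clear-locks virtual xattr
cd /mnt/vmstor1-aux
getfattr -n glusterfs.clrlk.tinode.kall .shard/f349ffbd-a423-4fb2-b83c-2d1d5e78e1fb.32

# unmount afterwards
cd / && umount /mnt/vmstor1-aux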

I haven't been able to try this out in the production environment yet.

Is there perhaps something else I should be aware of?

Would you be able to tell me more about bricks crashing after releasing locks? Under what circumstances does that happen? Is it only the process exporting the brick that crashes, or is there a possibility of data corruption?

No data corruption. The brick process where you ran clear-locks may crash.


Best regards,
Samuli Heinonen


Pranith Kumar Karampuri wrote:
Hi,
      Did you find the command from strace?

On 25 Jan 2018 1:52 pm, "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx> wrote:



    On Thu, Jan 25, 2018 at 1:49 PM, Samuli Heinonen <samppah@xxxxxxxxxxxxx> wrote:

        Pranith Kumar Karampuri wrote on 25.01.2018 07:09:

            On Thu, Jan 25, 2018 at 2:27 AM, Samuli Heinonen <samppah@xxxxxxxxxxxxx> wrote:

                Hi!

                Thank you very much for your help so far. Could you please give an
                example command showing how to use aux-gfid-mount to remove locks?
                "gluster vol clear-locks" seems to mount the volume by itself.


            You are correct, sorry, this was implemented around 7 years back and
            I forgot that bit about it :-(. Essentially it becomes a getxattr
            syscall on the file. Could you give me the clear-locks command you
            were trying to execute and I can probably convert it to the getfattr
            command?


        I have been testing this in a test environment with the command:
        gluster vol clear-locks g1 /.gfid/14341ccb-df7b-4f92-90d5-7814431c5a1c kind all inode


    Could you do an strace of glusterd when this happens? It will have a
    getxattr with "glusterfs.clrlk" in the key. You need to execute that
    on the aux-gfid mount.
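
    For example, something along these lines (the pidof lookup and the grep
    filter are only an example; adjust to your setup):

    # shell 1: trace getxattr calls from glusterd and any helper processes it spawns
    strace -f -e trace=getxattr,lgetxattr -p "$(pidof glusterd)" 2>&1 | grep glusterfs.clrlk

    # shell 2: re-run the clear-locks command you used
    gluster vol clear-locks g1 /.gfid/14341ccb-df7b-4f92-90d5-7814431c5a1c kind all inode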




                Best regards,
                Samuli Heinonen

                    Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx>
                    23 January 2018 at 10.30

                    On Tue, Jan 23, 2018 at 1:38 PM, Samuli Heinonen
                    <samppah@xxxxxxxxxxxxx> wrote:

                    Pranith Kumar Karampuri wrote on 23.01.2018 09:34:

                    On Mon, Jan 22, 2018 at 12:33 AM, Samuli Heinonen
                    <samppah@xxxxxxxxxxxxx> wrote:

                    Hi again,

                    here is more information regarding the issue described earlier.

                    It looks like self healing is stuck. According to "heal statistics"
                    the crawl began at Sat Jan 20 12:56:19 2018 and it's still going on
                    (it's around Sun Jan 21 20:30 when writing this). However,
                    glustershd.log says that the last heal was completed at "2018-01-20
                    11:00:13.090697" (which is 13:00 UTC+2). Also, "heal info" has been
                    running now for over 16 hours without printing any information. In
                    the statedump I can see that the storage nodes have locks on files
                    and some of those are blocked. I.e. here again it says that
                    ovirt8z2 is holding an active lock even though ovirt8z2 crashed
                    after the lock was granted:

                    [xlator.features.locks.zone2-ssd1-vmstor1-locks.inode]
                    path=/.shard/3d55f8cc-cda9-489a-b0a3-fd0f43d67876.27
                    mandatory=0
                    inodelk-count=3
                    lock-dump.domain.domain=zone2-ssd1-vmstor1-replicate-0:self-heal
                    inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=d0c6d857a87f0000, client=0x7f885845efa0, connection-id=sto2z2.xxx-10975-2018/01/20-10:56:14:649541-zone2-ssd1-vmstor1-client-0-0-0, granted at 2018-01-20 10:59:52
                    lock-dump.domain.domain=zone2-ssd1-vmstor1-replicate-0:metadata
                    lock-dump.domain.domain=zone2-ssd1-vmstor1-replicate-0
                    inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 3420, owner=d8b9372c397f0000, client=0x7f8858410be0, connection-id=ovirt8z2.xxx.com-5652-2017/12/27-09:49:02:946825-zone2-ssd1-vmstor1-client-0-7-0, granted at 2018-01-20 08:57:23
                    inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=d0c6d857a87f0000, client=0x7f885845efa0, connection-id=sto2z2.xxx-10975-2018/01/20-10:56:14:649541-zone2-ssd1-vmstor1-client-0-0-0, blocked at 2018-01-20 10:59:52

                    I'd also like to add that the volume had an arbiter brick before
                    the crash happened. We decided to remove it because we thought
                    that it was causing issues. However, now I think that this was
                    unnecessary. After the crash the arbiter logs had lots of messages
                    like this:
                    [2018-01-20 10:19:36.515717] I [MSGID: 115072]
                    [server-rpc-fops.c:1640:server_setattr_cbk]
                    0-zone2-ssd1-vmstor1-server: 37374187: SETATTR
                    <gfid:a52055bd-e2e9-42dd-92a3-e96b693bcafe>
                    (a52055bd-e2e9-42dd-92a3-e96b693bcafe) ==> (Operation not
                    permitted) [Operation not permitted]

                    Is there any way to force self heal to stop? Any help would be
                    very much appreciated :)

                    Exposing .shard to a normal mount is opening a can of worms. You
                    should probably look at mounting the volume with the gfid
                    aux-mount, where you can access a file with
                    <path-to-mount>/.gfid/<gfid-string> to clear locks on it.

                    Mount command:  mount -t glusterfs -o aux-gfid-mount vm1:test /mnt/testvol

                    A gfid string will have some hyphens like:
                    11118443-1894-4273-9340-4b212fa1c0e4
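
                    For instance, once mounted like that, a file can be addressed
                    purely by its gfid (using the example string above):

                    stat /mnt/testvol/.gfid/11118443-1894-4273-9340-4b212fa1c0e4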

                    That said, the next disconnect on the brick where you successfully
                    did the clear-locks will crash the brick. There was a bug in the
                    3.8.x series with clear-locks which was fixed in 3.9.0 as part of
                    a feature. The self-heal deadlocks that you witnessed are also
                    fixed in the 3.10 release.

                    Thank you for the answer. Could you please tell me more about the
                    crash? What will actually happen, or is there a bug report about
                    it? I just want to make sure that we can do everything to secure
                    the data on the bricks. We will look into upgrading, but we have
                    to make sure that the new version works for us, and of course get
                    self healing working before doing anything :)

                    The locks xlator/module maintains a list of locks that are granted
                    to a client. Clear-locks had an issue where it forgot to remove
                    the lock from this list, so after a clear-lock the connection's
                    list ends up pointing to data that has been freed. When a
                    disconnect happens, all the locks that are granted to a client
                    need to be unlocked, so the process starts traversing through this
                    list, and when it tries to access this freed data it leads to a
                    crash. I found it while reviewing a feature patch sent by the
                    Facebook folks to the locks xlator
                    (http://review.gluster.org/14816) for 3.9.0, and they also fixed
                    this bug as part of that feature patch.

                    Br,
                    Samuli

                    3.8.x is EOLed, so I recommend you upgrade to a supported version
                    soon.

                    Best regards,
                    Samuli Heinonen

                    Samuli Heinonen
                    20 January 2018 at 21.57

                    Hi all!

                    One hypervisor in our virtualization environment crashed and now
                    some of the VM images cannot be accessed. After investigation we
                    found out that there were lots of images that still had an active
                    lock held by the crashed hypervisor. We were able to remove locks
                    from "regular files", but it doesn't seem possible to remove locks
                    from shards.

                    We are running GlusterFS 3.8.15 on all nodes.

                    Here is the part of the statedump that shows a shard having an
                    active lock from the crashed node:

                    [xlator.features.locks.zone2-ssd1-vmstor1-locks.inode]
                    path=/.shard/75353c17-d6b8-485d-9baf-fd6c700e39a1.21
                    mandatory=0
                    inodelk-count=1
                    lock-dump.domain.domain=zone2-ssd1-vmstor1-replicate-0:metadata
                    lock-dump.domain.domain=zone2-ssd1-vmstor1-replicate-0:self-heal
                    lock-dump.domain.domain=zone2-ssd1-vmstor1-replicate-0
                    inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 3568, owner=14ce372c397f0000, client=0x7f3198388770, connection-id=ovirt8z2.xxx-5652-2017/12/27-09:49:02:946825-zone2-ssd1-vmstor1-client-1-7-0, granted at 2018-01-20 08:57:24
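
                    For reference, a statedump like the one above can be generated
                    roughly as follows on a storage node (assuming the default dump
                    directory /var/run/gluster; the grep is just one way to pick out
                    the lock sections):

                    gluster volume statedump zone2-ssd1-vmstor1
                    grep -A 8 'locks.inode' /var/run/gluster/*.dump.*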

                    If we try to run clear-locks we get the following error message:
                    # gluster volume clear-locks zone2-ssd1-vmstor1 /.shard/75353c17-d6b8-485d-9baf-fd6c700e39a1.21 kind all inode
                    Volume clear-locks unsuccessful
                    clear-locks getxattr command failed. Reason: Operation not permitted

                    Gluster vol info if needed:
                    Volume Name: zone2-ssd1-vmstor1
                    Type: Replicate
                    Volume ID: b6319968-690b-4060-8fff-b212d2295208
                    Status: Started
                    Snapshot Count: 0
                    Number of Bricks: 1 x 2 = 2
                    Transport-type: rdma
                    Bricks:
                    Brick1: sto1z2.xxx:/ssd1/zone2-vmstor1/export
                    Brick2: sto2z2.xxx:/ssd1/zone2-vmstor1/export
                    Options Reconfigured:
                    cluster.shd-wait-qlength: 10000
                    cluster.shd-max-threads: 8
                    cluster.locking-scheme: granular
                    performance.low-prio-threads: 32
                    cluster.data-self-heal-algorithm: full
                    performance.client-io-threads: off
                    storage.linux-aio: off
                    performance.readdir-ahead: on
                    client.event-threads: 16
                    server.event-threads: 16
                    performance.strict-write-ordering: off
                    performance.quick-read: off
                    performance.read-ahead: on
                    performance.io-cache: off
                    performance.stat-prefetch: off
                    cluster.eager-lock: enable
                    network.remote-dio: on
                    cluster.quorum-type: none
                    network.ping-timeout: 22
                    performance.write-behind: off
                    nfs.disable: on
                    features.shard: on
                    features.shard-block-size: 512MB
                    storage.owner-uid: 36
                    storage.owner-gid: 36
                    performance.io-thread-count: 64
                    performance.cache-size: 2048MB
                    performance.write-behind-window-size: 256MB
                    server.allow-insecure: on
                    cluster.ensure-durability: off
                    config.transport: rdma
                    server.outstanding-rpc-limit: 512
                    diagnostics.brick-log-level: INFO

                    Any recommendations on how to proceed from here?

                    Best regards,
                    Samuli Heinonen

                    _______________________________________________
                    Gluster-users mailing list
                    Gluster-users@xxxxxxxxxxx
                    http://lists.gluster.org/mailman/listinfo/gluster-users


                    --

                    Pranith



                    --
                    Pranith
                    Samuli Heinonen <samppah@xxxxxxxxxxxxx>
                    21 January 2018 at 21.03
                    Hi again,

                    here is more information regarding the issue described earlier.

                    It looks like self healing is stuck. According to "heal
                    statistics" the crawl began at Sat Jan 20 12:56:19 2018 and it's
                    still going on (it's around Sun Jan 21 20:30 when writing this).
                    However, glustershd.log says that the last heal was completed at
                    "2018-01-20 11:00:13.090697" (which is 13:00 UTC+2). Also, "heal
                    info" has been running now for over 16 hours without printing any
                    information. In the statedump I can see that the storage nodes
                    have locks on files and some of those are blocked. I.e. here again
                    it says that ovirt8z2 is holding an active lock even though
                    ovirt8z2 crashed after the lock was granted:

                    [xlator.features.locks.zone2-ssd1-vmstor1-locks.inode]
                    path=/.shard/3d55f8cc-cda9-489a-b0a3-fd0f43d67876.27
                    mandatory=0
                    inodelk-count=3
                    lock-dump.domain.domain=zone2-ssd1-vmstor1-replicate-0:self-heal
                    inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=d0c6d857a87f0000, client=0x7f885845efa0, connection-id=sto2z2.xxx-10975-2018/01/20-10:56:14:649541-zone2-ssd1-vmstor1-client-0-0-0, granted at 2018-01-20 10:59:52
                    lock-dump.domain.domain=zone2-ssd1-vmstor1-replicate-0:metadata
                    lock-dump.domain.domain=zone2-ssd1-vmstor1-replicate-0
                    inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 3420, owner=d8b9372c397f0000, client=0x7f8858410be0, connection-id=ovirt8z2.xxx.com-5652-2017/12/27-09:49:02:946825-zone2-ssd1-vmstor1-client-0-7-0, granted at 2018-01-20 08:57:23
                    inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=d0c6d857a87f0000, client=0x7f885845efa0, connection-id=sto2z2.xxx-10975-2018/01/20-10:56:14:649541-zone2-ssd1-vmstor1-client-0-0-0, blocked at 2018-01-20 10:59:52

                    I'd also like to add that the volume had an arbiter brick before
                    the crash happened. We decided to remove it because we thought
                    that it was causing issues. However, now I think that this was
                    unnecessary. After the crash the arbiter logs had lots of messages
                    like this:
                    [2018-01-20 10:19:36.515717] I [MSGID: 115072]
                    [server-rpc-fops.c:1640:server_setattr_cbk]
                    0-zone2-ssd1-vmstor1-server: 37374187: SETATTR
                    <gfid:a52055bd-e2e9-42dd-92a3-e96b693bcafe>
                    (a52055bd-e2e9-42dd-92a3-e96b693bcafe) ==> (Operation not
                    permitted) [Operation not permitted]

                    Is there any way to force self heal to stop? Any help would be
                    very much appreciated :)

                    Best regards,
                    Samuli Heinonen

                    _______________________________________________
                    Gluster-users mailing list
                    Gluster-users@xxxxxxxxxxx
                    http://lists.gluster.org/mailman/listinfo/gluster-users

                    Samuli Heinonen <samppah@xxxxxxxxxxxxx>

                    20 January 2018 at 21.57
                    Hi all!

                    One hypervisor in our virtualization environment crashed and now
                    some of the VM images cannot be accessed. After investigation we
                    found out that there were lots of images that still had an active
                    lock held by the crashed hypervisor. We were able to remove locks
                    from "regular files", but it doesn't seem possible to remove locks
                    from shards.

                    We are running GlusterFS 3.8.15 on all nodes.

                    Here is the part of the statedump that shows a shard having an
                    active lock from the crashed node:
                    [xlator.features.locks.zone2-ssd1-vmstor1-locks.inode]
                    path=/.shard/75353c17-d6b8-485d-9baf-fd6c700e39a1.21
                    mandatory=0
                    inodelk-count=1
                    lock-dump.domain.domain=zone2-ssd1-vmstor1-replicate-0:metadata
                    lock-dump.domain.domain=zone2-ssd1-vmstor1-replicate-0:self-heal
                    lock-dump.domain.domain=zone2-ssd1-vmstor1-replicate-0
                    inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 3568, owner=14ce372c397f0000, client=0x7f3198388770, connection-id=ovirt8z2.xxx-5652-2017/12/27-09:49:02:946825-zone2-ssd1-vmstor1-client-1-7-0, granted at 2018-01-20 08:57:24

                    If we try to run clear-locks we get the following error message:
                    # gluster volume clear-locks zone2-ssd1-vmstor1 /.shard/75353c17-d6b8-485d-9baf-fd6c700e39a1.21 kind all inode
                    Volume clear-locks unsuccessful
                    clear-locks getxattr command failed. Reason: Operation not permitted

                    Gluster vol info if needed:
                    Volume Name: zone2-ssd1-vmstor1
                    Type: Replicate
                    Volume ID: b6319968-690b-4060-8fff-b212d2295208
                    Status: Started
                    Snapshot Count: 0
                    Number of Bricks: 1 x 2 = 2
                    Transport-type: rdma
                    Bricks:
                    Brick1: sto1z2.xxx:/ssd1/zone2-vmstor1/export
                    Brick2: sto2z2.xxx:/ssd1/zone2-vmstor1/export
                    Options Reconfigured:
                    cluster.shd-wait-qlength: 10000
                    cluster.shd-max-threads: 8
                    cluster.locking-scheme: granular
                    performance.low-prio-threads: 32
                    cluster.data-self-heal-algorithm: full
                    performance.client-io-threads: off
                    storage.linux-aio: off
                    performance.readdir-ahead: on
                    client.event-threads: 16
                    server.event-threads: 16
                    performance.strict-write-ordering: off
                    performance.quick-read: off
                    performance.read-ahead: on
                    performance.io-cache: off
                    performance.stat-prefetch: off
                    cluster.eager-lock: enable
                    network.remote-dio: on
                    cluster.quorum-type: none
                    network.ping-timeout: 22
                    performance.write-behind: off
                    nfs.disable: on
                    features.shard: on
                    features.shard-block-size: 512MB
                    storage.owner-uid: 36
                    storage.owner-gid: 36
                    performance.io-thread-count: 64
                    performance.cache-size: 2048MB
                    performance.write-behind-window-size: 256MB
                    server.allow-insecure: on
                    cluster.ensure-durability: off
                    config.transport: rdma
                    server.outstanding-rpc-limit: 512
                    diagnostics.brick-log-level: INFO

                    Any recommendations on how to proceed from here?

                    Best regards,
                    Samuli Heinonen

                    _______________________________________________
                    Gluster-users mailing list
                    Gluster-users@xxxxxxxxxxx
                    http://lists.gluster.org/mailman/listinfo/gluster-users


            --

            Pranith






    --
    Pranith


_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users
