Re: Possible race condition bug with tiered volume

Dustin Black <dblack@xxxxxxxxxx> · Tue, 18 Oct 2016 19:09:29 -0400

Dang. I always think I've included all the details and inevitably leave out something important. :-/
I'm mobile and don't have the exact version in front of me, but this is a recent, if not the latest, RHGS release on RHEL 7.2.

On Oct 18, 2016 7:04 PM, "Dan Lambright" <dlambrig@xxxxxxxxxx> wrote:
Dustin,

What code level are you running? I often run smallfile on upstream code with tiered volumes and have not seen this.

Sure, one of us will get back to you.

Unfortunately, gluster has a lot of protocol overhead (LOOKUPs), and that overhead overwhelms the boost in transfer speed you would otherwise get for small files. A presentation at the Berlin gluster summit evaluated this. The expectation is that md-cache will go a long way toward helping with that before too long.

Dan

----- Original Message -----

> From: "Dustin Black" <dblack@xxxxxxxxxx>
> To: gluster-devel@xxxxxxxxxxx
> Cc: "Annette Clewett" <aclewett@xxxxxxxxxx>
> Sent: Tuesday, October 18, 2016 4:30:04 PM
> Subject: Possible race condition bug with tiered volume

>

> I have a 3x2 hot tier on NVMe drives with a 3x2 cold tier on RAID6 drives.

>

> # gluster vol info 1nvme-distrep3x2
> Volume Name: 1nvme-distrep3x2
> Type: Tier
> Volume ID: 21e3fc14-c35c-40c5-8e46-c258c1302607
> Status: Started
> Number of Bricks: 12
> Transport-type: tcp
> Hot Tier :
> Hot Tier Type : Distributed-Replicate
> Number of Bricks: 3 x 2 = 6
> Brick1: n5:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Brick2: n4:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Brick3: n3:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Brick4: n2:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Brick5: n1:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Brick6: n0:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Cold Tier:
> Cold Tier Type : Distributed-Replicate
> Number of Bricks: 3 x 2 = 6
> Brick7: n0:/rhgs/coldbricks/1nvme-distrep3x2
> Brick8: n1:/rhgs/coldbricks/1nvme-distrep3x2
> Brick9: n2:/rhgs/coldbricks/1nvme-distrep3x2
> Brick10: n3:/rhgs/coldbricks/1nvme-distrep3x2
> Brick11: n4:/rhgs/coldbricks/1nvme-distrep3x2
> Brick12: n5:/rhgs/coldbricks/1nvme-distrep3x2
> Options Reconfigured:
> cluster.tier-mode: cache
> features.ctr-enabled: on
> performance.readdir-ahead: on

>

>

> I am attempting to run the 'smallfile' benchmark tool on this volume. The
> 'smallfile' tool creates a starting gate directory and files in a shared
> filesystem location. The first run (write) works as expected.

>

> # smallfile_cli.py --threads 12 --file-size 4096 --files 300 \
>     --top /rhgs/client/1nvme-distrep3x2 \
>     --host-set c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11 \
>     --prefix test1 --stonewall Y \
>     --network-sync-dir /rhgs/client/1nvme-distrep3x2/smf1 --operation create

>

> For the second run (read), I believe that smallfile first attempts to
> 'rm -rf' the "network-sync-dir" path. That removal fails with ENOTEMPTY,
> which causes the run to fail.

>

> # smallfile_cli.py --threads 12 --file-size 4096 --files 300 \
>     --top /rhgs/client/1nvme-distrep3x2 \
>     --host-set c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11 \
>     --prefix test1 --stonewall Y \
>     --network-sync-dir /rhgs/client/1nvme-distrep3x2/smf1 --operation create

> ...
> Traceback (most recent call last):
>   File "/root/bin/smallfile_cli.py", line 280, in <module>
>     run_workload()
>   File "/root/bin/smallfile_cli.py", line 270, in run_workload
>     return run_multi_host_workload(params)
>   File "/root/bin/smallfile_cli.py", line 62, in run_multi_host_workload
>     sync_files.create_top_dirs(master_invoke, True)
>   File "/root/bin/sync_files.py", line 27, in create_top_dirs
>     shutil.rmtree(master_invoke.network_dir)
>   File "/usr/lib64/python2.7/shutil.py", line 256, in rmtree
>     onerror(os.rmdir, path, sys.exc_info())
>   File "/usr/lib64/python2.7/shutil.py", line 254, in rmtree
>     os.rmdir(path)
> OSError: [Errno 39] Directory not empty: '/rhgs/client/1nvme-distrep3x2/smf1'

>

>
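The traceback above ends in shutil.rmtree() failing at its final os.rmdir() with ENOTEMPTY. While the root cause is being investigated, a harness could retry the removal a few times on the client side. The following is only a minimal sketch, assuming Python 3; rmtree_with_retry and its parameters are hypothetical and not part of smallfile. If a stale entry stays on a brick, as in this report, the retries will be exhausted and the original error is re-raised.

```python
import errno
import shutil
import time


def rmtree_with_retry(path, attempts=5, delay=0.5, rmtree=shutil.rmtree):
    """Remove a directory tree, retrying when the removal reports ENOTEMPTY.

    Hypothetical helper, not part of smallfile. It only papers over a
    transient ENOTEMPTY; it does not clean up a stale entry on a brick.
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, attempts + 1):
        try:
            rmtree(path)
            return attempt
        except FileNotFoundError:
            # The tree is already gone, so there is nothing left to do.
            return attempt
        except OSError as exc:
            # Give up on any other error, or once the retries run out.
            if exc.errno != errno.ENOTEMPTY or attempt == attempts:
                raise
            time.sleep(delay)
```

The rmtree parameter exists only so the retry logic can be exercised without a Gluster mount; by default it is shutil.rmtree.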

> From the client perspective, the directory is clearly empty.

>

> # ls -a /rhgs/client/1nvme-distrep3x2/smf1/

> . ..

>

>

> And a quick search on the bricks shows that the hot tier on the last replica
> pair is the offender.

>

> # for i in {0..5}; do ssh n$i "hostname; \
>     ls /rhgs/coldbricks/1nvme-distrep3x2/smf1 | wc -l; \
>     ls /rhgs/hotbricks/1nvme-distrep3x2-hot/smf1 | wc -l"; done
> rhosd0
> 0
> 0
> rhosd1
> 0
> 0
> rhosd2
> 0
> 0
> rhosd3
> 0
> 0
> rhosd4
> 0
> 1
> rhosd5
> 0
> 1

>

>
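The per-brick counts above say that something is left behind but not what it is. A minimal sketch, assuming Python 3 and that it is run locally on each storage node, that reports the actual entry names, including dot-files that a plain ls would hide. leftover_entries is a hypothetical helper; the brick paths are the ones from the volume info above.

```python
import os


def leftover_entries(brick_dirs):
    """Map each existing, non-empty brick directory to its entries.

    Hypothetical helper. os.listdir() returns dot-files as well, so
    hidden leftovers show up too. Missing directories are skipped.
    """
    report = {}
    for path in brick_dirs:
        try:
            entries = sorted(os.listdir(path))
        except FileNotFoundError:
            continue  # this directory is not present on this node
        if entries:
            report[path] = entries
    return report


if __name__ == "__main__":
    bricks = [
        "/rhgs/coldbricks/1nvme-distrep3x2/smf1",
        "/rhgs/hotbricks/1nvme-distrep3x2-hot/smf1",
    ]
    for path, entries in leftover_entries(bricks).items():
        print(path, entries)
```

Knowing the leftover file name on the hot-tier bricks would help narrow down which smallfile synchronization file is involved.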

> (For the record, multiple runs of this reproducer show that it is
> consistently the hot tier that is to blame, but it is not always the same
> replica pair.)

>

>

> Can someone try recreating this scenario to see if the problem is consistent?
> Please reach out if you need me to provide any further details.

>

>

> Dustin Black, RHCA
> Senior Architect, Software-Defined Storage
> Red Hat, Inc.
> (o) +1.212.510.4138 (m) +1.215.821.7423
> dustin@xxxxxxxxxx

>

> _______________________________________________

> Gluster-devel mailing list

> Gluster-devel@xxxxxxxxxxx

> http://www.gluster.org/mailman/listinfo/gluster-devel
