Dustin,

What code level is this? I often run smallfile on upstream code with tiered volumes and have not seen this. Sure, one of us will get back to you.

Unfortunately, gluster has a lot of protocol overhead (LOOKUPs), and it overwhelms the boost in transfer speed you would otherwise get for small files. A presentation at the Berlin gluster summit evaluated this. The expectation is that md-cache will go a long way towards helping with that before too long.

Dan

----- Original Message -----
> From: "Dustin Black" <dblack@xxxxxxxxxx>
> To: gluster-devel@xxxxxxxxxxx
> Cc: "Annette Clewett" <aclewett@xxxxxxxxxx>
> Sent: Tuesday, October 18, 2016 4:30:04 PM
> Subject: Possible race condition bug with tiered volume
>
> I have a 3x2 hot tier on NVMe drives with a 3x2 cold tier on RAID6 drives.
>
> # gluster vol info 1nvme-distrep3x2
> Volume Name: 1nvme-distrep3x2
> Type: Tier
> Volume ID: 21e3fc14-c35c-40c5-8e46-c258c1302607
> Status: Started
> Number of Bricks: 12
> Transport-type: tcp
> Hot Tier :
> Hot Tier Type : Distributed-Replicate
> Number of Bricks: 3 x 2 = 6
> Brick1: n5:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Brick2: n4:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Brick3: n3:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Brick4: n2:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Brick5: n1:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Brick6: n0:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Cold Tier:
> Cold Tier Type : Distributed-Replicate
> Number of Bricks: 3 x 2 = 6
> Brick7: n0:/rhgs/coldbricks/1nvme-distrep3x2
> Brick8: n1:/rhgs/coldbricks/1nvme-distrep3x2
> Brick9: n2:/rhgs/coldbricks/1nvme-distrep3x2
> Brick10: n3:/rhgs/coldbricks/1nvme-distrep3x2
> Brick11: n4:/rhgs/coldbricks/1nvme-distrep3x2
> Brick12: n5:/rhgs/coldbricks/1nvme-distrep3x2
> Options Reconfigured:
> cluster.tier-mode: cache
> features.ctr-enabled: on
> performance.readdir-ahead: on
>
> I am attempting to run the 'smallfile' benchmark tool on this volume. The
> 'smallfile' tool creates a starting gate directory and files in a shared
> filesystem location. The first run (write) works as expected.
>
> # smallfile_cli.py --threads 12 --file-size 4096 --files 300 --top
> /rhgs/client/1nvme-distrep3x2 --host-set
> c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11 --prefix test1 --stonewall Y
> --network-sync-dir /rhgs/client/1nvme-distrep3x2/smf1 --operation create
>
> For the second run (read), I believe that smallfile first attempts to 'rm
> -rf' the "network-sync-dir" path, which fails with ENOTEMPTY, causing the
> run to fail.
>
> # smallfile_cli.py --threads 12 --file-size 4096 --files 300 --top
> /rhgs/client/1nvme-distrep3x2 --host-set
> c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11 --prefix test1 --stonewall Y
> --network-sync-dir /rhgs/client/1nvme-distrep3x2/smf1 --operation create
> ...
> Traceback (most recent call last):
>   File "/root/bin/smallfile_cli.py", line 280, in <module>
>     run_workload()
>   File "/root/bin/smallfile_cli.py", line 270, in run_workload
>     return run_multi_host_workload(params)
>   File "/root/bin/smallfile_cli.py", line 62, in run_multi_host_workload
>     sync_files.create_top_dirs(master_invoke, True)
>   File "/root/bin/sync_files.py", line 27, in create_top_dirs
>     shutil.rmtree(master_invoke.network_dir)
>   File "/usr/lib64/python2.7/shutil.py", line 256, in rmtree
>     onerror(os.rmdir, path, sys.exc_info())
>   File "/usr/lib64/python2.7/shutil.py", line 254, in rmtree
>     os.rmdir(path)
> OSError: [Errno 39] Directory not empty: '/rhgs/client/1nvme-distrep3x2/smf1'
>
> From the client perspective, the directory is clearly empty.
>
> # ls -a /rhgs/client/1nvme-distrep3x2/smf1/
> . ..
>
> And a quick search on the bricks shows that the hot tier on the last replica
> pair is the offender.
>
> # for i in {0..5}; do ssh n$i "hostname; ls
> /rhgs/coldbricks/1nvme-distrep3x2/smf1 | wc -l; ls
> /rhgs/hotbricks/1nvme-distrep3x2-hot/smf1 | wc -l"; done
> rhosd0
> 0
> 0
> rhosd1
> 0
> 0
> rhosd2
> 0
> 0
> rhosd3
> 0
> 0
> rhosd4
> 0
> 1
> rhosd5
> 0
> 1
>
> (For the record, multiple runs of this reproducer show that it is
> consistently the hot tier that is to blame, but it is not always the same
> replica pair.)
>
> Can someone try recreating this scenario to see if the problem is consistent?
> Please reach out if you need me to provide any further details.
>
> Dustin Black, RHCA
> Senior Architect, Software-Defined Storage
> Red Hat, Inc.
> (o) +1.212.510.4138 (m) +1.215.821.7423
> dustin@xxxxxxxxxx

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
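For anyone who wants to try recreating this without standing up the full smallfile harness, below is a minimal, untested Python sketch of the create/unlink/rmtree cycle that the --network-sync-dir handling boils down to. The mount point, file count, and file size are assumptions lifted from the command line in the report, and it is an open question whether a single client is enough to hit the race, so treat it as a starting point rather than a faithful reproduction of smallfile's multi-host behaviour.

#!/usr/bin/env python
# Sketch only: approximates the pattern from the report, not smallfile itself.
import errno
import os
import shutil

MOUNT = '/rhgs/client/1nvme-distrep3x2'   # assumed fuse mount of the tiered volume
SYNC_DIR = os.path.join(MOUNT, 'smf1')    # plays the role of --network-sync-dir


def one_pass(nfiles=300, size=4096):
    os.mkdir(SYNC_DIR)
    for i in range(nfiles):
        with open(os.path.join(SYNC_DIR, 'gate.%d' % i), 'w') as f:
            f.write('x' * size)
    for name in os.listdir(SYNC_DIR):
        os.unlink(os.path.join(SYNC_DIR, name))
    # This is the step that fails in the report: the client sees an empty
    # directory, but a hot-tier brick apparently still holds an entry, so the
    # final rmdir returns ENOTEMPTY.
    shutil.rmtree(SYNC_DIR)


if __name__ == '__main__':
    for attempt in range(1, 21):
        try:
            one_pass()
        except OSError as e:
            if e.errno == errno.ENOTEMPTY:
                print('reproduced on pass %d: %s' % (attempt, e))
                break
            raise
    else:
        print('no ENOTEMPTY after 20 passes')

If a single client never trips it, the same loop could be launched in parallel from the c0-c11 clients against the same SYNC_DIR, which is closer to what smallfile actually does.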