I have a tiered volume with a 3x2 hot tier on NVMe drives and a 3x2 cold tier on RAID6 drives:
# gluster vol info 1nvme-distrep3x2
Volume Name: 1nvme-distrep3x2
Type: Tier
Volume ID: 21e3fc14-c35c-40c5-8e46-c258c1302607
Status: Started
Number of Bricks: 12
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 3 x 2 = 6
Brick1: n5:/rhgs/hotbricks/1nvme-distrep3x2-hot
Brick2: n4:/rhgs/hotbricks/1nvme-distrep3x2-hot
Brick3: n3:/rhgs/hotbricks/1nvme-distrep3x2-hot
Brick4: n2:/rhgs/hotbricks/1nvme-distrep3x2-hot
Brick5: n1:/rhgs/hotbricks/1nvme-distrep3x2-hot
Brick6: n0:/rhgs/hotbricks/1nvme-distrep3x2-hot
Cold Tier:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 3 x 2 = 6
Brick7: n0:/rhgs/coldbricks/1nvme-distrep3x2
Brick8: n1:/rhgs/coldbricks/1nvme-distrep3x2
Brick9: n2:/rhgs/coldbricks/1nvme-distrep3x2
Brick10: n3:/rhgs/coldbricks/1nvme-distrep3x2
Brick11: n4:/rhgs/coldbricks/1nvme-distrep3x2
Brick12: n5:/rhgs/coldbricks/1nvme-distrep3x2
Options Reconfigured:
cluster.tier-mode: cache
features.ctr-enabled: on
performance.readdir-ahead: on
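(For anyone who wants to recreate the volume: reconstructed from the info output above, the creation steps were roughly the following. I'm showing the 3.7-era attach-tier syntax here; adjust to whatever your CLI version expects. attach-tier should set the ctr/tier-mode options for you.)

# gluster volume create 1nvme-distrep3x2 replica 2 \
    n0:/rhgs/coldbricks/1nvme-distrep3x2 n1:/rhgs/coldbricks/1nvme-distrep3x2 \
    n2:/rhgs/coldbricks/1nvme-distrep3x2 n3:/rhgs/coldbricks/1nvme-distrep3x2 \
    n4:/rhgs/coldbricks/1nvme-distrep3x2 n5:/rhgs/coldbricks/1nvme-distrep3x2
# gluster volume start 1nvme-distrep3x2
# gluster volume attach-tier 1nvme-distrep3x2 replica 2 \
    n5:/rhgs/hotbricks/1nvme-distrep3x2-hot n4:/rhgs/hotbricks/1nvme-distrep3x2-hot \
    n3:/rhgs/hotbricks/1nvme-distrep3x2-hot n2:/rhgs/hotbricks/1nvme-distrep3x2-hot \
    n1:/rhgs/hotbricks/1nvme-distrep3x2-hot n0:/rhgs/hotbricks/1nvme-distrep3x2-hot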
I am attempting to run the 'smallfile' benchmark tool on this volume. The 'smallfile' tool creates a starting-gate directory and synchronization files in a shared filesystem location (the --network-sync-dir path). The first run (create) works as expected:
# smallfile_cli.py --threads 12 --file-size 4096 --files 300 --top /rhgs/client/1nvme-distrep3x2 --host-set c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11 --prefix test1 --stonewall Y --network-sync-dir /rhgs/client/1nvme-distrep3x2/smf1 --operation create
For the second run (read), I believe that smallfile first attempts the equivalent of an 'rm -rf' of the "network-sync-dir" path; this fails with ENOTEMPTY, causing the run to fail.
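If I'm reading the traceback below correctly, the failing step in sync_files.py amounts to something like this (my paraphrase of create_top_dirs(), not the actual source; only network_dir and the rmtree call are taken from the traceback):

import os
import shutil

def create_top_dirs(master_invoke, is_multi_host):
    # clear out any leftover sync dir from the previous run ...
    if os.path.exists(master_invoke.network_dir):
        shutil.rmtree(master_invoke.network_dir)   # <-- dies with ENOTEMPTY here
    # ... and recreate it as the starting gate for this run
    os.makedirs(master_invoke.network_dir)

Here is the second run and the traceback it produces: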
# smallfile_cli.py --threads 12 --file-size 4096 --files 300 --top /rhgs/client/1nvme-distrep3x2 --host-set c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11 --prefix test1 --stonewall Y --network-sync-dir /rhgs/client/1nvme-distrep3x2/smf1 --operation read
...
Traceback (most recent call last):
File "/root/bin/smallfile_cli.py", line 280, in <module>
run_workload()
File "/root/bin/smallfile_cli.py", line 270, in run_workload
return run_multi_host_workload(params)
File "/root/bin/smallfile_cli.py", line 62, in run_multi_host_workload
sync_files.create_top_dirs(master_invoke, True)
File "/root/bin/sync_files.py", line 27, in create_top_dirs
shutil.rmtree(master_invoke.network_dir)
File "/usr/lib64/python2.7/shutil.py", line 256, in rmtree
onerror(os.rmdir, path, sys.exc_info())
File "/usr/lib64/python2.7/shutil.py", line 254, in rmtree
os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/rhgs/client/1nvme-distrep3x2/smf1'
From the client's perspective, the directory is clearly empty:
# ls -a /rhgs/client/1nvme-distrep3x2/smf1/
. ..
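A short probe from the client shows the same contradiction programmatically. This is a hypothetical check, using the mount path from my setup above; while the bug is present it should print an empty listing and then True:

import errno, os
d = '/rhgs/client/1nvme-distrep3x2/smf1'
print(os.listdir(d))                        # expect [] -- readdir sees an empty directory
try:
    os.rmdir(d)
except OSError as e:
    print(e.errno == errno.ENOTEMPTY)       # expect True -- yet rmdir gets Errno 39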
And a quick search on the bricks shows that the hot tier on the last replica pair is the offender.
# for i in {0..5}; do ssh n$i "hostname; ls /rhgs/coldbricks/1nvme-distrep3x2/smf1 | wc -l; ls /rhgs/hotbricks/1nvme-distrep3x2-hot/smf1 | wc -l"; done
rhosd0
0
0
rhosd1
0
0
rhosd2
0
0
rhosd3
0
0
rhosd4
0
1
rhosd5
0
1
(For the record, multiple runs of this reproducer show that it is consistently the hot tier that is to blame, but it is not always the same replica pair.)
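For anyone digging into this, it may also help to inspect the stray entry directly on an offending hot brick (n4 or n5 in the run above; paths from my layout). An ls plus a getfattr dump of the gluster xattrs should tell us what the tier left behind; I have not captured that output here:

# ls -la /rhgs/hotbricks/1nvme-distrep3x2-hot/smf1/
# getfattr -d -m . -e hex /rhgs/hotbricks/1nvme-distrep3x2-hot/smf1/*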
Can someone try recreating this scenario to see if the problem is consistent? Please reach out if you need me to provide any further details.
Dustin Black, RHCA
Senior Architect, Software-Defined Storage