Dustin,

What code level is this? I often run smallfile on upstream code with tiered volumes and have not seen this. Sure, one of us will get back to you.

Unfortunately, gluster has a lot of protocol overhead (LOOKUPs), and it overwhelms the boost in transfer speed you would otherwise get for small files. A presentation at the Berlin gluster summit evaluated this. The expectation is that md-cache will go a long way towards helping with that before too long.

Dan

----- Original Message -----
> From: "Dustin Black" <dblack@xxxxxxxxxx>
> To: gluster-devel@xxxxxxxxxxx
> Cc: "Annette Clewett" <aclewett@xxxxxxxxxx>
> Sent: Tuesday, October 18, 2016 4:30:04 PM
> Subject: Possible race condition bug with tiered volume
>
> I have a 3x2 hot tier on NVMe drives with a 3x2 cold tier on RAID6 drives.
>
> # gluster vol info 1nvme-distrep3x2
> Volume Name: 1nvme-distrep3x2
> Type: Tier
> Volume ID: 21e3fc14-c35c-40c5-8e46-c258c1302607
> Status: Started
> Number of Bricks: 12
> Transport-type: tcp
> Hot Tier :
> Hot Tier Type : Distributed-Replicate
> Number of Bricks: 3 x 2 = 6
> Brick1: n5:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Brick2: n4:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Brick3: n3:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Brick4: n2:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Brick5: n1:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Brick6: n0:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Cold Tier:
> Cold Tier Type : Distributed-Replicate
> Number of Bricks: 3 x 2 = 6
> Brick7: n0:/rhgs/coldbricks/1nvme-distrep3x2
> Brick8: n1:/rhgs/coldbricks/1nvme-distrep3x2
> Brick9: n2:/rhgs/coldbricks/1nvme-distrep3x2
> Brick10: n3:/rhgs/coldbricks/1nvme-distrep3x2
> Brick11: n4:/rhgs/coldbricks/1nvme-distrep3x2
> Brick12: n5:/rhgs/coldbricks/1nvme-distrep3x2
> Options Reconfigured:
> cluster.tier-mode: cache
> features.ctr-enabled: on
> performance.readdir-ahead: on
>
> I am attempting to run the 'smallfile' benchmark tool on this volume. The
> 'smallfile' tool creates a starting gate directory and files in a shared
> filesystem location. The first run (write) works as expected.
>
> # smallfile_cli.py --threads 12 --file-size 4096 --files 300 --top
> /rhgs/client/1nvme-distrep3x2 --host-set
> c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11 --prefix test1 --stonewall Y
> --network-sync-dir /rhgs/client/1nvme-distrep3x2/smf1 --operation create
>
> For the second run (read), I believe that smallfile first attempts to 'rm
> -rf' the "network-sync-dir" path, which fails with ENOTEMPTY, causing the
> run to fail.
>
> # smallfile_cli.py --threads 12 --file-size 4096 --files 300 --top
> /rhgs/client/1nvme-distrep3x2 --host-set
> c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11 --prefix test1 --stonewall Y
> --network-sync-dir /rhgs/client/1nvme-distrep3x2/smf1 --operation create
> ...
> Traceback (most recent call last):
>   File "/root/bin/smallfile_cli.py", line 280, in <module>
>     run_workload()
>   File "/root/bin/smallfile_cli.py", line 270, in run_workload
>     return run_multi_host_workload(params)
>   File "/root/bin/smallfile_cli.py", line 62, in run_multi_host_workload
>     sync_files.create_top_dirs(master_invoke, True)
>   File "/root/bin/sync_files.py", line 27, in create_top_dirs
>     shutil.rmtree(master_invoke.network_dir)
>   File "/usr/lib64/python2.7/shutil.py", line 256, in rmtree
>     onerror(os.rmdir, path, sys.exc_info())
>   File "/usr/lib64/python2.7/shutil.py", line 254, in rmtree
>     os.rmdir(path)
> OSError: [Errno 39] Directory not empty: '/rhgs/client/1nvme-distrep3x2/smf1'
>
> From the client perspective, the directory is clearly empty.
>
> # ls -a /rhgs/client/1nvme-distrep3x2/smf1/
> . ..
>
> And a quick search on the bricks shows that the hot tier on the last replica
> pair is the offender.
>
> # for i in {0..5}; do ssh n$i "hostname; ls
> /rhgs/coldbricks/1nvme-distrep3x2/smf1 | wc -l; ls
> /rhgs/hotbricks/1nvme-distrep3x2-hot/smf1 | wc -l"; done
> rhosd0
> 0
> 0
> rhosd1
> 0
> 0
> rhosd2
> 0
> 0
> rhosd3
> 0
> 0
> rhosd4
> 0
> 1
> rhosd5
> 0
> 1
>
> (For the record, multiple runs of this reproducer show that it is
> consistently the hot tier that is to blame, but it is not always the same
> replica pair.)
>
> Can someone try recreating this scenario to see if the problem is consistent?
> Please reach out if you need me to provide any further details.
>
> Dustin Black, RHCA
> Senior Architect, Software-Defined Storage
> Red Hat, Inc.
> (o) +1.212.510.4138 (m) +1.215.821.7423
> dustin@xxxxxxxxxx

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
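For anyone who wants to try recreating this without standing up the full smallfile harness, below is a minimal, untested Python sketch of the create/unlink/rmtree cycle that the --network-sync-dir handling boils down to. The mount point, file count, and file size are assumptions lifted from the command line in the report, and it is an open question whether a single client is enough to hit the race, so treat it as a starting point rather than a faithful reproduction of smallfile's multi-host behaviour.

#!/usr/bin/env python
# Sketch only: approximates the pattern from the report, not smallfile itself.
import errno
import os
import shutil

MOUNT = '/rhgs/client/1nvme-distrep3x2'   # assumed fuse mount of the tiered volume
SYNC_DIR = os.path.join(MOUNT, 'smf1')    # plays the role of --network-sync-dir


def one_pass(nfiles=300, size=4096):
    os.mkdir(SYNC_DIR)
    for i in range(nfiles):
        with open(os.path.join(SYNC_DIR, 'gate.%d' % i), 'w') as f:
            f.write('x' * size)
    for name in os.listdir(SYNC_DIR):
        os.unlink(os.path.join(SYNC_DIR, name))
    # This is the step that fails in the report: the client sees an empty
    # directory, but a hot-tier brick apparently still holds an entry, so the
    # final rmdir returns ENOTEMPTY.
    shutil.rmtree(SYNC_DIR)


if __name__ == '__main__':
    for attempt in range(1, 21):
        try:
            one_pass()
        except OSError as e:
            if e.errno == errno.ENOTEMPTY:
                print('reproduced on pass %d: %s' % (attempt, e))
                break
            raise
    else:
        print('no ENOTEMPTY after 20 passes')

If a single client never trips it, the same loop could be launched in parallel from the c0-c11 clients against the same SYNC_DIR, which is closer to what smallfile actually does.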