Hello again. I've been working on getting FIO running on a FLEX drive, and I've been accumulating a laundry list of features that would be nice to have, plus one that is necessary. I also found one bug while experimenting with various FIO options, which I'll include in the list. Here are the changes I'd appreciate seeing, starting with the most desired.

1. --ignore_device_size

The current FLEX protocol maps storage to sectors/bytes beyond the reported capacity of the drive. I'd like to run FIO on these out-of-bounds sectors, but right now I can't because I get the error "you need to specify valid offset=". Would it be possible to add a flag that lets users run IO outside of the reported device capacity? Without this I believe that FIO cannot run on the SMR portions of any FLEX drive.

2. --readwrite=write_pointer_randwrite:offset,io_size:offset,io_size...

In write pointer zones, all writes must land at the write pointer, so truly random IO is not possible. However, it would be useful to run random-like workloads in which randomly chosen zones are written sequentially. This would be very similar to the random_distribution=zoned_abs argument, except that instead of writing randomly within a zone, FIO would write to a specific offset, increment that offset by the amount written, and stop writing to that zone entirely once the incrementing offset reached the end of the zone. So if I had three zones I wanted to write to, each 128 MiB long, I could specify something like --readwrite=write_pointer_randwrite:0,128m:128m,128m:256m,128m. It might also make sense to add distribution percentages like in the zoned_abs argument, although I'm not entirely sure what you would do with those percentages once a zone was fully written and thus could not be picked anymore.

3. --readwrite=write_pointer_randrw:offset,io_size:offset,io_size...

Additionally, in write pointer zones all reads must fall below the write pointer, so random read IO is restricted. This is why I requested the random_distribution=zoned_abs argument: it works quite well for issuing random reads to write pointer zones. However, I would like to read and write "randomly" to write pointer zones so that I can more easily control the read/write ratio, and so that I can read data that was written during the same FIO run. (Currently I can use random_distribution=zoned_abs to read randomly from the beginning of the zone up to where the write pointer was at the start of the FIO run, but I cannot read further after FIO advances the write pointer.) This workload would write randomly as described above, and read between a zone's starting offset and its incremented offset. So before any writes had gone to a zone, no random reads could be issued to it. A sketch of the selection logic I have in mind for these two modes appears after this list.

4. Automatic zone detection with the above two readwrite modes

I believe this would be quite a bit of work, but it would be nice to be able to specify the previous two workload types without explicitly listing the zones. Instead the user could specify offset and size as normal, additionally specify the zone number (perhaps through a new option, or perhaps with an extended syntax in the readwrite option), and FIO would get the zones and randomly perform write-pointer-legal IO within all the zones covered by the user's offset and size. And if the user specified a drive area that contains non-write-pointer zones, FIO would just do normal IO there.
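To make items 2 and 3 concrete, here is a minimal Python sketch of the zone-selection behaviour I'm describing. This is an illustration only, not fio code; the zone layout, block size, and read fraction below are assumptions:

```python
import random

# Illustration of the intended semantics for something like
# --readwrite=write_pointer_randrw:0,128m:128m,128m:256m,128m (3 zones).
ZONE_SIZE = 128 * 1024 * 1024   # assumed zone size
BLOCK_SIZE = 4096               # assumed IO size

# Each zone tracks its start, its end, and a simulated write pointer.
zones = [{"start": off, "wp": off, "end": off + ZONE_SIZE}
         for off in (0, ZONE_SIZE, 2 * ZONE_SIZE)]

def next_io(read_fraction=0.5):
    """Pick the next write-pointer-legal IO."""
    writable = [z for z in zones if z["wp"] < z["end"]]    # not yet full
    readable = [z for z in zones if z["wp"] > z["start"]]  # has written data
    if readable and (not writable or random.random() < read_fraction):
        # Reads are confined to [zone start, write pointer), block aligned.
        z = random.choice(readable)
        n_blocks = (z["wp"] - z["start"]) // BLOCK_SIZE
        return ("read", z["start"] + random.randrange(n_blocks) * BLOCK_SIZE)
    if writable:
        # Writes land exactly at the write pointer, which then advances;
        # a full zone drops out of the writable candidates for good.
        z = random.choice(writable)
        offset = z["wp"]
        z["wp"] += BLOCK_SIZE
        return ("write", offset)
    return None  # only reachable if no zones were configured

for _ in range(10):
    print(next_io())
```

For write_pointer_randwrite the read branch would simply be dropped. The essential behaviour is that writes always land at a randomly chosen zone's write pointer, and a zone stops being picked once that pointer reaches the zone's end.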
It might also be possible for me to help with the implementation of this, if that would be something you'd be interested in.

5. Bug report: --percentage_random sequential behaviour

It seems that sequential IO is increasing but not actually sequential when using the --percentage_random option. Running the following FIO job:

fio --name=rand_reads_seq_writes --ioengine=libaio --direct=1 --exitall --thread --filename=/dev/sdf --runtime=30 --readwrite=randrw --iodepth=1 --percentage_random=100,0 --norandommap --output-format=terse

results in an even distribution of reads, as expected, but writes that are increasing rather than sequential. Here's an example of the writes I am seeing when running this job:

First 20 writes (sector, sectors written): [(0, 8), (3048, 8), (3056, 8), (3064, 8), (3072, 8), (3080, 8), (3088, 8), (6408, 8), (7000, 8), (13440, 8), (13496, 8), (13648, 8), (13768, 8), (13920, 8), (14288, 8), (14400, 8), (16376, 8), (18824, 8), (18936, 8), (19832, 8)]

Here is my environment information:

# cat /etc/centos-release
CentOS Linux release 7.3.1611 (Core)
# uname -r
3.10.0-514.21.1.el7.x86_64

I saw the same behaviour on fio-3.2 and on fio-3.2-81-g3e262, which was the newest version I could see as of today. So I see some bursts of sequential writes, but mostly the writes seem to be skipping around. I've attached a Python 3.6 script that will run this workload and collect the IO information using blktrace/blkparse; a quick contiguity check on its output is sketched after the script. To run the script, use the -h flag to see usage, but at a minimum you'll need to give the device handle to run on as the first argument.

Thank you for your help, and let me know if you decide to add these features or if I need to provide any further information.
```python
import re
import subprocess
import time
import sys
import math
import argparse

arg_parser = argparse.ArgumentParser()
arg_parser.add_argument("drive_handle", help="Drive handle to test")
arg_parser.add_argument("-rt", "--runtime", default=30, help="Time to run workload")
arg_parser.add_argument("-sbp", "--save_block_parse", action="store_true",
                        help="Save blkparse output to blkparse_output.txt if flag is set")
arg_parser.add_argument("-fp", "--fio_path", default="fio",
                        help="The path to the FIO executable to run")
args = arg_parser.parse_args()

dev_handle = args.drive_handle

blktrace = subprocess.Popen(["blktrace", dev_handle, "-o", "-"],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# blktrace needs a little time to get set up
time.sleep(1)

# Start FIO job
fio_string = (args.fio_path + " --name=rand_reads_seq_writes --ioengine=libaio --direct=1 --exitall"
              " --thread --filename=" + dev_handle + " --runtime=" + str(args.runtime) +
              " --readwrite=randrw --iodepth=1 --percentage_random=100,0 --norandommap "
              "--output-format=terse")
print("Running " + fio_string)
cmd_ret = subprocess.run(fio_string.split(' '), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
if cmd_ret.stderr != b"":
    print("FIO errors:")
    print(cmd_ret.stderr.decode(sys.stderr.encoding))
print("FIO stdout:")
print(cmd_ret.stdout.decode(sys.stdout.encoding))

# Terminate is how blktrace expects to end, don't use kill or you'll lose commands near the end
blktrace.terminate()
try:
    stdout, stderr = blktrace.communicate(timeout=20)
except subprocess.TimeoutExpired:
    blktrace.kill()
    stdout, stderr = blktrace.communicate()
print("blktrace errors:")
print(stderr)
# This will give you the raw blktrace output
# print(stdout)

blkparse_format_str = '%D %2c %8s %5T.%9t %5p %2a %3d command = %C sectors = %S block_num = %n\n'
blkparse_ret = subprocess.run(["blkparse", "-i", "-", "-f", blkparse_format_str, "-a", "issue"],
                              input=stdout, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print("blkparse errors:")
print(blkparse_ret.stderr)
# print(blkparse_ret.stdout)
blkparse_str = blkparse_ret.stdout.decode(sys.stdout.encoding)
if args.save_block_parse:
    with open("blkparse_output.txt", 'w') as output_file:
        output_file.write(blkparse_str)

# Parse blktrace result into bins
blkline_re = re.compile(r"(\d+,\d+)\s+(\d+)\s+(\d+)\s+(?P<timestamp>\d+\.\d+)\s+(\d+)\s+D\s+"
                        r"(?P<type>(R|W)S?)\s+command = fio\s+sectors = (?P<sector>\d+)\s"
                        r"block_num = (?P<block_num>\d+)")
total_reads = 0
avg_lba = 0
max_lba = 0
min_lba = None
# Parse out the sectors from the blktrace output to get some preliminary statistics
match_iter = blkline_re.finditer(blkparse_str)
for match_obj in match_iter:
    if match_obj.groupdict()["type"][0] == 'R':
        sector_num = int(match_obj.groupdict()["sector"])
        total_reads += 1
        avg_lba += sector_num
        if min_lba is None or sector_num < min_lba:
            min_lba = sector_num
        if sector_num > max_lba:
            max_lba = sector_num
print("total reads = " + str(total_reads))
print("avg: {:.2f}, min: {}, max: {}".format(avg_lba / total_reads, min_lba, max_lba))

hist_num = 20
hist_bins = [0] * hist_num
hist_div = max_lba / hist_num
hist_edges = []
for ind in range(hist_num):
    hist_edges.append(hist_div * (ind + 1))

# Sort the data into a read histogram and a write list
write_list = []
match_iter = blkline_re.finditer(blkparse_str)
for match_obj in match_iter:
    sector_num = int(match_obj.groupdict()["sector"])
    if match_obj.groupdict()["type"][0] == 'R':
        hist_ind = math.floor(sector_num / hist_div)
        if hist_ind == hist_num:
            hist_ind -= 1
        hist_bins[hist_ind] += 1
    # Assume non-reads are writes
    else:
        block_num = int(match_obj.groupdict()["block_num"])
        write_list.append((sector_num, block_num))

hist_perc = []
for hist_bin in hist_bins:
    hist_perc.append(100 * hist_bin / total_reads)
print("read histogram bins = " + str(hist_bins))
print("read histogram percents = " + str(hist_perc))
print("read histogram edges = " + str(hist_edges))

num_write_print = 20
print("First {} writes (sector, sectors written)".format(num_write_print))
print(write_list[:num_write_print])

# print FIO version
cmd_ret = subprocess.run([args.fio_path, "-v"])
```
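For reference, here is a small hypothetical helper (not part of the attached script) that quantifies how sequential the captured writes are, using the write_list the script builds:

```python
def contiguous_fraction(write_list):
    """Fraction of writes that start exactly where the previous one ended.
    write_list holds (start_sector, sectors_written) tuples as collected
    above; fully sequential writes would give a value near 1.0."""
    if len(write_list) < 2:
        return 0.0
    hits = sum(1 for prev, cur in zip(write_list, write_list[1:])
               if cur[0] == prev[0] + prev[1])
    return hits / (len(write_list) - 1)

# On the 20 writes quoted in the report this comes out to 5/19 (about 0.26),
# even though --percentage_random=100,0 should make the writes fully sequential.
print(contiguous_fraction(write_list))
```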