Paul is having trouble sending this to the reflector; let's see if this works.

-------- Forwarded Message --------
Subject: Fio Checksum tracking and enhanced trim workloads
Date: Sun, 7 May 2017 23:54:16 -0400
From: paul houlihan <phoulihan9@xxxxxxxxx>
To: fio@xxxxxxxxxxxxxxx, Jens Axboe <axboe@xxxxxxxxx>

I have a submission for fio that enhances its data corruption detection and diagnosis capabilities, taking fio from pretty good corruption detection to absolute guarantees. I would like these changes (on the tracking branch?) to be reviewed and considered for inclusion in fio. A quick review would be helpful as I am losing access to test systems shortly.

These changes were used by a virtual machine caching company to assure data integrity. Most testing was on Linux 64-bit and Windows 32/64-bit. The Windows build still had an issue with compile-time asserts in libfio.c that I worked around by commenting out the asserts, as this looked like a performance restriction; this should be researched further. Initial development was on fio 2.2.10 sources; I have since ported the changes to the latest fio sources and tested on Linux, but have not yet retested on Windows. No testing was done on the other OSes fio supports, although the changes are almost exclusively to OS-independent code.

The absolute guarantees come from tracking checksums, which prevents a stale but intact prior version of a block from being returned, and from verifying all reads. I was surprised to learn how often fio performs concurrent I/O to the same blocks, which yields indeterminate results that prevent data integrity verification; thus a number of options are not supported when tracking is enabled. Finally, I have enhanced the handling of trims and am able to verify the data integrity of these operations in an integrated fashion.
Here is a list of changes in this submission:

* Fixed a bug where the expected version of a verify_interval was not generated correctly; the dummy io_u was not set up correctly.
* Fixed a reference to an unknown header_interval option in the HOWTO; fixed a bunch of typos.
* Fixed a bug where Windows hangs on nano sleep on Windows 7.
* The stonewall= option also does not seem to work on Windows 7 (it appears fixed in later releases), so I painfully worked around this by having separate init and run fio scripts. No change was made here; just mentioning it in passing.
* Fixed a bug where FD_IO logging was garbled in io_c.h. Here is an example of the logging problem:

  io 2212 io complete: io_u 0x787280: off=1048576/len=2097152/ddir=0io 2212 /b.datio 2212
  io 2212 fill_io_u: io_u 0x787280: off=3145728/len=2097152/ddir=1io 2212 /b.datio 2212
  io 2212 prep: io_u 0x787280: off=3145728/len=2097152/ddir=1io 2212 /b.datio 2212
  io 2212 ->prep(0x787280)=0
  io 2212 queue: io_u 0x787280: off=3145728/len=2097152/ddir=1io 2212 /b.datio 2212

* To make fio a superb data integrity test tool, a number of shortcomings were addressed. The new verify_track switch enables in-memory tracking of checksums within each fio job, preventing a block from rolling back to a prior version. The in-memory checksums can be written to a tracking log file to provide absolute checksum guarantees between fio jobs or between fio runs. Verification of trim operations is supported in an integrated fashion. See the HOWTO descriptions of verify_track, verify_track_log, verify_track_required, verify_track_dir, and verify_track_trim_zero.
* An enhanced description of corruption handling was added to the HOWTO, along with some corruption analysis tools.
* On a bad header, the received buffer is now dumped into *.received before the error message is issued.
* If verify_interval is less than the block size, fio will now always dump the complete buffer in an additional file called *.complete. Seeing the whole buffer can reveal more about the corruption pattern.
* Changed the printing of the hex checksum to display in MSB-to-LSB order, to facilitate comparisons with memory dumps and debug logging.
* Added a dump of the complete returned buffer on trim write verification failure.
* Debug logging was being truncated at the end of a job so you could not see the full set of debug log messages; added a log flush at the end of each job when the debug= switch is used.
* rw=readwrite has independent last_pos read/write pointers as you sequentially access the file. If the mix is 50/50, fio can end up reading and writing the same block as the read and write pointers cross each other, which is not reliably verifiable. This pattern's result is chaos, and it contradicts all the other sequential patterns and even randrw. Overlapping I/O makes little sense and is usually a sign of a broken application. Moreover, the readwrite workload would not complete a sequential pass over the entire file, which everyone I spoke to assumed it was doing. So a change was made to the existing readwrite workload: the maximum of the file's last_pos pointers for DDIR_READ and DDIR_WRITE is now used when selecting the next offset during a sequential scan. If the old behavior is somehow useful, an option can be added to preserve it; if preserved, it should never be the default and should disable verification. My changes revolve around maintaining the last_pos array in a special way: when multiple operation types (read/write/trim) are requested by a workload, any change to the last position is reflected in all three entries of the array, so a randomly selected next operation always uses the right last_pos. The old behavior is retained for single-operation workloads and for trimwrite, which operates like a single-operation workload.
* Synchronous trim I/O completions were not updating bytes_issued in backend.c, so trimwrite was actually making two passes over the file.
* I kept the new verify_track verification entirely separate from the experimental_verify code. These new tracking changes provide fully persistent verification of trims integrated into standard verify, so we might want to consider deprecating support for experimental_verify. Note that verify_track and experimental_verify cannot both be enabled.
* With the wide adoption of thin LUN datastores and recently expanded OS support for trim operations to reclaim unused space, testing trim operations in a wide variety of contexts has become a necessity. Added some new trim I/O workloads alongside the existing trim workloads; these require the verify_track option to verify:

  trim           Sequential trims
  readtrim       Sequential mixed reads and trims
  writetrim      Sequential mixed writes and trims. Each block will be trimmed or written.
  readwritetrim  Sequential mixed reads/writes/trims
  randtrim       Random trims
  randreadtrim   Random mixed reads and trims
  randwritetrim  Random mixed writes and trims
  randrwt        Random mixed reads/writes/trims

* A second change to existing fio functionality involves an inconsistency in counting read verification bytes against the size= argument. Some rw= workloads count read verification I/Os or bytes against size= (like readwrite and randrw) and some, like write, trim and trimwrite, do not. Counting read verification bytes makes it hard to predict the number of bytes or I/Os that will be performed in the readwrite workload, and the new rw= workloads increase the unpredictability with even more read verifications in a readwritetrim workload. Normally I expect fio to process all the bytes in a file pass, but when the bytes from read verifies count towards the total bytes to process in size=, only part of the file is processed. So I made it consistent for size and io_limit by not counting read verify bytes. One could argue that number_ios= could be similarly changed, but I left it alone; it still uses raw I/O counts, which include read verification I/Os.
Another justification is that this_io_bytes never records verification reads for the dry_run, and we need dry_run and do_io to be in sync. Note this explains why I removed the code that added extra bytes to total_bytes in do_io for verify_backlog.
* The processing of TD_F_VER_NONE seems backwards from its name. If verify != VERIFY_NONE then the bit is set, but the name implies it should be clear. It now sets the bit only if verify == VERIFY_NONE, avoiding this very confusing state.
* Added a sync and invalidate after the close in iolog.c ipo_special(). This is needed if you capture checksums in the tracking log and a close is followed immediately by an open. The close is not immediate if iodepth is set to a large number: the file is still marked "open" but "closing" on return from the close, and will only actually close after the last I/O completes. The sync avoids the assert on trying to open an already open file that has a close pending.
* --read_iolog does not support trims at this time.
* io_u.c get_next_seq_offset() seems to suggest that ddir_seq_add can be negative, but there are a number of unhandled cases with such a setting. Added TODOs to document the issues. I have a number of reservations about the correctness of get_next_seq_offset(). Note that whenever I saw a possible problem in the code but did not have time to research it, I added a TODO comment.
* io_u.c get_next_seq_offset() has a problem where it uses absolute values when relative values are what is being manipulated, so this code:

  if (pos >= f->real_file_size)
          pos = f->file_offset;

  should be:

  if (pos >= f->io_size)
          pos = 0;

* Given there are a couple of changes to existing fio workload behavior, you might want to consider going to a V3.0.

Here are the two new sections on Verification Tracking and Data Corruption Troubleshooting from the HOWTO:

Verification Tracking
---------------------

Absolute data integrity guarantee is the primary mission of a storage software/hardware subsystem.
Fio is good at detecting data corruption, but there are gaps. Currently workload reads are verified only when the rw option is set to a read-only pattern. It is desirable to validate all reads in addition to writes, to protect against data rolling back to earlier versions. With the addition of the block's offset to the header in recent fio releases, block data returned for another block will be flagged as corrupt. However, a limitation of the fio header and embedded data checksums is that fio cannot detect if a prior intact version of a block was returned on a read: if the header and data checksum match, the block is declared valid. These limitations can be addressed by setting the verify_track option, which allocates a memory array tracking the header and data checksums so that the data integrity assurance is absolute. The array starts out empty at the beginning of each fio job and is filled in as reads or writes occur; once an entry is defined, the checksums from succeeding I/Os must all match it. This option extends checksum verification to all reads in all workloads, not just the read-only workloads. However, use of verify_track requires that fio avoid overlapping, concurrent reads and writes to the same block. Reading and writing a block at the same time yields indeterminate results and makes guaranteeing data integrity impossible, so some fio options where this is a risk are disabled when using verify_track; see the verify_track argument for the list of restrictions. Even better verification would validate data more persistently. You would like to track checksums persistently between fio jobs or between runs of fio, which could be after a shutdown/restart of the system or on a different system that shares storage. Proving seamless data integrity from the application perspective over complex failover and recovery situations, like reverting a virtual machine to a prior snapshot, is quite valuable.
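As a concrete illustration, here is a minimal job file sketch of the tracking options described above. This assumes the verify_track* option names from this patch; the file name, sizes, and checksum type are illustrative, not prescriptive:

```ini
; Sketch: in-memory checksum tracking, persisted across jobs via the
; tracking log (options from this patch; names/sizes are illustrative).
[global]
filename=/tmp/trackdata
size=64m
bs=4k
verify=crc32c
verify_interval=4k
verify_track=1        ; per-verify_interval checksum table in memory
verify_track_log=1    ; save/restore the table in trackdata.tracking.log

[populate]
rw=write

[verify-later]
stonewall             ; required: concurrent jobs may not share files
rw=randread           ; every read is checked against the restored table
```

The stonewall in the second job reflects the restriction, noted below, that files cannot be shared between concurrently running jobs when tracking is enabled.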
Also, the popularity of thin LUNs in the storage world causes problems if unused disk space is not reclaimed by use of trims. So we would like the ability to mix and match trims with reads and writes. The rw option now supports a full set of combinations, and the rwtmix=read%,write%,trim% option allows specifying the mix percentages of all three types of I/O in one argument. However, trims do have special requirements, as documented under the rw option. Finally, we would like to verify trim operations: if you read a trimmed block before re-writing it, it should return a block of zeroes. The verify_track_log option permits persistent checksum tracking and verification of trims by saving the tracking array to a tracking log when a data file is closed at the end of a fio job, and reading it back in at the next start. A clean shutdown of fio is needed for the tracking log to be persistent. When no errors occur, checksum context is automatically preserved between fio jobs and fio runs. On revert of a virtual machine snapshot, if the tracking log is restored from the time of the snapshot then checksum context is again preserved. There is a tracking log for each data file. The tracking log filename format is:

  [dir]/[filename].tracking.log

where:

  filename - the name of the file system file, or the block device name like "sdb"
  dir      - the log directory, which defaults to the directory of the data file. For block devices, dir defaults to the process's current default directory.

The tracking log is plain text. It contains data recorded when it was first created: the name of the data file it is tracking, the size of the data file, the starting file offset for I/Os, and its verify_interval option setting. From the last save of the log it has: the timestamp of the last save and a checksum of the tracking log contents. For checksums, bit 0 = 1 marks a valid checksum; bit 0 = 0 marks special-case entries (dddddddc indicates a trimmed block, 0 an undefined entry).
Tracking log example, with "--" comments added:

  $ cat xxx.tracking.log
  Fio-tracking-log-version: 1
  DataFileName: xxx
  DataFileSize: 2048
  DataFileOffset: 0
  DataFileVerifyInterval: 512
  TrackingLogSaveTimestamp: 2017-02-23T14:25:32.446981
  TrackingLogChecksum: cae34cd8
  VerifyIntervalChecksums:
  4028ab33    -- checksums from read or write of 3 blocks, bit 0 = 1
  a450bffb
  81858a3
  dddddddc    -- trimmed block, bit 0 = 0
  0           -- undefined entry, never been accessed, bit 0 = 0
  $

The tracking arguments are:

verify_track=bool - enables checksum tracking in memory.
verify_track_log=bool - enables saving and restoring of the tracking log.
verify_track_required=bool - By default fio will create a log on the fly: if a log is found at the start it is read and then the log file is deleted, and if any error occurs during the fio run then the tracking log is not written on close, so compromised logs do not cause false failures. However, testing that requires absolute data integrity guarantees will want to use this option to require that the tracking log always be present between fio jobs or at the start of a new fio run.
verify_track_dir=str - specifies the directory in which to place all tracking logs. When evaluating the data integrity of a device, it is advisable to place the tracking log on a different, more trusted device.
verify_track_trim_zero=bool - when no tracking array entry exists, this option allows a zeroed block from a prior fio run to be treated as previously trimmed instead of as data corruption. Once the array entry for a block is defined, this option is no longer used, as the array entry determines the required verification.
debug=chksum - a new debug option that traces all checksum entry additions/changes to the tracking array, and entry use in verification.

There are a couple of considerations to be aware of when using the tracking log. The tracking log is sticky.
If you change options such that the tracking log no longer matches the data layout (the size=, offset= or verify_interval= options), you will receive a persistent error until the tracking log is recreated. You do get a friendly error indicating which tracking log file to delete to start with a fresh tracking log. Note that if a fio run fails with other errors, the tracking log is discarded so that stale checksums do not cause false failures on subsequent runs. The tracking log uses 4 bytes to track each verify_interval block in the data file or block device, i.e. 4 * (size / verify_interval) bytes; for example, size=1T with verify_interval=4k needs 4 * (1T / 4k) = 1 GB. So there are scaling implications for memory usage and log file size. However, blocks are only tracked for the active I/O range, offset to (offset+size-1). The performance impact of the few extra I/Os to read and write the tracking log between fio jobs and fio runs is negligible, since one is not usually verifying data when doing performance studies. There is no overhead when verify tracking is disabled, and no extra I/Os when verify_track_log is disabled.

Data Corruption Troubleshooting
-------------------------------

When a corruption occurs, immediate analysis can reveal many clues as to the source of the corruption. Is the corruption persistent? In memory and on disk? The exact pattern of the corruption is often revealing: at the beginning of an I/O block? Sector aligned? All zeroes or garbage? What is the exact range of the corruption? Is the corruption a stale but intact prior version of the block? When a corruption is detected, three possible corrupt data files are created:

  *.received - the corrupt data, which may be a verify_interval block within the full block used in the I/O
  *.complete - the full block used in the I/O
  *.expected - if the block's header is intact, the expected data pattern generated for the *.received block

Two scripts exist in the analyze directory to assist in analysis:

  corruption_triage.sh - a bash script containing a sequence of diagnostic steps
  fio_header.py        - a python script that displays the contents of the block header in a corrupt data file

Here are the related parameter descriptions from the HOWTO:

option verify_track=bool

Fio normally verifies data within a verify_interval with checksums and file offsets embedded in the data. However, a prior version of a block could be returned and verified successfully. When verify_track is enabled, the checksum for every verify_interval in the file is stored in a table, and all read data must match the checksums in the table. The tracking table is sized as (size / verify_interval) * 4 bytes; for very large size= option settings, such a large memory allocation may impact testing. Reads assume that the entire file has been previously written with a verification format using the same verify_interval. When verify_track is enabled, all reads are verified, whether writes are present in the workload or not. Sharing files between threads within a job is supported, but not between jobs running concurrently, so use the stonewall option when more than one non-global job is present. Verification of trimmed blocks is described under the verify_track_trim_zero option. When disabled, fio falls back on the verification described under the verify option.
The restrictions when enabling the verify_track option are:

- randommap is required
- softrandommap is not supported
- the lfsr random generator is not supported when using multiple block sizes
- the stonewall option is required when more than one job is present
- the file size must be an even multiple of the block size when iodepth > 1
- verify_backlog is not supported when iodepth > 1
- verify_async is not supported
- file sharing between concurrent jobs is not supported
- numjobs must be 1
- io_submit_mode must be set to "inline"
- verify=null and verify=pattern are not supported
- verify_only is not supported
- supplying a sequence number with the rw option is not supported
- experimental_verify is not supported

Defaults to off. You can enable verify_track for individual jobs; each job will start with an empty table, which is filled in as each block is initially read or written and enforced on subsequent reads within the job. For persistent tracking of checksums between jobs or fio runs, see verify_track_log.

option verify_track_log=bool

If set when verify_track is set, then on a clean shutdown fio writes the checksum for each data block that has been read or written to a log named (datafilename).tracking.log. If set when fio reopens this data file and a tracking log exists, the checksums are read into the tracking table and used to validate every subsequent read. This allows rigorous validation of data integrity as data files are passed between fio jobs, or across a termination of fio and restart on the same system or on another system, or after an OS reboot. Reverting a virtual machine to a snapshot can be tested by saving the tracking log after a successful fio run and later restoring the saved log after reverting the virtual machine. The log is deleted after being read in, so on abnormal termination no stale checksums can be used.
This option, the data file size, and the verify_interval parameter should not change between jobs in the same run or on restart of fio. Defaults to off. verify_track_dir defines the tracking log's directory.

option verify_track_required=bool

If set when verify_track_log is set, the tracking log for each file must exist at the start of a fio job or an error is returned. Defaults to off, which is appropriate for the first job in a new fio run; subsequent jobs in that run can require use of the tracking log. If set to off, any tracking log found will be used, otherwise an empty tracking table is used. If a prior fio run created a tracking log for the data file, then all jobs can require use of the tracking log.

option verify_track_dir=str

If verify_track_log is set, this defines the single directory for all tracking logs. The default is to use the directory where each data file resides. When filename points to a block device or pipe, the directory defaults to the current process default directory. To assure the data integrity of the tracking log, each tracking log also contains its own checksum. However, when checking a device for data integrity it is advisable to place the tracking logs containing checksums on a different, more trusted device.

option verify_track_trim_zero=bool

Typically a read of a trimmed block that has not been re-written will return a block of zeros. If set with verify_track enabled, all zeroed blocks with no tracking information are assumed to have resulted from a trim; if clear, zeroed blocks are treated as corruption. If your device does not return zeroed blocks for reads after a trim then it cannot participate in tracking verification. Fio sets this to 1 if trims are present in the rw argument and defaults to 0 otherwise. You would only set this explicitly when verify_track is enabled, trims are not specified in the rw argument, and a prior fio job or run performed trims.

option readwrite=str, rw=str

Type of I/O pattern.
Accepted values are:

  read          Sequential reads
  write         Sequential writes
  randwrite     Random writes
  randread      Random reads
  rw,readwrite  Sequential mixed reads and writes
  randrw        Random mixed reads and writes

Trim I/O has several requirements:

- File system and OS support varies, but Linux block devices accept trims. You need privilege to write to a Linux block device. See the example fio file: track-mem.fio
- Often a minimal block size is required. Linux on VMware requires trims of at least 1 MB in size, aligned on a 1 MB boundary.
- VMware requires a minimum VM OS hardware level of 11.
- Verifying trim I/Os requires verify_track.

The trim I/O patterns are:

  trim               Sequential trims
  readtrim           Sequential mixed reads or trims
  trimwrite          Sequential mixed trim then write. Each block will be trimmed first, then written to.
  writetrim          Sequential mixed writes or trims. Each block will be trimmed or written.
  rwt,readwritetrim  Sequential mixed reads/writes/trims
  randtrim           Random trims
  randreadtrim       Random mixed reads or trims
  randwritetrim      Random mixed writes or trims
  randrwt            Random mixed reads/writes/trims

Fio defaults to read if the option is not specified. For the mixed I/O types, the default is to split them 50/50. For certain types of I/O the result may still be skewed a bit, since the speed may be different. It is possible to specify a number of I/Os to do before getting a new offset by appending ``:[nr]`` to the end of the string given. For a random read, it would look like ``rw=randread:8`` for passing in an offset modifier with a value of 8. If the suffix is used with a sequential I/O pattern, the value specified will be added to the generated offset for each I/O; for instance, ``rw=write:4k`` will skip 4k for every write, turning sequential I/O into sequential I/O with holes. See the :option:`rw_sequencer` option. Storage array vendors often require trims to use a minimum block size.
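Putting the trim requirements above together, here is a hedged job-file sketch of a mixed read/write/trim workload. It assumes this patch's randrwt pattern and the rwtmix option described next; the device name is illustrative, and the 1 MB block size reflects the VMware-on-Linux minimum noted above:

```ini
; Sketch: random mixed reads/writes/trims on a block device (options
; from this patch; device name illustrative; run with write privilege).
[global]
filename=/dev/sdb
bs=1m                 ; Linux on VMware needs >= 1 MB trims, 1 MB aligned
size=1g
verify=crc32c

[rwt-mix]
rw=randrwt            ; random mixed reads/writes/trims
rwtmix=60,30,10       ; 60% reads, 30% writes, 10% trims (must total 100)
verify_track=1        ; required to verify the trim I/Os
```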
option rwtmix=int[,int][,int]

When trims along with reads and/or writes are specified in the rw option, this is the preferred argument for specifying mix percentages. The argument is of the form read,write,trim and the percentages must total 100. Any value may be left empty to keep its default from the rwmix* arguments (50,50,0). If a trailing comma isn't given, the remainder will inherit the last value set.

--
To unsubscribe from this list: send the line "unsubscribe fio" in the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html