Re: [PATCH 1/2] block: Add support for atomic writes

Quoting Jeff Moyer (2013-11-05 12:43:31)
> Chris Mason <chris.mason@xxxxxxxxxxxx> writes:
> 
> > This allows filesystems and O_DIRECT to send down a list of bios
> > flagged for atomic completion.  If the hardware supports atomic
> > IO, it is given the whole list in a single make_request_fn
> > call.
> >
> > In order to limit corner cases, there are a few restrictions in the
> > current code:
> >
> > * Every bio in the list must be for the same queue
> >
> > * Every bio must be a simple write.  No trims or reads may be mixed in
> >
> > A new blk_queue_set_atomic_write() sets the number of atomic segments a
> > given driver can accept.
> >
> > Any number greater than one is allowed, but the driver is expected to
> > do final checks on the bio list to make sure a given list fits inside
> > its atomic capabilities.
> 
> Hi, Chris,
> 
> This is great stuff.  I have a couple of high level questions that I'm
> hoping you can answer, given that you're closer to the hardware than
> most.  What constraints can we expect hardware to impose on atomic
> writes in terms of size and, um, contiguousness (is that a word)?  How
> do we communicate those constraints to the application?  (I'm not
> convinced a sysfs file is adequate.)
> 
> For example, looking at NVMe, it appears that devices may guarantee that
> a set of /sequential/ logical blocks may be completed atomically, but I
> don't see a provision for disjoint regions.  That spec also
> differentiates between power fail write atomicity and "normal" write
> atomicity.

Unfortunately, it's hard to say.  I think the Fusion-io cards are the
only shipping devices that support this, but I've definitely heard that
others plan to support it as well.  MariaDB and Percona already support
atomic writes via Fusion-io specific ioctls, and turning that into a real
O_ATOMIC is a priority so other hardware can just hop on the train.

This feature in general is pretty natural for the log structured squirrels
they stuff inside flash, so I'd expect everyone to support it.  Matthew,
how do you feel about all of this?

With the Fusion-io drivers, we've recently increased the max atomic size.
It's basically 1MB, and it doesn't matter whether the regions are disjoint
or contiguous.  We're power-cut safe at 1MB.

> 
> Basically, I'd like to avoid requiring a trial and error programming
> model to determine what an application can expect to work (like we have
> with O_DIRECT right now).

I'm really interested in ideas on how to provide that.  But, with dm,
md, and a healthy assortment of flash vendors, I don't know how...

I've attached my current test program.  The basic idea is to fill each
1MB buffer with a single random value repeated through the whole buffer.
Each write uses a different random value.

You let it run for a while and then pull the plug.  After the box comes
back up, run the program again and it looks for consistent patterns
filling each 1MB aligned region in the file.
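
To illustrate the check in isolation, here is a minimal in-memory sketch
(not the attached program).  It assumes a torn write leaves a prefix of
the new data on top of the old contents of the region.  The names
fill_region, region_is_torn and simulate_write are hypothetical and exist
only for this illustration.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define REGION_SIZE (1024 * 1024)

/* Fill len bytes with repeated copies of seq. */
void fill_region(unsigned char *buf, size_t len, unsigned long seq)
{
	size_t off;

	for (off = 0; off + sizeof(seq) <= len; off += sizeof(seq))
		memcpy(buf + off, &seq, sizeof(seq));
}

/* Return 1 if the region holds more than one pattern, 0 otherwise. */
int region_is_torn(const unsigned char *buf, size_t len)
{
	unsigned long first, cur;
	size_t off;

	memcpy(&first, buf, sizeof(first));
	for (off = sizeof(first); off + sizeof(cur) <= len; off += sizeof(cur)) {
		memcpy(&cur, buf + off, sizeof(cur));
		if (cur != first)
			return 1;
	}
	return 0;
}

/*
 * Pretend only the first `landed` bytes of a new 1MB write reached the
 * media before the power was cut.  Returns the result of the torn check,
 * or -1 if memory could not be allocated.
 */
int simulate_write(size_t landed)
{
	unsigned char *region = malloc(REGION_SIZE);
	unsigned char *update = malloc(REGION_SIZE);
	int torn = -1;

	if (region && update) {
		fill_region(region, REGION_SIZE, 0x1111UL);	/* old contents */
		fill_region(update, REGION_SIZE, 0x2222UL);	/* new write */
		memcpy(region, update, landed);
		torn = region_is_torn(region, REGION_SIZE);
	}
	free(region);
	free(update);
	return torn;
}
```

A write that lands completely or not at all leaves one pattern in the
region, so the check passes.  Any partial write leaves two patterns and
is reported as torn.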

Usage:

	gcc -Wall -o atomic-pattern atomic-pattern.c

	create a heavily fragmented file (exercise for the user, I need
	to make a mode for this)

	atomic-pattern file_name init

	<wait for init done printf to appear>
	<let it run for a while>
	<cut power to the box>
	<box comes back to life>

	atomic-pattern file_name check
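
One possible way to create the fragmented file is sketched below.  It
assumes that alternating small synced appends between the target and a
throwaway decoy file makes the allocator interleave their extents.
Whether that happens depends on the filesystem, so check the result with
filefrag.  make_fragmented, CHUNK and the decoy file are hypothetical
names for this sketch, and the size is rounded up to a multiple of CHUNK.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 4096

/*
 * Append CHUNK bytes to the target, sync, append CHUNK bytes to the
 * decoy, sync, and repeat until the target holds at least size bytes.
 * The decoy is removed at the end.  Returns 0 on success, -1 on error.
 */
int make_fragmented(const char *target, const char *decoy, off_t size)
{
	char buf[CHUNK];
	off_t pos;
	int ret = -1;
	int fd = open(target, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	int dfd = open(decoy, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0 || dfd < 0)
		goto out;

	memset(buf, 0, sizeof(buf));
	for (pos = 0; pos < size; pos += CHUNK) {
		/* fsync after each chunk so the block is allocated now */
		if (write(fd, buf, CHUNK) != CHUNK || fsync(fd))
			goto out;
		if (write(dfd, buf, CHUNK) != CHUNK || fsync(dfd))
			goto out;
	}
	ret = 0;
out:
	if (fd >= 0)
		close(fd);
	if (dfd >= 0)
		close(dfd);
	unlink(decoy);
	return ret;
}
```

Calling make_fragmented("file_name", "file_name.decoy", 300 * 1024 * 1024)
before the init step would give a file of the size the test program
expects.  Two fsync calls per 4KB chunk make this slow on rotating disks.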


In order to reliably find torn blocks without O_ATOMIC, I had to bump
the write size to 1MB and run 24 instances in parallel. 
	
/*
 * Copyright 2013 Fusion-io
 * GPLv2 or higher license
 */

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <errno.h>

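#define FILE_SIZE (300 * 1024 * 1024)

/*
 * Hardcoded flag values (O_DIRECT as on x86).  O_ATOMIC comes from this
 * patch set.  Plain ints, since fcntl(F_SETFL) takes an int argument.
 */
#ifndef O_DIRECT
#define O_DIRECT	00040000
#endif
#ifndef O_ATOMIC
#define O_ATOMIC	040000000
#endif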

/* Fill the whole buffer with repeated copies of seq. */
void set_block_headers(unsigned char *buf, size_t buffer_size, unsigned long seq)
{
	while (buffer_size >= sizeof(seq)) {
		memcpy(buf, &seq, sizeof(seq));
		buffer_size -= sizeof(seq);
		buf += sizeof(seq);
	}
}

/*
 * Verify that every word in the buffer matches the first one.
 * A mismatch means the region holds data from more than one write.
 */
int check_block_headers(unsigned char *buf, size_t buffer_size)
{
	unsigned long seq = 0;
	unsigned long check = 0;

	memcpy(&seq, buf, sizeof(seq));

	while (buffer_size >= sizeof(check)) {
		memcpy(&check, buf, sizeof(check));
		if (check != seq) {
			fprintf(stderr, "check failed %lx %lx\n", seq, check);
			return -EIO;
		}
		buffer_size -= sizeof(check);
		buf += sizeof(check);
	}
	return 0;
}

int main(int ac, char **av)
{
	unsigned char *file_buf;
	loff_t pos;
	int ret;
	int fd;
	int write_size = 1024 * 1024;
	char *filename = av[1];
	int check = 0;
	int init = 0;

	if (ac < 2) {
		fprintf(stderr, "usage: atomic-pattern filename [check | init]\n");
		exit(1);
	}
	if (ac > 2) {
		if (!strcmp(av[2], "check")) {
			check = 1;
			fprintf(stderr, "checking %s\n", filename);
		} else if (!strcmp(av[2], "init")) {
			init = 1;
			fprintf(stderr, "init %s\n", filename);
		} else {
			fprintf(stderr, "usage: atomic-pattern filename [check | init]\n");
			exit(1);
		}
	}

	ret = posix_memalign((void **)&file_buf, 4096, write_size);
	if (ret) {
		/* posix_memalign returns the error code and does not set errno */
		fprintf(stderr, "cannot allocate memory: %s\n", strerror(ret));
		exit(1);
	}

	fd = open(filename, O_RDWR, 0600);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	ret = fcntl (fd, F_SETFL, O_DIRECT | O_ATOMIC);
	if (ret) {
		perror("fcntl");
		exit(1);
	}
	pos = 0;
	if (!init && !check)
		goto runit;

	while (pos < FILE_SIZE) {
		if (check) {
			ret = pread(fd, file_buf, write_size, pos);
			if (ret != write_size) {
				perror("read");
				exit(1);
			}
			ret = check_block_headers(file_buf, write_size);
			if (ret) {
				fprintf(stderr, "Failed check on buffer %llu\n", (unsigned long long)pos);
				exit(1);
			}
		} else {
			set_block_headers(file_buf, write_size, rand());

			ret = pwrite(fd, file_buf, write_size, pos);
			if (ret != write_size) {
				perror("write");
				exit(1);
			}
		}
		pos += write_size;
	}
	if (check)
		exit(0);

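	/* make sure the initial patterns are on disk before random writes */
	if (fsync(fd)) {
		perror("fsync");
		exit(1);
	}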
runit:
	fprintf(stderr, "File init done, running random writes\n");

	while (1) {
		pos = rand() % FILE_SIZE;

		pos = pos / write_size;
		pos = pos * write_size;

		if (pos + write_size > FILE_SIZE)
			pos = 0;

		set_block_headers(file_buf, write_size, rand());

		ret = pwrite(fd, file_buf, write_size, pos);
		if (ret != write_size) {
			perror("write");
			exit(1);
		}
	}
	return 0;
}