Quoting Jeff Moyer (2013-11-05 12:43:31)
> Chris Mason <chris.mason@xxxxxxxxxxxx> writes:
>
> > This allows filesystems and O_DIRECT to send down a list of bios
> > flagged for atomic completion.  If the hardware supports atomic
> > IO, it is given the whole list in a single make_request_fn
> > call.
> >
> > In order to limit corner cases, there are a few restrictions in the
> > current code:
> >
> > * Every bio in the list must be for the same queue
> >
> > * Every bio must be a simple write.  No trims or reads may be mixed in
> >
> > A new blk_queue_set_atomic_write() sets the number of atomic segments a
> > given driver can accept.
> >
> > Any number greater than one is allowed, but the driver is expected to
> > do final checks on the bio list to make sure a given list fits inside
> > its atomic capabilities.
>
> Hi, Chris,
>
> This is great stuff.  I have a couple of high level questions that I'm
> hoping you can answer, given that you're closer to the hardware than
> most.  What constraints can we expect hardware to impose on atomic
> writes in terms of size and, um, contiguousness (is that a word)?  How
> do we communicate those constraints to the application?  (I'm not
> convinced a sysfs file is adequate.)
>
> For example, looking at NVMe, it appears that devices may guarantee that
> a set of /sequential/ logical blocks may be completed atomically, but I
> don't see a provision for disjoint regions.  That spec also
> differentiates between power fail write atomicity and "normal" write
> atomicity.

Unfortunately, it's hard to say.  I think the fusionio cards are the
only shipping devices that support this, but I've definitely heard that
others plan to support it as well.  mariadb/percona already support the
atomics via fusionio specific ioctls, and turning that into a real
O_ATOMIC is a priority so other hardware can just hop on the train.
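
For reference, the restrictions from the quoted patch description can be
modelled in plain userspace C.  This is only a rough sketch: struct
model_bio, enum model_op and model_check_atomic_list() are made-up names
for illustration, not the kernel's struct bio or anything in the patch,
and it assumes one segment per bio, which real bios do not guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, NOT the kernel's struct bio or request_queue. */
enum model_op { MODEL_WRITE, MODEL_READ, MODEL_TRIM };

struct model_bio {
	int queue_id;			/* queue this bio targets */
	enum model_op op;		/* kind of request */
	struct model_bio *next;		/* next bio in the atomic list */
};

/*
 * Return 0 if the list obeys the restrictions from the patch description,
 * -1 otherwise.  max_atomic_segments models the limit a driver would
 * register; counting one segment per bio is a simplifying assumption.
 */
static int model_check_atomic_list(const struct model_bio *head,
				   unsigned int max_atomic_segments)
{
	const struct model_bio *bio;
	unsigned int count = 0;

	assert(head != NULL);	/* caller must pass a non-empty list */

	for (bio = head; bio; bio = bio->next) {
		/* every bio in the list must be for the same queue */
		if (bio->queue_id != head->queue_id)
			return -1;
		/* simple writes only, no trims or reads mixed in */
		if (bio->op != MODEL_WRITE)
			return -1;
		/* the whole list has to fit the driver's atomic limit */
		if (++count > max_atomic_segments)
			return -1;
	}
	return 0;
}
```

A list of three writes to one queue passes with a limit of four and
fails with a limit of two, or if any entry is a trim or sits on another
queue.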
This feature in general is pretty natural for the log structured
squirrels they stuff inside flash, so I'd expect everyone to support it.
Matthew, how do you feel about all of this?

With the fusionio drivers, we've recently increased the max atomic size.
It's basically 1MB, disjoint or contig doesn't matter.  We're powercut
safe at 1MB.

>
> Basically, I'd like to avoid requiring a trial and error programming
> model to determine what an application can expect to work (like we have
> with O_DIRECT right now).

I'm really interested in ideas on how to provide that.  But, with dm,
md, and a healthy assortment of flash vendors, I don't know how...

I've attached my current test program.  The basic idea is to fill
buffers (1MB in size) with a random pattern.  Each buffer has a
different random pattern.  You let it run for a while and then pull the
plug.  After the box comes back up, run the program again and it looks
for consistent patterns filling each 1MB aligned region in the file.

Usage:

	gcc -Wall -o atomic-pattern atomic-pattern.c

	create a heavily fragmented file (exercise for the user, I need
	to make a mode for this)

	atomic-pattern file_name init

	<wait for init done printf to appear>
	<let it run for a while>
	<cut power to the box>
	<box comes back to life>

	atomic-pattern file_name check

In order to reliably find torn blocks without O_ATOMIC, I had to bump
the write size to 1MB and run 24 instances in parallel.
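
The check itself can be tried without pulling any plugs.  The sketch
below fakes a torn write in memory: an old pattern fills a 1MB block,
then only the first torn_bytes of a new pattern land on top of it.
DEMO_SIZE, fill_pattern(), check_pattern() and simulate_torn_write() are
names invented for this example, and the two sequence values are
arbitrary.  It says nothing about what real hardware does on power loss.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_SIZE (1024 * 1024)	/* same 1MB unit the test program uses */

/* Stamp every word of buf with the same sequence value. */
static void fill_pattern(unsigned char *buf, size_t size, unsigned long seq)
{
	size_t off;

	assert(size % sizeof(seq) == 0);
	for (off = 0; off < size; off += sizeof(seq))
		memcpy(buf + off, &seq, sizeof(seq));
}

/* Return 0 if every word matches the first one, -1 if the block is torn. */
static int check_pattern(const unsigned char *buf, size_t size)
{
	unsigned long first, cur;
	size_t off;

	memcpy(&first, buf, sizeof(first));
	for (off = sizeof(first); off + sizeof(cur) <= size; off += sizeof(cur)) {
		memcpy(&cur, buf + off, sizeof(cur));
		if (cur != first)
			return -1;
	}
	return 0;
}

/*
 * Pretend only the first torn_bytes of a new 1MB write reached the media.
 * Returns the checker's verdict (0 or -1), or 1 if allocation failed.
 */
static int simulate_torn_write(size_t torn_bytes)
{
	unsigned char *disk = malloc(DEMO_SIZE);
	unsigned char *incoming = malloc(DEMO_SIZE);
	int ret = 1;

	assert(torn_bytes <= DEMO_SIZE);
	if (disk && incoming) {
		fill_pattern(disk, DEMO_SIZE, 0x1111UL);	/* old contents */
		fill_pattern(incoming, DEMO_SIZE, 0x2222UL);	/* new write */
		memcpy(disk, incoming, torn_bytes);		/* partial landing */
		ret = check_pattern(disk, DEMO_SIZE);
	}
	free(disk);
	free(incoming);
	return ret;
}
```

A write that lands completely or not at all leaves a consistent block;
anything in between is reported as torn.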
/*
 * Copyright 2013 Fusion-io
 * GPLv2 or higher license
 */
#define _GNU_SOURCE	/* makes <fcntl.h> expose O_DIRECT */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

#define FILE_SIZE (300 * 1024 * 1024)

/* fallback only, this is the asm-generic value */
#ifndef O_DIRECT
#define O_DIRECT 00040000
#endif

/* O_ATOMIC is new with this patch series, so libc headers lack it */
#ifndef O_ATOMIC
#define O_ATOMIC 040000000
#endif

/* stamp every word of the buffer with the same sequence number */
void set_block_headers(unsigned char *buf, size_t buffer_size,
		       unsigned long seq)
{
	while (buffer_size >= sizeof(seq)) {
		memcpy(buf, &seq, sizeof(seq));
		buffer_size -= sizeof(seq);
		buf += sizeof(seq);
	}
}

/* every word must match the first one, anything else is a torn block */
int check_block_headers(unsigned char *buf, size_t buffer_size)
{
	unsigned long seq = 0;
	unsigned long check = 0;

	memcpy(&seq, buf, sizeof(seq));
	while (buffer_size >= sizeof(seq)) {
		memcpy(&check, buf, sizeof(check));
		if (check != seq) {
			fprintf(stderr, "check failed %lx %lx\n", seq, check);
			return -EIO;
		}
		buffer_size -= sizeof(seq);
		buf += sizeof(seq);
	}
	return 0;
}

int main(int ac, char **av)
{
	unsigned char *file_buf;
	off_t pos;
	int ret;
	int fd;
	int write_size = 1024 * 1024;
	char *filename;
	int check = 0;
	int init = 0;

	if (ac < 2) {
		fprintf(stderr, "usage: atomic-pattern filename [check | init]\n");
		exit(1);
	}
	filename = av[1];

	if (ac > 2) {
		if (!strcmp(av[2], "check")) {
			check = 1;
			fprintf(stderr, "checking %s\n", filename);
		} else if (!strcmp(av[2], "init")) {
			init = 1;
			fprintf(stderr, "init %s\n", filename);
		} else {
			fprintf(stderr, "usage: atomic-pattern filename [check | init]\n");
			exit(1);
		}
	}

	/* posix_memalign returns the error code and leaves errno alone */
	ret = posix_memalign((void **)&file_buf, 4096, write_size);
	if (ret) {
		fprintf(stderr, "cannot allocate memory: %s\n", strerror(ret));
		exit(1);
	}

	fd = open(filename, O_RDWR, 0600);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	ret = fcntl(fd, F_SETFL, O_DIRECT | O_ATOMIC);
	if (ret) {
		perror("fcntl");
		exit(1);
	}

	pos = 0;
	if (!init && !check)
		goto runit;

	/* init fills the whole file, check verifies every 1MB block */
	while (pos < FILE_SIZE) {
		if (check) {
			ret = pread(fd, file_buf, write_size, pos);
			if (ret != write_size) {
				perror("pread");
				exit(1);
			}
			ret = check_block_headers(file_buf, write_size);
			if (ret) {
				fprintf(stderr, "Failed check on buffer %llu\n",
					(unsigned long long)pos);
				exit(1);
			}
		} else {
			set_block_headers(file_buf, write_size, rand());
			ret = pwrite(fd, file_buf, write_size, pos);
			if (ret != write_size) {
				perror("pwrite");
				exit(1);
			}
		}
		pos += write_size;
	}
	if (check)
		exit(0);

	fsync(fd);

runit:
	fprintf(stderr, "File init done, running random writes\n");
	while (1) {
		/* pick a random 1MB aligned offset inside the file */
		pos = rand() % FILE_SIZE;
		pos = pos / write_size;
		pos = pos * write_size;
		if (pos + write_size > FILE_SIZE)
			pos = 0;

		set_block_headers(file_buf, write_size, rand());
		ret = pwrite(fd, file_buf, write_size, pos);
		if (ret != write_size) {
			perror("pwrite");
			exit(1);
		}
	}
	return 0;
}
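
The usage notes leave creating the fragmented file as an exercise.  One
possible way, sketched below, is to write 4KB blocks in a strided order
and fsync after each pass.  create_scattered_file(), FRAG_BLOCK and the
stride parameter are invented for this example and are not part of
atomic-pattern.c.  Whether this really fragments the file is an
assumption that depends on the filesystem's allocator, so the result is
worth checking with filefrag before trusting it.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define FRAG_BLOCK 4096

/*
 * Create path with exactly file_size bytes of zeroes.  Pass 0 writes blocks
 * 0, stride, 2*stride, ...; pass 1 writes 1, stride+1, ... and so on, with
 * an fsync between passes so each pass is allocated on its own.
 * Returns 0 on success, -1 on error.
 */
static int create_scattered_file(const char *path, off_t file_size, int stride)
{
	unsigned char block[FRAG_BLOCK];
	off_t nr_blocks = file_size / FRAG_BLOCK;
	off_t i;
	int start;
	int fd;

	assert(file_size % FRAG_BLOCK == 0 && stride > 0);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	memset(block, 0, sizeof(block));

	for (start = 0; start < stride; start++) {
		for (i = start; i < nr_blocks; i += stride) {
			if (pwrite(fd, block, sizeof(block), i * FRAG_BLOCK) !=
			    (ssize_t)sizeof(block))
				goto fail;
		}
		/* push this pass out before the next one fills the gaps */
		if (fsync(fd))
			goto fail;
	}
	return close(fd);

fail:
	close(fd);
	return -1;
}
```

For the test program the call would be something like
create_scattered_file("file_name", 300 * 1024 * 1024, 64), matching
FILE_SIZE, followed by the init run.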