Wow, very glad to hear that. I tried with the regular FS readahead tunable
and there was almost no effect on the regular test, so I thought that reads
could not be improved in this direction at all.

On Mon, Jul 29, 2013 at 2:24 PM, Li Wang <liwang@xxxxxxxxxxxxxxx> wrote:
> We performed an Iozone read test on a 32-node HPC cluster. Regarding the
> hardware of each node: the CPU is very powerful, as is the network, with a
> bandwidth > 1.5 GB/s, and there is 64 GB of memory. The IO is relatively
> slow; the throughput measured by 'dd' locally is around 70 MB/s. We
> configured a Ceph cluster with 24 OSDs on 24 nodes, one MDS, and one to
> four clients, one client per node. The performance is as follows:
>
> Iozone sequential read throughput (MB/s)
> Number of clients           1          2          4
> Default readahead size      180.0954   324.4836   591.5851
> Readahead size: 256 MB      645.3347   1022.998   1267.631
>
> The complete iozone command for one client is:
> iozone -t 1 -+m /tmp/iozone.nodelist.50305030 -s 64G -r 4M -i 0 -+n -w -c -e
> -b /tmp/iozone.nodelist.50305030.output
> On each client node, only one thread is started.
>
> For two clients, it is:
> iozone -t 2 -+m /tmp/iozone.nodelist.50305030 -s 32G -r 4M -i 0 -+n -w -c -e
> -b /tmp/iozone.nodelist.50305030.output
>
> As the data shows, a larger readahead window can result in a >300% speedup!
>
> Besides, since the backend of Ceph is not a traditional hard disk, it is
> beneficial to capture stride-read prefetching. To prove this, we tested
> stride reads with the following program. As we know, the generic readahead
> algorithm of the Linux kernel does not detect stride-read patterns, so we
> use fadvise() to manually force prefetching. The record size is 4 MB. The
> result is even more surprising:
>
> Stride read throughput (MB/s)
> Number of records prefetched   0       1        4        16       64       128
> Throughput                     42.82   100.74   217.41   497.73   854.48   950.18
>
> As the data shows, with a readahead size of 128*4 MB, the speedup over no
> readahead can be up to 950/42 > 2000%!
>
> The core logic of the test program is below:
>
> stride = 17
> recordsize = 4MB    /* block == recordsize */
> for (;;) {
>     for (i = 0; i < count; ++i) {
>         long long start = pos + (i + 1) * stride * recordsize;
>         printf("PRE READ %lld %lld\n", start, start + block);
>         posix_fadvise(fd, start, block, POSIX_FADV_WILLNEED);
>     }
>     len = read(fd, buf, block);
>     total += len;
>     printf("READ %lld %lld\n", pos, pos + len);
>     pos += len;
>     lseek(fd, (stride - 1) * block, SEEK_CUR);
>     pos += (stride - 1) * block;
> }
>
> Given the above results and some more, we plan to submit a blueprint to
> discuss the prefetching optimization of Ceph.
>
> Cheers,
> Li Wang
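
The 256 MB case above corresponds to enlarging the client-side readahead
window. The email does not say exactly which knob was turned, so the
following is only a sketch of how that is commonly done: for the kernel
CephFS client the rasize mount option sets the readahead size in bytes, and
for ceph-fuse the client_readahead_max_bytes option plays a similar role
(both should be checked against the Ceph version in use).

    # Sketch only: mount CephFS with a 256 MB readahead window (rasize is in bytes).
    # Monitor address, credentials, and mount point are placeholders.
    mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret,rasize=268435456

    # ceph-fuse alternative, in the [client] section of ceph.conf:
    # client_readahead_max_bytes = 268435456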
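
Since the quoted snippet only shows the core loop, here is a self-contained
sketch of a stride-read test driven by posix_fadvise(POSIX_FADV_WILLNEED).
It is not the original test program: the 4 MB record size, stride of 17
records, and variable prefetch depth come from the email, while the file
path handling, buffer management, and error checks are filled in as
assumptions.

    /* Stride-read test sketch (not the original program from the email).
     * Reads one 4 MB record, skips (stride - 1) records, and asks the
     * kernel to prefetch the next `prefetch` records that the stride
     * pattern will touch, since generic readahead will not guess them.
     */
    #define _FILE_OFFSET_BITS 64
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define RECORD_SIZE (4LL * 1024 * 1024)   /* 4 MB records, as in the test */
    #define STRIDE      17                    /* read 1 record, skip 16       */

    int main(int argc, char **argv)
    {
        if (argc < 3) {
            fprintf(stderr, "usage: %s <file> <records-to-prefetch>\n", argv[0]);
            return 1;
        }
        int prefetch = atoi(argv[2]);         /* 0, 1, 4, 16, 64, 128, ...    */

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char *buf = malloc(RECORD_SIZE);
        if (!buf) { perror("malloc"); return 1; }

        long long pos = 0, total = 0;
        for (;;) {
            /* Hint the records the stride pattern will hit next. */
            for (int i = 0; i < prefetch; ++i) {
                long long start = pos + (long long)(i + 1) * STRIDE * RECORD_SIZE;
                posix_fadvise(fd, start, RECORD_SIZE, POSIX_FADV_WILLNEED);
            }

            ssize_t len = read(fd, buf, RECORD_SIZE);
            if (len <= 0)
                break;                        /* EOF or error ends the run    */
            total += len;
            pos += len;

            /* Skip ahead to the next record of the stride pattern. */
            if (lseek(fd, (long long)(STRIDE - 1) * RECORD_SIZE, SEEK_CUR) < 0)
                break;
            pos += (long long)(STRIDE - 1) * RECORD_SIZE;
        }

        printf("read %lld bytes\n", total);
        free(buf);
        close(fd);
        return 0;
    }

Running it with prefetch depths of 0, 1, 4, 16, 64, and 128 would reproduce
the shape of the table above; the absolute numbers of course depend on the
cluster.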