Wow, very glad to hear that. I tried with the regular FS readahead tunable
and there was almost no effect on the regular test, so I thought that reads
could not be improved in this direction at all.

On Mon, Jul 29, 2013 at 2:24 PM, Li Wang <liwang@xxxxxxxxxxxxxxx> wrote:
> We performed an Iozone read test on a 32-node HPC cluster. Regarding the
> hardware of each node: the CPU is very powerful, as is the network, with a
> bandwidth > 1.5 GB/s, and there is 64 GB of memory. The IO is relatively
> slow; the throughput measured by 'dd' locally is around 70 MB/s. We
> configured a Ceph cluster with 24 OSDs on 24 nodes, one MDS, and one to
> four clients, one client per node. The performance is as follows:
>
> Iozone sequential read throughput (MB/s)
> Number of clients           1          2          4
> Default readahead size      180.0954   324.4836   591.5851
> Readahead size: 256 MB      645.3347   1022.998   1267.631
>
> The complete iozone command for one client is:
> iozone -t 1 -+m /tmp/iozone.nodelist.50305030 -s 64G -r 4M -i 0 -+n -w -c -e
> -b /tmp/iozone.nodelist.50305030.output
> On each client node, only one thread is started.
>
> For two clients, it is:
> iozone -t 2 -+m /tmp/iozone.nodelist.50305030 -s 32G -r 4M -i 0 -+n -w -c -e
> -b /tmp/iozone.nodelist.50305030.output
>
> As the data shows, a larger readahead window can result in a >300% speedup!
>
> Besides, since the backend of Ceph is not a traditional hard disk, it is
> beneficial to capture stride-read prefetching. To prove this, we tested
> stride reads with the following program. As we know, the generic readahead
> algorithm of the Linux kernel does not detect stride-read patterns, so we
> use fadvise() to manually force prefetching. The record size is 4 MB. The
> result is even more surprising:
>
> Stride read throughput (MB/s)
> Number of records prefetched   0       1        4        16       64       128
> Throughput                     42.82   100.74   217.41   497.73   854.48   950.18
>
> As the data shows, with a readahead size of 128*4 MB, the speedup over no
> readahead can be up to 950/42 > 2000%!
>
> The core logic of the test program is below:
>
> stride = 17
> recordsize = 4MB    /* block == recordsize */
> for (;;) {
>     for (i = 0; i < count; ++i) {
>         long long start = pos + (i + 1) * stride * recordsize;
>         printf("PRE READ %lld %lld\n", start, start + block);
>         posix_fadvise(fd, start, block, POSIX_FADV_WILLNEED);
>     }
>     len = read(fd, buf, block);
>     total += len;
>     printf("READ %lld %lld\n", pos, pos + len);
>     pos += len;
>     lseek(fd, (stride - 1) * block, SEEK_CUR);
>     pos += (stride - 1) * block;
> }
>
> Given the above results and some more, we plan to submit a blueprint to
> discuss the prefetching optimization of Ceph.
>
> Cheers,
> Li Wang
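
The 256 MB case above corresponds to enlarging the client-side readahead
window. The email does not say exactly which knob was turned, so the
following is only a sketch of how that is commonly done: for the kernel
CephFS client the rasize mount option sets the readahead size in bytes, and
for ceph-fuse the client_readahead_max_bytes option plays a similar role
(both should be checked against the Ceph version in use).

    # Sketch only: mount CephFS with a 256 MB readahead window (rasize is in bytes).
    # Monitor address, credentials, and mount point are placeholders.
    mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret,rasize=268435456

    # ceph-fuse alternative, in the [client] section of ceph.conf:
    # client_readahead_max_bytes = 268435456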
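
Since the quoted snippet only shows the core loop, here is a self-contained
sketch of a stride-read test driven by posix_fadvise(POSIX_FADV_WILLNEED).
It is not the original test program: the 4 MB record size, stride of 17
records, and variable prefetch depth come from the email, while the file
path handling, buffer management, and error checks are filled in as
assumptions.

    /* Stride-read test sketch (not the original program from the email).
     * Reads one 4 MB record, skips (stride - 1) records, and asks the
     * kernel to prefetch the next `prefetch` records that the stride
     * pattern will touch, since generic readahead will not guess them.
     */
    #define _FILE_OFFSET_BITS 64
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define RECORD_SIZE (4LL * 1024 * 1024)   /* 4 MB records, as in the test */
    #define STRIDE      17                    /* read 1 record, skip 16       */

    int main(int argc, char **argv)
    {
        if (argc < 3) {
            fprintf(stderr, "usage: %s <file> <records-to-prefetch>\n", argv[0]);
            return 1;
        }
        int prefetch = atoi(argv[2]);         /* 0, 1, 4, 16, 64, 128, ...    */

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char *buf = malloc(RECORD_SIZE);
        if (!buf) { perror("malloc"); return 1; }

        long long pos = 0, total = 0;
        for (;;) {
            /* Hint the records the stride pattern will hit next. */
            for (int i = 0; i < prefetch; ++i) {
                long long start = pos + (long long)(i + 1) * STRIDE * RECORD_SIZE;
                posix_fadvise(fd, start, RECORD_SIZE, POSIX_FADV_WILLNEED);
            }

            ssize_t len = read(fd, buf, RECORD_SIZE);
            if (len <= 0)
                break;                        /* EOF or error ends the run    */
            total += len;
            pos += len;

            /* Skip ahead to the next record of the stride pattern. */
            if (lseek(fd, (long long)(STRIDE - 1) * RECORD_SIZE, SEEK_CUR) < 0)
                break;
            pos += (long long)(STRIDE - 1) * RECORD_SIZE;
        }

        printf("read %lld bytes\n", total);
        free(buf);
        close(fd);
        return 0;
    }

Running it with prefetch depths of 0, 1, 4, 16, 64, and 128 would reproduce
the shape of the table above; the absolute numbers of course depend on the
cluster.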