On Tue, Oct 12, 2010 at 07:30:48PM -0700, Gregory Farnum wrote:
> Does this mean you're using cfuse rather than the kernel client?
> FUSE performance in general is fairly disappointing and our cfuse is
> probably not as fast as the kernel client even so, though I don't
> think it should be *that* unhappy in most environments.

No, I'm using the kernel client (from 2.6.34). Specifically, I'm doing
a "modprobe ceph; mount -t ceph 1.2.3.4:/ /mnt". Sorry, I should have
mentioned that. I can use a more recent kernel (i.e., 2.6.36-rc7) if
that's likely to help.

> So you have 5 journals running on one spindle? This could be the cause
> of your slightly low sequential write performance; in the current
> default configuration writes have to go to the journal before going to
> the main disk and with multiple OSDs on one journal spindle they could
> be getting in each other's way.

Hmm, what do you recommend, then? The problem is that if the journal
only needs to be a few gigabytes (I used a 5GB file), using an entire
1T or 2T disk just so each journal can have its own spindle is pretty
wasteful.

> Also, how much memory do you have on these machines?

32GB

> I'm not familiar with FFSB and there doesn't seem to be any
> easily-accessible documentation, can you tell us a little more about
> how it works? For instance, how are the test files created (are they
> written out for the reads and then tested? Does the random write
> create the files as it goes, or are they pre-existing and then
> overwritten)?

There's a quick explanation of these workloads at
http://btrfs.boxacle.net (I'm using the raid configuration FFSB
files), but essentially, the large file create test creates 100MB
files as quickly as possible. In the rest of the tests we create 1024
100MB files, and then try (a) reading from them sequentially as
quickly as possible, (b) picking a random file and a random offset,
reading 5MB, and repeating, (c) picking a random file and a random
offset, writing 5MB, and repeating. The creation of the 1024 100MB
files (if necessary; the tests will reuse the previously created set
of 100MB files) is not counted in the benchmark time. So in the last
three tests there is no block allocation; it's just the time it takes
to read or overwrite existing data blocks.

Note BTW that this is not intrinsic to FFSB; FFSB stands for the
"Flexible Filesystem Benchmark" system, and all of this is
configurable using the ffsb config files. I'm just reusing the boxacle
workloads because they are convenient, and I'm familiar with how they
behave on local disk filesystems. They're used for example for
benchmarking ext4 here: http://free.linux.hp.com/~enw/ext4/2.6.35/,
and for btrfs here: http://btrfs.boxacle.net. (And when the IBM folks
have done btrfs benchmarks, since they are so detailed and the
hardware/configurations are so well described, I've also used them to
help improve ext4's performance.)

> A few thoughts/wild guesses:
> I'm not sure exactly what the limit is, but 114MB/s reads are close to
> what you can get over a 1Gb link.
> If single-threaded FFSB means there's only one request in-flight at a
> time there may be a latency issue which is causing those 35MB/s reads.
> The kernel client ought to be prefetching but maybe it's not doing so
> properly, and I don't recall how much prefetching cfuse is actually
> capable of. Sage can say more on this.

I'm not using cfuse; I'm using the in-kernel Ceph module. As far as
network latency is concerned, ping RTT time is under 0.25ms.
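If it would help to separate raw latency from anything FFSB-specific,
I can also do a quick single-stream sequential read over the ceph
mount with plain dd, along these lines (the file name below is just a
placeholder; I'd point it at one of the 100MB files the benchmark
left behind):

# drop the client's page cache so we measure reads over the wire,
# not locally cached data
echo 3 > /proc/sys/vm/drop_caches

# single-threaded sequential read of one 100MB benchmark file
# (substitute an actual file name from the ceph mount)
dd if=/mnt/bigfile.0 of=/dev/null bs=1M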
And sure, maybe it's a prefetching issue --- but in that case I would
have expected 8 threads to do better than 2x the 1-thread case.

                                        - Ted
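P.S. In case it helps to see roughly what one of these workloads looks
like, here's a sketch of a 1-thread random-read profile, written out
and run from the shell. The directive names are from memory and this
is not the exact boxacle profile (that lives in their raid
configuration files), so treat it as illustrative only; the mount
point and file name are just placeholders.

# rough sketch of a single-threaded random-read FFSB profile;
# not the actual boxacle config, directive names from memory
cat > random_read.ffsb <<'EOF'
time=300

[filesystem0]
        location=/mnt
        num_files=1024
        min_filesize=104857600
        max_filesize=104857600
        reuse=1
[end0]

[threadgroup0]
        num_threads=1
        read_weight=1
        read_random=1
        read_size=5242880
        read_blocksize=4096
[end0]
EOF

# run it against the fileset on the ceph mount (assumes ffsb is in PATH)
ffsb random_read.ffsb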