Hi all,

After some time I've got several consolidated (hopefully trivial) questions. I've tried to keep them short; sorry for asking so many at once.

kernel 2.6.36-02063601-generic (23-Nov-2010)
ceph version 0.23.2 (commit: 5bdae2af8c53adb2e059022c58813e97e7a7ba5d)

1. About this bug in the tracker: http://tracker.newdream.net/issues/479

When using ext4, I've now found that the hang happens when a client runs on the same machine as the ceph system, but it does not happen when the client runs on a different machine. That is:
- machine1 (MDS, MON, OSD, plus client mount) = hangs
- machine1 (MDS, MON, OSD) with a separate client mount on machine2 = ok
I suspect this is because of the same IP address (loopback) with some sync mismatch..?

For btrfs, the only way I've found to avoid the hang is to disable journaling (both writeahead and writeparallel). dd copies keep crashing at about 576MB (the mount hangs and is pretty much unmountable again until a hard reboot).

2. For benchmarking the entire ceph system in general, which tools would you recommend, particularly for object-based storage? Would a block-based storage benchmark such as Iometer be a bad choice, and why or why not?

I'm running fio (Iometer-like) to do random/sequential read/write, and I did the rados test too. I expected the 'rados -p metadata bench write' command to write to the mds node, but it was writing to the osd nodes (i.e., the same place as the data). Also, the 'seq' test crashes with lots of errors.

3. I've got four powerful machines: 2 for osd nodes, 1 for mds/mon together, and 1 for clients, but I was mainly testing with just 1 osd node as a basic test. The tests are run independently, first for SSDs and then again for HDDs. (These disks sit behind HW RAID controllers, but each is individually configured with the recommended server-optimal settings, e.g. 8k, directio, direct cache, writethrough for SSD and writeback for HDD.) The replication rules are all set to 1x (and then again to the default 2x). All osds use a 10G journal size (journal writeahead, parallel, and no journal, on ext4). All links are 2Gb links, so I get close to 230MB/s with iperf.

Once ceph is started, I first do a large single-file dd copy (40GB for SSDs, 500GB for HDDs); details are here:
twiki.esc.auckland.ac.nz/twiki/bin/view/NDSG/Cephtest
The dd plot is here:
twiki.esc.auckland.ac.nz/twiki/pub/NDSG/Cephtest/ddplot.pdf

After the files are created, I run the iometer-like fio for random/sequential read/write (RR, SR, RW, and SW) against the file I just created (i.e., 40GB for SSD and 500GB for HDD). The exact dd and fio invocations are sketched below, after the question list. The results are here (note, the plot is quite condensed and dense!):
twiki.esc.auckland.ac.nz/twiki/pub/NDSG/Cephtest/fioplot.pdf
The only parts that make sense to me are the sequential read parts; writes are slow, and I'm not really seeing any 'gaps' (visible differences) when the osd count changes.

Questions:
- When journal size = 10000, why does the journal file in /data/ show up as 1.8GB?
- Journal writeahead or parallel: which mode should I set for SSD, and which for hard disk? (The plot basically answers this: ahead and parallel are both okay.)
- Increasing the number of osds (1 to 3 to 6) doesn't seem to increase performance at all (or at most by a few %).
- In some cases 6 osds is slower than 3 osds (for the HDD dd copy in particular: 1 osd to 3 osds is an increase, but 6 osds deteriorates badly).
- So I ended up using 2 nodes, with 1+1, 2+2, and 3+3 osd configurations, and for HDDs the dd numbers do seem to increase linearly.
- So why are 6 osds on 1 node slower than 3+3 osds on 2 nodes? If anything, I'd expect some bottleneck between the nodes. The crushmap appears to be the same.

I was really expecting to see a linear, or at least visible, increase as I add more disks, but apparently not at all...
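For reference, the dd and fio invocations look roughly like this (the mount point and filename are placeholders for my setup, and the sizes are the 40GB SSD case; please say if any of these flags are obviously wrong for this kind of test):

    # single large sequential file written through the client mount
    dd if=/dev/zero of=/mnt/ceph/testfile bs=1M count=40960 oflag=direct

    # then random read against that same file (likewise randwrite/read/write)
    fio --name=ceph-randread --filename=/mnt/ceph/testfile --size=40g \
        --bs=8k --direct=1 --ioengine=libaio --rw=randread \
        --runtime=300 --time_based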
4. For the PG setup, using a simple ceph.conf (just osd[123] on a single machine) and setting replication to 1x, how should I expect objects to get mapped to disks? How do I really configure and control where things go? Currently I have:

    rule data {
            ruleset 0
            type replicated
            min_size 1
            max_size 10
            step take root
            step choose firstn 0 type device
            step emit
    }

Is it normal for this to always be 'type replicated'? Is it supposed to be different when replication is set to 2x, 3x, etc.? How do I check and make sure that the data really is distributed among all 3 (or 6) osds, for both writes and reads, like RAID0? (I've put the exact command sequence I'm using to inspect this in the P.S. below.)

Using 'crushtool --num_osds 3 -o file --build host straw 0 root straw 0' gives:

    # buckets
    host host0 {
            id -1           # do not change unnecessarily
            alg straw
            hash 0          # rjenkins1
            item device0 weight 1.000
    }
    ...(omitted for host1 and host2)
    root root {
            id -4           # do not change unnecessarily
            alg straw
            hash 0          # rjenkins1
            item host0 weight 1.000
            item host1 weight 1.000
            item host2 weight 1.000
    }

    # rules
    rule data {
            ruleset 1
            type replicated
            min_size 2
            max_size 2
            step take root
            step chooseleaf firstn 0 type host
            step emit
    }

What exactly is the difference between the above and the existing map (the one regenerated from the default ceph.conf)? Why does the ruleset start from 1, and does it have to be different for each rule? Where does the min/max of 2 come from? Why is it set to chooseleaf, and what do I set to get 'choose'? Also, it doesn't list a metadata rule, so I had to add that in manually.

I added --crushmapsrc file.txt to mkcephfs, but when I read the map back from ceph, it seems to return the normal one (i.e., regenerated from ceph.conf). So I ended up using the online approach after ceph had started and was mounted, but as I write files, it hangs.

5. Other than the ceph, crush and rados papers, has there been any ceph work on application/performance/benchmark comparisons against other systems?

Thanks a lot in advance, and again sorry for asking so many.
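P.S. For completeness, this is roughly the sequence I use to pull out, edit, and re-inject the crushmap online, and to check where an object lands (filenames are just placeholders, and I'm not certain 'ceph osd map' already exists in 0.23; please correct me if any step is wrong):

    # pull the crushmap currently in use and decompile it to text
    ceph osd getcrushmap -o /tmp/cm
    crushtool -d /tmp/cm -o /tmp/cm.txt

    # edit the rules in /tmp/cm.txt, then recompile and inject it
    crushtool -c /tmp/cm.txt -o /tmp/cm.new
    ceph osd setcrushmap -i /tmp/cm.new

    # check which PG and osds a given object in the 'data' pool maps to
    ceph osd map data some_object_name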