On Tue, Oct 16, 2012 at 7:27 AM, Maciej Gałkiewicz <maciejgalkiewicz@xxxxxxxxxxxxx> wrote:
> Hello
>
> I have two ceph clusters configured this way:
>
> production:
> # cat /etc/ceph/ceph.conf
> [global]
> auth supported = cephx
> keyring = /srv/ceph/keyring.admin
>
> [mon]
> mon data = /srv/ceph/mon
> mon clock drift allowed = 0.5
>
> [mon.n11c1]
> host = n11c1
> mon addr = 1.1.1.49:6789
> [mon.cc2]
> host = cc2
> mon addr = 1.1.1.48:6789
> [mon.n13c1]
> host = n13c1
> mon addr = 1.1.1.51:6789
>
> [mds]
> debug mds = 1
> keyring = /srv/ceph/ceph-stage2/keyring.$name
>
> [mds.n11c1]
> host = n11c1
>
> [osd]
> osd data = /srv/ceph/$name
> osd journal = /srv/ceph/$name.journal
> osd journal size = 1000
> keyring = /srv/ceph/ceph-stage2/keyring.$name
> debug osd = 1
>
> [osd.2]
> host = n14c1
> [osd.0]
> host = n11c1
> [osd.3]
> host = n13c1
> [osd.1]
> host = n12c1
>
> staging:
> # cat /etc/ceph/ceph.conf
> [global]
> auth supported = cephx
> keyring = /srv/ceph/keyring.admin
>
> [mon]
> mon data = /srv/ceph/mon
> mon clock drift allowed = 0.5
>
> [mon.cc]
> host = cc
> mon addr = 1.1.1.35:6789
> [mon.n3cc]
> host = n3cc
> mon addr = 1.1.1.34:6789
>
> [mds]
> debug mds = 1
> keyring = /srv/ceph/ceph-stage2/keyring.$name
>
> [mds.cc]
> host = cc
> [mds.n3cc]
> host = n3cc
> mds standby replay = true
> mds standby for name = cc
>
> [osd]
> osd data = /srv/ceph/$name
> osd journal = /srv/ceph/$name.journal
> osd journal size = 1000
> keyring = /srv/ceph/ceph-stage2/keyring.$name
> debug osd = 1
>
> [osd.0]
> host = cc
> [osd.1]
> host = n3cc
>
> I am using RBD volumes mapped on virtual machines which run postgresql
> and mongodb databases. Right now there are 15 clients with postgresql
> and 10 with mongodb. All clients generate at most 0.4 IOPS (both read
> and write). Here are the graphs (writes per second) for nodes with
> osds from last week:
> https://www.dropbox.com/s/djnpcxb6a9ktzv8/ceph_stats1.png
> https://www.dropbox.com/s/ak7npkhm776jarp/ceph_stats2.png
> https://www.dropbox.com/s/3lzfaku1nourmle/ceph_stats3.png
>
> Is it normal that ceph generates so much writes? In staging cluster
> there are no clients connected and still there are about 40 writes/s.
> I have also checked admin socket and dumped ops_in_flight. There were
> 0 ops.

What's this data showing us? Write requests of any size, to any disk
on the node? That's more than I'd expect to see, but Ceph is going to
engage in some background chatter, and I notice you have at least a
little bit of debug logging enabled.
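
If you want to rule the logging in or out, one cheap test is to turn
the debug settings down at runtime and see whether the write rate on
those graphs drops. This is just the generic injectargs mechanism, so
treat it as a sketch and adjust the option names to what you actually
have set ("debug osd = 1" and "debug mds = 1" in your ceph.conf):

# ceph osd tell \* injectargs '--debug-osd 0'
# ceph mds tell 0 injectargs '--debug-mds 0'
# iostat -x 5

If the writes disappear, it was the logging; if they don't, what's
left is Ceph's own background work (heartbeats, the OSDs reporting PG
stats, the monitors writing those updates into their stores, and so
on).
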
> The situation on production is also weird for me. There are a lot of
> writes which do not reflect volume usage on clients. Looks like ceph
> have a big overhead or more likely something is wrong with the
> cluster.
>
> Benchmarks from both clusters:
>
> root@n12c1[production]:/srv/ceph# rados -p todo-list bench 60 write -b
> 4096 -t 16
> Maintaining 16 concurrent writes of 4096 bytes for at least 60 seconds.
> Object prefix: benchmark_data_n12c1_2638
> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
> 0 0 0 0 0 0 - 0
> 1 16 20 4 0.0156214 0.015625 0.798329 0.696832
> 2 16 20 4 0.00781079 0 - 0.696832
> 3 16 20 4 0.00520727 0 - 0.696832
> 4 16 20 4 0.00390549 0 - 0.696832
> 5 16 20 4 0.00312442 0 - 0.696832
> 6 16 20 4 0.00260369 0 - 0.696832
> 7 16 20 4 0.00223175 0 - 0.696832
> 8 16 20 4 0.00195278 0 - 0.696832
> 9 16 20 4 0.00173581 0 - 0.696832
> 10 16 20 4 0.00156223 0 - 0.696832
> 11 16 20 4 0.00142021 0 - 0.696832
> 12 16 20 4 0.00130186 0 - 0.696832
> 13 16 20 4 0.00120171 0 - 0.696832
> 14 16 20 4 0.00111588 0 - 0.696832
> 15 16 20 4 0.00104149 0 - 0.696832
> 16 16 20 4 0.000976395 0 - 0.696832
> 17 16 20 4 0.000918959 0 - 0.696832
> 18 16 20 4 0.000867905 0 - 0.696832
> 19 16 24 8 0.00164445 0.000868056 18.9809 9.83857
> 2012-10-16 16:10:18.553915 min lat: 0.391531 max lat: 18.9809 avg lat: 9.83857
> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
> 20 16 24 8 0.00156223 0 - 9.83857
> 21 16 25 9 0.00167381 0.00195312 20.6347 11.0381
> 22 16 36 20 0.00355051 0.0429688 0.259829 14.4342
> 23 16 36 20 0.00339614 0 - 14.4342
> 24 16 36 20 0.00325464 0 - 14.4342
> 25 16 37 21 0.00328068 0.00130208 23.8748 14.8838
> 26 16 37 21 0.0031545 0 - 14.8838
> 27 16 37 21 0.00303766 0 - 14.8838
> 28 16 37 21 0.00292918 0 - 14.8838
> 29 16 37 21 0.00282817 0 - 14.8838
> 30 16 37 21 0.0027339 0 - 14.8838
> 31 16 37 21 0.00264571 0 - 14.8838
> 32 16 37 21 0.00256303 0 - 14.8838
> 33 16 37 21 0.00248537 0 - 14.8838
> 34 16 37 21 0.00241227 0 - 14.8838
> 35 16 37 21 0.00234335 0 - 14.8838
> 36 16 37 21 0.00227825 0 - 14.8838
> 37 16 37 21 0.00221668 0 - 14.8838
> 38 16 37 21 0.00215834 0 - 14.8838
> 39 16 37 21 0.002103 0 - 14.8838
> 2012-10-16 16:10:38.557419 min lat: 0.258984 max lat: 23.8748 avg lat: 14.8838
> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
> 40 16 37 21 0.00205043 0 - 14.8838
> 41 16 37 21 0.00200041 0 - 14.8838
> 42 16 37 21 0.00195279 0 - 14.8838
> 43 16 37 21 0.00190737 0 - 14.8838
> 44 16 37 21 0.00186402 0 - 14.8838
> 45 16 37 21 0.0018226 0 - 14.8838
> 46 16 37 21 0.00178298 0 - 14.8838
> 47 16 37 21 0.00174504 0 - 14.8838
> 48 16 37 21 0.00170869 0 - 14.8838
> 49 16 37 21 0.00167382 0 - 14.8838
> 50 16 37 21 0.00164034 0 - 14.8838
> 51 16 37 21 0.00160818 0 - 14.8838
> 52 16 37 21 0.00157725 0 - 14.8838
> 53 16 37 21 0.00154749 0 - 14.8838
> 54 16 37 21 0.00151884 0 - 14.8838
> 55 16 37 21 0.00149122 0 - 14.8838
> 56 16 37 21 0.00146459 0 - 14.8838
> 57 16 37 21 0.0014389 0 - 14.8838
> 58 16 37 21 0.00141409 0 - 14.8838
> 59 16 37 21 0.00139012 0 - 14.8838
> 2012-10-16 16:10:58.560825 min lat: 0.258984 max lat: 23.8748 avg lat: 14.8838
> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
> 60 16 37 21 0.00136695 0 - 14.8838
> 61 16 37 21 0.00134454 0 - 14.8838
> 62 16 37 21 0.00132286 0 - 14.8838
> 63 16 37 21 0.00130186 0 - 14.8838
> 64 16 37 21 0.00128152 0 - 14.8838
> 65 16 37 21 0.0012618 0 - 14.8838
> 66 6 38 32 0.00189361 0.00104802 44.154 25.6537
> 67 6 38 32 0.00186535 0 - 25.6537
> Total time run: 67.779471
> Total writes made: 38
> Write size: 4096
> Bandwidth (MB/sec): 0.002
>
> Stddev Bandwidth: 0.00555568
> Max bandwidth (MB/sec): 0.0429688
> Min bandwidth (MB/sec): 0
> Average Latency: 27.8774
> Stddev Latency: 17.9306
> Max latency: 64.4662
> Min latency: 0.258984

That's pretty hilariously slow — patterns like this usually mean that
you have one OSD in particular which is very slow at serving writes.
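
One quick way to check for that, and I'm assuming the stock tooling
that ships with 0.52 here, is to ask each OSD to benchmark its own
backing store and compare the results as they come back in the
cluster log (the loop just walks your four OSD ids):

# for i in 0 1 2 3; do ceph osd tell $i bench; done
# ceph -w

One OSD reporting a wildly lower rate than the other three would be
your culprit. Watching dump_ops_in_flight on each OSD's admin socket
while the rados bench is running, as you already did on staging,
should point the same way if a single OSD is sitting on all the slow
ops.
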
Have you run any benchmarks on the backing filesystems without Ceph in
the way? These results, combined with the fact that these are aging
btrfs filesystems that have been serving database access patterns,
make me think that you've simply got a workload that fragments btrfs
horribly, so the backing filesystems themselves can't sustain any real
throughput.
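
A couple of quick checks on one of the OSD nodes would confirm or kill
that theory. These are just generic tools, and the paths are only
examples based on your "osd data" setting, so point them at whichever
btrfs mount actually holds the data (and delete the test file
afterwards):

# dd if=/dev/zero of=/srv/ceph/osd.0/dd-test bs=4k count=2000 oflag=dsync
# find /srv/ceph/osd.0/current -type f -size +1M | head -n 20 | xargs filefrag
# rm /srv/ceph/osd.0/dd-test

The dd gives you a rough synchronous small-write rate for the
filesystem on its own, and filefrag shows how many extents the object
files have been shredded into; a huge extent count per file would fit
the fragmentation theory.
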
-Greg

>
> root@cc[staging]:/var/images# rados -p todo-list-test bench 60 write
> -b 4096 -t 16
> Maintaining 16 concurrent writes of 4096 bytes for at least 60 seconds.
> Object prefix: benchmark_data_cc_20637
> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
> 0 0 0 0 0 0 - 0
> 1 16 129 113 0.441308 0.441406 0.089088 0.130034
> 2 16 228 212 0.413976 0.386719 0.170227 0.144586
> 3 16 324 308 0.400965 0.375 0.17551 0.149879
> 4 16 481 465 0.454019 0.613281 0.081618 0.136798
> 5 16 628 612 0.478042 0.574219 0.080181 0.124768
> 6 16 641 625 0.406831 0.0507812 0.707781 0.136926
> 7 16 641 625 0.348713 0 - 0.136926
> 8 16 659 643 0.313911 0.0351562 0.281386 0.193741
> 9 16 739 723 0.313749 0.3125 0.177852 0.196558
> 10 16 865 849 0.331585 0.492188 0.142906 0.18791
> 11 16 1009 993 0.352569 0.5625 0.089554 0.176182
> 12 16 1059 1043 0.339462 0.195312 0.132217 0.174339
> 13 16 1075 1059 0.318157 0.0625 1.57543 0.195923
> 14 16 1075 1059 0.295431 0 - 0.195923
> 15 16 1076 1060 0.275996 0.00195312 2.07952 0.1977
> 16 16 1156 1140 0.278275 0.3125 0.158087 0.22184
> 17 16 1204 1188 0.272933 0.1875 0.08167 0.223639
> 18 16 1221 1205 0.261459 0.0664062 0.714995 0.236254
> 19 16 1222 1206 0.247904 0.00390625 0.293113 0.236301
> 2012-10-16 16:14:21.879463 min lat: 0.066844 max lat: 2.96921 avg lat: 0.247091
> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
> 20 16 1302 1286 0.251131 0.3125 0.19978 0.247091
> 21 16 1317 1301 0.241961 0.0585938 0.203456 0.246584
> 22 16 1399 1383 0.24552 0.320312 0.158703 0.253264
> 23 16 1511 1495 0.253864 0.4375 0.164127 0.245372
> 24 16 1654 1638 0.266557 0.558594 0.086673 0.234202
> 25 16 1718 1702 0.265894 0.25 0.27639 0.230337
> 26 16 1720 1704 0.255967 0.0078125 1.58549 0.231933
> 27 16 1785 1769 0.255889 0.253906 0.140973 0.238474
> 28 16 1816 1800 0.251075 0.121094 0.179986 0.247877
> 29 16 1913 1897 0.25548 0.378906 0.103972 0.243542
> 30 16 2057 2041 0.265712 0.5625 0.071993 0.234536
> 31 16 2233 2217 0.279314 0.6875 0.090504 0.223128
> 32 16 2264 2248 0.274369 0.121094 0.11962 0.221455
> 33 16 2281 2265 0.268067 0.0664062 0.22729 0.232648
> 34 16 2281 2265 0.260183 0 - 0.232648
> 35 16 2298 2282 0.254646 0.0332031 0.227998 0.244199
> 36 16 2425 2409 0.261351 0.496094 0.116612 0.238956
> 37 16 2553 2537 0.267798 0.5 0.126228 0.232921
> 38 16 2652 2636 0.270926 0.386719 0.12253 0.229696
> 39 16 2665 2649 0.265281 0.0507812 0.205365 0.229567
> 2012-10-16 16:14:41.882716 min lat: 0.066844 max lat: 2.96921 avg lat: 0.234852
> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
> 40 16 2730 2714 0.264996 0.253906 0.116825 0.234852
> 41 16 2730 2714 0.258532 0 - 0.234852
> 42 16 2747 2731 0.253958 0.0332031 0.155447 0.245442
> 43 16 2844 2828 0.256862 0.378906 0.159537 0.242571
> 44 16 2892 2876 0.255285 0.1875 0.548109 0.242033
> 45 16 2955 2939 0.25508 0.246094 0.245377 0.243047
> 46 16 2957 2941 0.249705 0.0078125 1.33743 0.243768
> 47 16 3021 3005 0.24971 0.25 0.226326 0.249331
> 48 16 3021 3005 0.244508 0 - 0.249331
> 49 16 3067 3051 0.243184 0.0898438 0.218276 0.255766
> 50 16 3068 3052 0.238399 0.00390625 1.28465 0.256103
> 51 16 3068 3052 0.233724 0 - 0.256103
> 52 16 3085 3069 0.230506 0.0332031 0.78895 0.26712
> 53 16 3181 3165 0.233232 0.375 0.128619 0.267592
> 54 16 3244 3228 0.233469 0.246094 0.162226 0.265363
> 55 16 3245 3229 0.229295 0.00390625 1.23061 0.265662
> 56 16 3326 3310 0.23085 0.316406 0.130218 0.270394
> 57 16 3421 3405 0.233309 0.371094 0.271192 0.26772
> 58 16 3422 3406 0.229354 0.00390625 0.243273 0.267713
> 59 16 3472 3456 0.228777 0.195312 0.172423 0.272543
> 2012-10-16 16:15:01.885897 min lat: 0.066844 max lat: 2.96921 avg lat: 0.271267
> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
> 60 16 3536 3520 0.22913 0.25 0.245948 0.271267
> 61 16 3536 3520 0.225373 0 - 0.271267
> Total time run: 61.778350
> Total writes made: 3537
> Write size: 4096
> Bandwidth (MB/sec): 0.224
>
> Stddev Bandwidth: 0.199931
> Max bandwidth (MB/sec): 0.6875
> Min bandwidth (MB/sec): 0
> Average Latency: 0.278853
> Stddev Latency: 0.441319
> Max latency: 2.96921
> Min latency: 0.066844
>
> The biggest problem in my clusters is poor performance. Simple insert
> to the database on the client often takes seconds. It is unacceptable
> and I am trying to find the bottleneck. Could you please help me with
> it?
>
> Replica count is set to 2. Kernel 3.2.23, ceph 0.52, osd data are
> stored on the same partition (two 7200rpm disks in raid0) as journal
> files, filesystem btrfs. Filesystem on n11c1 is 4 months old, n12c1
> and n14c1 are around 1.5 month old, cc and n3cc is around 3 months
> old. Both clusters are healthy. Production have around 544 pgs and 46
> pools, staging 232 and 8 pools.

Also, this is interesting: what are the pools used for? With an
average of only about 11 PGs per pool on production, and just 4 OSDs,
you're likely to have some pretty uneven distributions of writes on a
per-pool basis, which would exacerbate any slow-OSD problem for the
pools that happen to map heavily onto that OSD.
-Greg
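
P.S. If you want to see how those pools and PGs are actually laid
out, the plain CLI dumps are enough (nothing version-specific here):

# ceph osd dump | grep '^pool'
# ceph pg dump

The first shows pg_num and the replication size for every pool, and
the second lists each PG together with the OSDs it maps to; counting
how many PGs from the busy pools land on each OSD will tell you
whether one disk is carrying most of the write load.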