I suggest setting logging to 0/5 on everything. Depending on your desire
for reliability and availability, you may want to change your pool
min_size/size to 2/4 and adjust your CRUSH map to include rack. Then
instruct CRUSH to place two copies in each rack. That way, if you lose
power to a rack, you can still continue with minimal interruption. You
would want a rule similar to this:

    rule replicated_ruleset {
            ruleset 0
            type replicated
            min_size 1
            max_size 10
            step take default
            step choose firstn 2 type rack
            step chooseleaf firstn 2 type host
            step emit
    }

I would also set:

    mon osd down out subtree limit = host

so that if you lose power in a rack it won't try to recover. If you only
have two racks, this is not an issue. If you move to three racks, then
you can adjust the min_size/size to 2/3 and adjust the rule to:

    rule replicated_ruleset {
            ruleset 0
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type rack
            step emit
    }

Other than that, the defaults are pretty good.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
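
The reasoning behind the 2/4 and 2/3 suggestions above can be checked with a small sketch. This is only a toy model, not CRUSH itself: the helper names, rack names, and host names are made up for illustration, and the placement simply takes the first racks and hosts in sorted order, whereas real CRUSH selects them pseudo-randomly. It only verifies that, after losing any single rack, the number of surviving copies is still at least min_size.

```python
def place_replicas(topology, racks_to_pick, hosts_per_rack):
    """Toy stand-in for 'choose firstn N type rack' followed by
    'chooseleaf firstn M type host'. Hypothetical helper: it picks the
    first racks/hosts in sorted order instead of hashing like CRUSH."""
    placement = []
    for rack in sorted(topology)[:racks_to_pick]:
        for host in sorted(topology[rack])[:hosts_per_rack]:
            placement.append((rack, host))
    return placement


def survives_any_rack_loss(placement, min_size):
    """True if losing any one rack still leaves at least min_size copies."""
    racks = {rack for rack, _host in placement}
    return all(
        sum(1 for rack, _host in placement if rack != lost) >= min_size
        for lost in racks
    )


# Hypothetical layouts for the 8 OSD servers mentioned in the thread.
two_racks = {
    "rack1": ["host1", "host2", "host3", "host4"],
    "rack2": ["host5", "host6", "host7", "host8"],
}
three_racks = {
    "rack1": ["host1", "host2", "host3"],
    "rack2": ["host4", "host5", "host6"],
    "rack3": ["host7", "host8"],
}

# size 4, two copies per rack, min_size 2
two_rack_placement = place_replicas(two_racks, racks_to_pick=2, hosts_per_rack=2)
# size 3, one copy per rack, min_size 2
three_rack_placement = place_replicas(three_racks, racks_to_pick=3, hosts_per_rack=1)
# size 3 squeezed into two racks (2 + 1), for contrast
uneven_placement = [("rack1", "host1"), ("rack1", "host2"), ("rack2", "host5")]

print(survives_any_rack_loss(two_rack_placement, min_size=2))
print(survives_any_rack_loss(three_rack_placement, min_size=2))
print(survives_any_rack_loss(uneven_placement, min_size=2))
```

The last case shows why size 3 across only two racks is weaker: if the rack holding two of the copies goes down, a single copy remains, which is below min_size 2.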
On Thu, Aug 27, 2015 at 1:42 PM, German Anders <ganders@xxxxxxxxxxxx> wrote:
Thanks a lot Robert and Jan for the comments about the available and possible disk layouts. Is there any advice from the point of view of configuration? Any tunable parameters, CRUSH algorithm?

Thanks a lot,

Best regards,

German
2015-08-27 16:37 GMT-03:00 Robert LeBlanc <robert@xxxxxxxxxxxxx>:

On Thu, Aug 27, 2015 at 1:13 PM, Jan Schermer wrote:
>
>> On 27 Aug 2015, at 20:57, Robert LeBlanc wrote:
>>
>> On Thu, Aug 27, 2015 at 10:25 AM, Jan Schermer wrote:
>>> Some comments inline.
>>> A lot of it depends on your workload, but I'd say you almost certainly need
>>> higher-grade SSDs. You can save money on memory.
>>>
>>> What will be the role of this cluster? VM disks? Object storage?
>>> Streaming?...
>>>
>>> Jan
>>>
>>> On 27 Aug 2015, at 17:56, German Anders wrote:
>>>
>>> Hi all,
>>>
>>> I'm planning to deploy a new Ceph cluster with IB FDR 56Gb/s and I've the
>>> following HW:
>>>
>>> 3x MON Servers:
>>> 2x Intel Xeon E5-2600@v3 8C
>>
>> This is overkill if only a monitor server.
>
> Maybe with newer releases of Ceph, but my Mons spin CPU pretty high (100% core, which means it doesn't scale that well with cores), and when adding/removing OSDs or shuffling data some of the peering issues I've seen were caused by lagging Mons.

If I remember right, you have a fairly large cluster. This is a pretty small cluster, so probably OK with less CPU. Are you running Dumpling? I haven't seen many issues with Hammer.

>
>>
>>>
>>> 256GB RAM
>>>
>>> I don't think you need that much memory, 64GB should be plenty (if that's
>>> the only role for the servers).
>>
>> If it is only monitor, you can get by with even less.
>>
>>>
>>> 1xIB FDR ADPT-DP (two ports for PUB network)
>>> 1xGB ADPT-DP
>>>
>>> Disk Layout:
>>>
>>> SOFT-RAID:
>>> SCSI1 (0,0,0) (sda) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
>>> SCSI2 (0,0,0) (sdb) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
>>>
>>> I 100% recommend going with SSDs for the /var/lib/ceph/mon storage, fast
>>> ones (but they can be fairly small). Should be the same grade as journal
>>> drives IMO.
>>> NOT S3500!
>>> I can recommend S3610 (just got some :)), Samsung 845 DC PRO. At least 1
>>> DWPD rating, better go with 3 DWPD.
>>
>> S3500 should be just fine here. I get 25% better performance on the
>> S3500 vs the S3700 doing sync direct writes. Write endurance should be
>> just fine as the volume of data is not going to be that great. Unless
>> there is something else I'm not aware of.
>>
>
> S3500 is faster than S3700? I can compare 3700 x 3510 x 3610 tomorrow but I'd be very surprised if the S3500 had a _sustained_ throughput better than 36xx or 37xx. Were you comparing that on the same HBA and in the same way? (No offense, just curious)

None taken. I used the same box and swapped out the drives. The only difference was the S3500 has been heavily used, the 3700 was fresh from the package (if anything that should have helped the S3700).

    # Caution: this writes directly to the raw device /dev/sda and
    # destroys any data on it.
    for i in {1..8}; do
        fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k \
            --numjobs=$i --iodepth=1 --runtime=60 --time_based \
            --group_reporting --name=journal-test
    done

    # jobs    IOPS      Bandwidth (KB/s)

    Intel S3500 (SSDSC2BB240G4)    Max 4K RW 7,500
     1         5,617    22,468.0
     2         8,326    33,305.0
     3        11,575    46,301.0
     4        13,882    55,529.0
     5        16,254    65,020.0
     6        17,890    71,562.0
     7        19,438    77,752.0
     8        20,894    83,576.0

    Intel S3700 (SSDSC2BA200G3)    Max 4K RW 32,000
     1         4,417    17,670.0
     2         5,544    22,178.0
     3         7,337    29,352.0
     4         9,243    36,975.0
     5        11,189    44,759.0
     6        13,218    52,874.0
     7        14,801    59,207.0
     8        16,604    66,419.0
     9        17,671    70,685.0
    10        18,715    74,861.0
    11        20,079    80,318.0
    12        20,832    83,330.0
    13        20,571    82,288.0
    14        23,033    92,135.0
    15        22,169    88,679.0
    16        22,875    91,502.0

>
> Mons can use some space, I've experienced logging havoc, leveldb bloating havoc (I have to compact manually or it just grows and grows), and my Mons write quite a lot at times. I guesstimate my mons can write 200GB a day, often less but often more. Maybe that's not normal. I can confirm those numbers tomorrow.

True, I haven't had the compact issues so I can't comment on that.
He has a small cluster so I don't think he will get to the level you have.

>
>>>
>>> 8x OSD Servers:
>>> 2x Intel Xeon E5-2600@v3 10C
>>>
>>> Go for the fastest you can afford if you need the latency - even at the
>>> expense of cores.
>>> Go for cores if you want bigger throughput.
>>
>> I'm in the middle of my testing, but it seems that with lots of I/O
>> depth (either from a single client or multiple clients) that clock
>> speed does not have as much of an impact as core count does. Once I'm
>> done, I'll be posting my results. Unless you have a single client that
>> has a QD=1, go for cores at this point.
>
> NoSQL is basically still a database, and while NoSQL is mostly more modern stuff which is built for clouds and horizontal scaling, you still need some baseline performance to achieve good durability/replication and stuff.
>
>>
>>>
>>> 256GB RAM
>>>
>>> Again - I think too much if that's the only role for those nodes, 64GB
>>> should be plenty.
>>
>> Agree, if you can afford more RAM, it just means more page cache.
>
> But too much page cache = bad. I think /proc/sys/vm/min_free_kbytes helps.
>
>>
>>>
>>> 1xIB FDR ADPT-DP (one port for PUB and one for CLUS network)
>>> 1xGB ADPT-DP
>>>
>>> Disk Layout:
>>>
>>> SOFT-RAID:
>>> SCSI1 (0,0,0) (sda) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
>>> SCSI2 (0,0,0) (sdb) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
>>>
>>> JBOD:
>>> SCSI9 (0,0,0) (sdd) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)
>>> SCSI9 (0,1,0) (sde) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)
>>> SCSI9 (0,2,0) (sdf) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)
>>>
>>> No no no. Those SSDs will die a horrible death, too little endurance.
>>> Better go with 2x 3700 in RAID1 and partition them for journals. Or just
>>> don't use journaling drives and buy better SSDs for storage.
>>
>> If he is only using these for journals, he can be just fine.
>> He can get the same endurance as the S3700 by only using a portion of the
>> drive space. [1][2]
>
> True for the 120GB drives. You only really need something like 1-10GB at most.
> I'd still get a smaller higher-class drive and just not touch provisioning, if only for the sake of warranty. But I think it's easier to just skip dedicated journal drives in this case.

I think I remember someone saying that journals on separate SSDs gave them better performance than journals co-located on the SSD, though I don't remember for sure. If warranty replacement is your primary concern, then go with the 3700. If they already have the 3500, they can get it to perform/endure like the 3700, with the only cost being disk space.

>
> NoSQL is very write intensive - depending on implementation (applications) of course. But it's not unusual to have 300MB of semi-structured data and 100GB indexes that are rebuilt all the time (of course that indicates the developers were just lazy/stupid, which is exactly why NoSQL is so popular and Agile :)).

Understandable. Our cluster is primarily write because reads are being served out of all the layers of cache. Overprovisioned 3500s will work just as well as the 3700.
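
To put the endurance discussion in the thread into numbers, here is a rough sketch of the drive-writes-per-day arithmetic. It takes Jan's guesstimate of 200GB written per day by a mon and the 120GB drive size from the hardware list, and compares the result with the 1 DWPD and 3 DWPD ratings he mentions. The function names are invented for illustration, and the calculation is an assumption-laden simplification: it ignores write amplification, overprovisioning, and the warranty period.

```python
def required_dwpd(daily_writes_gb, drive_capacity_gb):
    """Drive writes per day implied by a daily write volume:
    the data written per day divided by the drive capacity."""
    if drive_capacity_gb <= 0:
        raise ValueError("drive capacity must be positive")
    return daily_writes_gb / drive_capacity_gb


def meets_rating(daily_writes_gb, drive_capacity_gb, rated_dwpd):
    """True if the drive's rated DWPD covers the implied daily writes."""
    return required_dwpd(daily_writes_gb, drive_capacity_gb) <= rated_dwpd


# Figures quoted in the thread: ~200GB/day of mon writes, 120GB drives.
needed = required_dwpd(daily_writes_gb=200, drive_capacity_gb=120)
print(f"implied load: {needed:.2f} DWPD")
print("covered by a 1 DWPD drive:", meets_rating(200, 120, rated_dwpd=1))
print("covered by a 3 DWPD drive:", meets_rating(200, 120, rated_dwpd=3))
```

On a small cluster the daily write volume may be far lower than 200GB, which is Robert's point; the same function can be rerun with a measured figure.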
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
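
The fio figures quoted in the thread can be compared job count by job count with a short sketch. The IOPS values below are copied from Robert's table for 1 to 8 jobs; the dictionary and function names are made up for illustration. It computes how far ahead the S3500 was, in percent, at each job count in that particular test, and says nothing about sustained throughput or endurance.

```python
# IOPS from the sync direct 4k write test in the thread, keyed by --numjobs.
S3500_IOPS = {1: 5617, 2: 8326, 3: 11575, 4: 13882,
              5: 16254, 6: 17890, 7: 19438, 8: 20894}
S3700_IOPS = {1: 4417, 2: 5544, 3: 7337, 4: 9243,
              5: 11189, 6: 13218, 7: 14801, 8: 16604}


def advantage_percent(a_iops, b_iops):
    """How much higher a_iops is than b_iops, in percent."""
    return (a_iops / b_iops - 1.0) * 100.0


def compare(a_results, b_results):
    """Per-job-count advantage of drive A over drive B for the
    job counts that both result sets contain."""
    common_jobs = sorted(set(a_results) & set(b_results))
    return {
        jobs: advantage_percent(a_results[jobs], b_results[jobs])
        for jobs in common_jobs
    }


s3500_vs_s3700 = compare(S3500_IOPS, S3700_IOPS)
for jobs, pct in s3500_vs_s3700.items():
    print(f"{jobs} job(s): S3500 ahead by {pct:.1f}%")
```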
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com