Re: Disk/Pool Layout

Robert LeBlanc <robert@xxxxxxxxxxxxx> · Fri, 28 Aug 2015 10:12:18 -0600

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On Fri, Aug 28, 2015 at 2:25 AM, Jan Schermer  wrote:

> We are trying to get some 3610s in to test. I'm interested to know your results.
>
Exactly the same benchmark as yours
On AHCI (SATA2)
S3610 1200GB : 11000 IOPS with QD=1, 44000 with 32 jobs - this is skewed by the SATA2 on my AHCI test machine
S3700 120GB: I'll have to pull one from the cluster to refresh my memory. But I think it did 40K and up to 80K with higher number of jobs, it was much faster than anything else.
Kingston KC300: 2100 IOPS (ouch) and didn't scale with jobs
Kingston KC300 newer shitty version: 5200 IOPS and didn't scale
Samsung 845DC PRO: 14800 IOPS

If you got 11,000 write IOPs from the 3610 with QD=1, then that is twice what I got with the 3500 or 3700. Granted, I'm running all this testing on Atoms, so my numbers may not be comparable with other people's; I can only do relative testing within this environment. I tested a Kingston KC300 (different hardware, as this is currently in our monitors, and I didn't trim the drive) as well as some brand new OCZ drives. Intel is smoking these drives and I need to get our monitor drives replaced.

# jobs  IOPs  Bandwidth (KB/s)
Kingston KC300 (SKC300S37A120G) Max 4K RW 64,000
1        559    2,239.5
2        573    2,293.3
3        843    3,373.1
4      1,111    4,445.4
5      1,389    5,559.8
6      1,661    6,647.9
7      1,947    7,791.9
8      2,213    8,853.8

OCZ INTREPID 3700 Max 4K RW 9,000
1        277    1,108.4
2        409    1,638.9
3        288    1,153.8
4        323    1,292.2
5        343    1,373.1
6        368    1,474.4
7        381    1,526.1
8        393    1,573.8

OCZ SABER 1000 Max 4K RW 16,000
1        249      997.2
2        258    1,033.9
3        257    1,028.9
4        333    1,333.3
5        411    1,647.1
6        480    1,923.2
7        604    2,418.3
8        644    2,578.6
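
A quick sanity check on tables like these, as a minimal sketch: with the 4k block size used in this benchmark, bandwidth in KB/s should be close to IOPS times 4, and comparing the N-job result against N times the single-job result shows how well a drive scales. The function names `iops_to_kbps` and `scaling_pct` are made up for illustration, and the arithmetic is integer-only, so results are truncated.

```shell
# Expected bandwidth in KB/s for a given IOPS figure.
# Assumes every I/O is exactly one block of BLOCK_KB kilobytes (default 4, as with --bs=4k).
iops_to_kbps() {
    iops=$1
    block_kb=${2:-4}
    echo $(( iops * block_kb ))
}

# Scaling efficiency in percent (truncated): measured IOPS at N jobs
# compared with N times the single-job IOPS.
scaling_pct() {
    single=$1
    jobs=$2
    measured=$3
    echo $(( 100 * measured / (jobs * single) ))
}

iops_to_kbps 559          # prints 2236 (the KC300 table reports 2,239.5 for 1 job)
scaling_pct 559 8 2213    # prints 49 (KC300 at 8 jobs versus 8 x the 1-job figure)
```

The small gap between 2236 and the reported 2,239.5 is expected, since the IOPS column is itself rounded.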

I don't trust LSI and the numbers I'm getting from there. LSI+XFS/ext4/ext3+Kingston=98 IOPS in this benchmark, Intels drop down but are simply faster while Samsungs were unaffected by whatever issue is there. LSI support said my drives are not on HCL so I should go f*k myself basically.
My LSI is mpt2sas too - do you have a CentOS 6 machine somewhere that you can test on? It would be interesting to see if you have this issue as well. I don't know if it's present on CentOS 7.
Some combination of sync writes causes LSI to do something horrible to the drives that increases their latency to 5ms+, and this happens when, instead of testing the raw device, you test a file on the filesystem and do _not_ preallocate it.

I initially tested the Kingston drive while it was in a mirror with ext4 and running a Ceph monitor (healthy cluster). I got half the IOPs that I got when I broke it out of the mirror and ran the test on the raw device.
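
For anyone repeating the raw-device versus filesystem comparison, here is a minimal sketch that only prints the fio invocations (same parameters as the command quoted below in this thread) instead of running them, so the target can be checked first. The function name `journal_test_cmds` and the `/dev/sdX` placeholder are assumptions for illustration. Running the printed commands against a raw device overwrites its contents.

```shell
# Print (do not run) the fio invocations for 1..MAX_JOBS writers against TARGET.
# WARNING: executing the printed commands against a raw device destroys the data on it.
journal_test_cmds() {
    target=$1
    max_jobs=${2:-8}
    i=1
    while [ "$i" -le "$max_jobs" ]; do
        echo "fio --filename=$target --direct=1 --sync=1 --rw=write --bs=4k" \
             "--numjobs=$i --iodepth=1 --runtime=60 --time_based" \
             "--group_reporting --name=journal-test"
        i=$(( i + 1 ))
    done
}

# Review the output first; pipe it to sh only once the target is confirmed.
journal_test_cmds /dev/sdX 8
```

Pointing the target at a file on a mounted filesystem instead of the raw device gives the other half of the comparison.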

> for i in {1..8}; do fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=$i --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test; done
>
> # jobs  IOPs   Bandwidth (KB/s)
>
> Intel S3500 (SSDSC2BB240G4) Max 4K RW 7,500
> 1       5,617  22,468.0
> 2       8,326  33,305.0
> 3      11,575  46,301.0
> 4      13,882  55,529.0
> 5      16,254  65,020.0
> 6      17,890  71,562.0
> 7      19,438  77,752.0
> 8      20,894  83,576.0
>
> Intel S3700 (SSDSC2BA200G3) Max 4K RW 32,000
>  1      4,417  17,670.0
>  2      5,544  22,178.0
>  3      7,337  29,352.0
>  4      9,243  36,975.0
>  5     11,189  44,759.0
>  6     13,218  52,874.0
>  7     14,801  59,207.0
>  8     16,604  66,419.0
>  9     17,671  70,685.0
> 10     18,715  74,861.0
> 11     20,079  80,318.0
> 12     20,832  83,330.0
> 13     20,571  82,288.0
> 14     23,033  92,135.0
> 15     22,169  88,679.0
> 16     22,875  91,502.0
>
> > But too much  page cache = bad.
>
> I think /proc/sys/vm/min_free_kbytes helps.
> Nope. Had that set all the way up to 10G with no effect.
> One scenario (I think I described it here already) is when I start a new OSD. The new OSD needs to allocate ~2GB of memory and if it isn't truly "free" then it causes all sorts of problems (peering stuck, slow ops...). Lowering min_free_kbytes or dropping caches helps because it makes the memory actually available to the OSD and it starts right up, but that's not a nice solution.
> This is CentOS6/RHEL6 with 2.6.32 Redhat frankenkernel with backports and a lot of patches that interact in mysterious ways...
>
> This is good info. We are on CentOS 7.1 with 4.0.x kernel. Is starting OSDs the issue you had? I'm surprised that min_free_kbytes wouldn't help in this situation. Is there something else you found with too much page cache?
I'm not sure but I think min_free_kbytes doesn't help all kinds of allocations.
Yes, the issue is starting the OSD.
It works much better with KVM - if I have 100G of "pagecached" memory and start a 64GB VM, it just cleans the memory fast and starts.
If I have to start a 2GB OSD it struggles to get the memory and lags horribly.
I think the main difference is that KVM allocates the whole amount of memory at once and the kernel cleans that in one sweep, while the OSD allocates small blocks (small order allocs), so even if it got its memory from the min_free_kbytes pool it would have to wait for kswapd to punch the hole every time this memory drops down. Btw disabling tcmalloc helped in this case, something's rotten in here...
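
To see how much memory is truly free versus held in page cache before starting an OSD, a minimal sketch that reads the `MemFree` and `Cached` fields from `/proc/meminfo`. The function name `mem_summary` is made up for illustration; the file argument exists only so the function can be pointed at a saved copy.

```shell
# Summarise free versus page-cache memory from a meminfo-style file.
# Values are in kB, as reported by the kernel.
mem_summary() {
    awk '
        /^MemFree:/ { free = $2 }      # memory not used for anything
        /^Cached:/  { cached = $2 }    # page cache (does not match SwapCached:)
        END { printf "free_kb=%d cached_kb=%d\n", free, cached }
    ' "${1:-/proc/meminfo}"
}

mem_summary
# Related knobs discussed in this thread (both need root to change):
#   cat /proc/sys/vm/min_free_kbytes
#   sync; echo 1 > /proc/sys/vm/drop_caches    # drops clean page cache
```

Running it before and after dropping caches shows how much of the "used" memory was reclaimable page cache.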

Interesting, I haven't run into that problem. I have noticed that Ceph application pages of memory would be swapped out for page cache and it would cause problems with shutting down OSDs as the memory would have to be swapped back in. Since we have the OS on SATADOM, we just disabled swap altogether. We haven't had a problem with shutting down OSDs since then or starting them up. I might have to do some testing to see if I can replicate this.

> I agree about the separate partition; maybe it was a problem with the SSD cache, I don't remember the specifics. Your suggestion on disabling barriers piqued my interest. Initially we had barriers disabled, but since we don't have battery backed controllers we backed that setting out. Are you suggesting disabling barriers in all cases? I'd like to discuss the pros/cons of this option.
Just for the record - barriers don't really exist anymore, they were replaced by FUA and explicit flushes (though the effect should be the same).

You can disable barriers:
1) if your drives or controllers have non-volatile cache
2) if your filesystem is crash consistent without them (journal checkpoints, replays - ext4+journal_async_commit/journal_checksum might do it, but I spent some time reading through LKML and get the feeling this is guesswork and wishful thinking on the devs' part)
3) if you are 10000% sure all your nodes will not crash all at once. We have 3 datacenters a few km apart with one replica in each; in this case I could disable barriers safely. Care must be taken when the node crashes and you start it again - the filesystem might be corrupted even though the system thinks it's clean.
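
To audit which filesystems on a node are currently mounted without barriers, a minimal sketch that scans `/proc/mounts`-style input for the `nobarrier` or `barrier=0` options. Those are the option names ext4 and XFS accepted on kernels of this era; whether a given kernel reports them this way in `/proc/mounts` is an assumption to verify. The function name `no_barrier_mounts` is made up for illustration.

```shell
# Print the mount point of every filesystem mounted with barriers disabled.
# Input format is that of /proc/mounts: device mountpoint fstype options dump pass.
no_barrier_mounts() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++) {
            if (opts[i] == "nobarrier" || opts[i] == "barrier=0") {
                print $2
                break
            }
        }
    }' "${1:-/proc/mounts}"
}

no_barrier_mounts
```

An empty result means no mount in the input carries either option.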

Even with barriers there are some drives (consumer SATA drives and some SSDs) that will not actually flush the data even though they don't have a capacitor. Having those drives in the cluster is a gamble and you should disable write cache on them - sometimes that also doesn't work...

In my case with Kingstons, I can either have them perform at 98 IOPS bottom line (but observed with real workload!!!) or disable barriers and then they perform ~3K IOPS. I only do that now when recovering and only on one node at a time. Hopefully I'll be Kingston-free in two weeks and can sleep well again.

TL;DR: If you have drives that work well enough with barriers enabled, don't disable them. Even with barriers enabled some assumptions the filesystems make are incorrect - for example the writes are not guaranteed to be atomic and are reordered. The only real solution is using a filesystem with checksums like ZFS and more replicas, or you will lose data to badly behaving caches or bit-rot.

I think that you've just confirmed my suspicions. Too many things have to go right to disable barriers. With the Intels, it doesn't cause that much pain so I'm just going to leave it on. It would be nice for btrfs to get all the bugs shaken out to be able to detect bit rot and have a modern file system overall.

- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.0.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV4IhfCRDmVDuy+mK58QAAAdAP/RM/gEukaQUpaiRyR14t
AOIZcKY+5vRSct0pBEVsBuXENRdnZk2miBxt4RhFuRzwh8RGxxpdkqVci0Cf
K4wH5MiMkklLskzFkekG0gitIh+DuY+dVWp6GFE7TnDPAyKZ/2gnMFinsx8C
tZwPIM1GUETN7nNZcWcE0+QrSanxZBESva8f0uJN3vaFxnO455sjniluj79S
mLQMkEJknG+os8rXC51V3KNp11q33Gdo1Wkb//0KmYg63zk20V/wJ+h6dwH2
paeOc6su/dKq0Dv8WyU3yTp1RPHU+HGZrIFLxp/PlJUAAW4R1N6uKv8yy0Kh
ecbfSre2HWVbB0qjxekLmol2xw8obOVWthYVw2iGq03MdC1+brqIgvmEia5E
vGhdSrQC9IHfHqcdQvicSbWAa7Hzm+ZhoAGEamW28lRzjeUI1FQHAiQa+Awu
KjbLt8Tc2xBhJhUMwtVyBAtL+gh0pmpUI+IaEa9zsBCJ6Hlv0IEaEU3FYqXH
lGts/OntkEJe7nMJM8BgmZzyhDCQwBe2ZokAvUQit/nhS6LJhXka7l3fut/b
OXwN18HWI/gPia9Q69CZEuDbwWS6mRO9eulybYT/TaSpUu1C3HPtpHb7DPGL
40uLIgKSBkD0Bszu8+smE0xbB5jaR+Twvjz8exitRCfYUhzkOHAelPPSY0n/
4emS
=P+gj
-----END PGP SIGNATURE-----
