On 20/05/2015 18:50, Glyn Astill wrote:
From: Thomas SIMON <tsimon@xxxxxxxxxxx>
To: glynastill@xxxxxxxxxxx
Cc: "pgsql-admin@xxxxxxxxxxxxxx" <pgsql-admin@xxxxxxxxxxxxxx>
Sent: Wednesday, 20 May 2015, 16:41
Subject: Re: Performances issues with SSD volume ?
Hi Glyn,
I'll try to answer these points.
I ran some benchmarks, and indeed 3.2 is not helping. Not helping at all.
I switched to 3.14 and the gap is quite big!
With the pgbench RW test: 3.2 --> 4200 TPS; 3.14 --> 6900 TPS under the
same conditions.
With the pgbench RO test: 3.2 --> 37000 TPS; 3.14 --> 95000 TPS, same
conditions too.
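To put the gap in relative terms, here is a minimal sketch that computes the speedup from the TPS figures quoted above. The helper name `speedup` is only illustrative; the numbers are the ones reported in this message.

```python
# TPS figures quoted above (pgbench, same conditions on both kernels).
results = {
    "RW": {"3.2": 4200, "3.14": 6900},
    "RO": {"3.2": 37000, "3.14": 95000},
}


def speedup(old_tps, new_tps):
    """Return how many times faster the new kernel is than the old one."""
    return new_tps / old_tps


for test, tps in results.items():
    ratio = speedup(tps["3.2"], tps["3.14"])
    print(f"{test}: {ratio:.2f}x ({(ratio - 1) * 100:.0f}% more TPS)")
```

That works out to roughly 1.6x on the read/write test and 2.6x on the read-only test.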
That's a start then.
So it should be better, but when the server was in production, even with
the bad kernel, performance was quite good at first before it quickly
degraded.
So I also think I have another configuration problem.
You say you're IO bound, so some output from sar / iostat / dstat and
pg_stat_activity etc before and during the issue would be of use.
-> My server is not in production right now, so it is difficult to
replay the production load and collect useful metrics.
The best way I've found is to replay traffic from the logs with pgreplay.
I hoped the server would degrade again while replaying this traffic, but
it never happens... Another thing I can't understand...
Below is my dstat output while replaying this traffic (so while the
server runs normally).
Unfortunately I no longer have any output from when the server's
performance degraded.
It's a shame we can't get any insight into activity on the server during the issues.
Other things you asked
System memory size : 256 GB
SSD Model numbers and how many : 4 SSD disks ; RAID 10 ; model
INTEL SSDSC2BB480G4
Raid controller : MegaRAID SAS 2208
Partition alignments and stripe sizes : see fdisk below
Kernel options : the config file is here :
ftp://ftp.ovh.net/made-in-ovh/bzImage/3.14.43/config-3.14.43-xxxx-std-ipv6-64
Filesystem used and mount options : ext4, see mtab below
IO Scheduler : noop [deadline] cfq for my ssd raid volume
Postgresql version and configuration : 9.3.5
max_connections=1800
shared_buffers=8GB
temp_buffers=32MB
work_mem=100MB
maintenance_work_mem=12GB
bgwriter_lru_maxpages=200
effective_io_concurrency=4
wal_level=hot_standby
wal_sync_method=fdatasync
wal_writer_delay=2000ms
commit_delay=1000
checkpoint_segments=80
checkpoint_timeout=15min
checkpoint_completion_target=0.7
archive_command='rsync ....'
max_wal_senders=10
wal_keep_segments=38600
vacuum_defer_cleanup_age=100
hot_standby = on
max_standby_archive_delay = 5min
max_standby_streaming_delay = 5min
hot_standby_feedback = on
random_page_cost = 1.0
effective_cache_size = 240GB
log_min_error_statement = warning
log_min_duration_statement = 0
log_checkpoints = on
log_connections = on
log_disconnections = on
log_line_prefix = '%m|%u|%d|%c|'
log_lock_waits = on
log_statement = 'all'
log_timezone = 'localtime'
track_activities = on
track_functions = pl
track_activity_query_size = 8192
autovacuum_max_workers = 5
autovacuum_naptime = 30s
autovacuum_vacuum_threshold = 40
autovacuum_analyze_threshold = 20
autovacuum_vacuum_scale_factor = 0.10
autovacuum_analyze_scale_factor = 0.10
autovacuum_vacuum_cost_delay = 5ms
default_transaction_isolation = 'read committed'
max_locks_per_transaction = 128
Connection pool sizing (pgpool2)
num_init_children = 1790
max_pool = 1
1800 is quite a lot of connections, and with max_pool=1 in pgpool you're effectively just using pgpool as a proxy (as I recall, my memory is a little fuzzy on pgpool now). Unless your app is stateful in some way or has unique users for each of those 1800 connections you should lower the quantity of active connections. A general starting point is usually cpu cores * 2, so you could up max_pool and divide num_init_children by the same amount.
Hard to say what you need to do without knowing what exactly you're doing though. What's the nature of the app(s)?
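The sizing rule mentioned above (start from cpu cores * 2, then raise max_pool and divide num_init_children by the same amount) can be sketched as simple arithmetic. This is only an illustration of that rule of thumb, not a pgpool recommendation: the core count and the factor below are hypothetical, since neither is given in the thread, and the function names are made up for the example.

```python
def starting_active_connections(cpu_cores):
    """Rule of thumb quoted above: cpu cores * 2."""
    return cpu_cores * 2


def rebalance_pool(num_init_children, max_pool, factor):
    """Raise max_pool by `factor` and divide num_init_children by it."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return max(1, num_init_children // factor), max_pool * factor


cores = 32   # hypothetical: the core count is not stated in the thread
factor = 10  # hypothetical scaling factor

print("starting point for active connections:",
      starting_active_connections(cores))

# Values from the pgpool2 configuration quoted above.
children, pool = rebalance_pool(1790, 1, factor)
print(f"num_init_children = {children}, max_pool = {pool}")
```

Whether such a change suits the application depends on how the 1800 connections are actually used, which is the question raised above.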
Yes, we just use it as a proxy for now.
We have approximately 100 different active users, each of them opening a
varying number of connections (twisted + zope apps).
The result is ~900 idle connections for ~60 active connections, but
sometimes (when stopping/starting prod) we need almost double the
connections because some twisted services don't close their connections
immediately.
But this is the current (working) configuration, and I don't think my
disk performance problem is related to this.
I'm also adding the megacli parameters:
Virtual Drive: 2 (Target Id: 2)
Name :datassd
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 893.25 GB
Sector Size : 512
Is VD emulated : Yes
Mirror Data : 893.25 GB
State : Optimal
Strip Size : 256 KB
Number Of Drives per span:2
Span Depth : 2
Default Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write
Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write
Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy : Enabled
Encryption Type : None
Bad Blocks Exist: No
PI type: No PI
Is VD Cached: No
Not using your RAID controller's write cache then? Not sure just how important that is with SSDs these days, but if you've got a BBU, set it to "WriteBack". Also change "Cache if Bad BBU" to "No Write Cache if Bad BBU" if you do that.
No, I had read some megacli-related docs about SSDs, and the advice was
to use WriteThrough on the disks (see the last section of
http://wiki.mikejung.biz/LSI#Configure_LSI_Card_for_SSD_RAID).
The disks are already in "No Write Cache if Bad BBU" mode (it was split
across two lines in my extract).
Other outputs :
fdisk -l
Disk /dev/sdc: 959.1 GB, 959119884288 bytes
255 heads, 63 sectors/track, 116606 cylinders, total 1873281024 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
Disk /dev/mapper/vg_datassd-lv_datassd: 751.6 GB, 751619276800 bytes
255 heads, 63 sectors/track, 91379 cylinders, total 1468006400 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
cat /etc/mtab
/dev/mapper/vg_datassd-lv_datassd /datassd ext4
rw,relatime,discard,nobarrier,data=ordered 0 0
(I added the nobarrier option)
cat /sys/block/sdc/queue/scheduler
noop [deadline] cfq
You could swap relatime for noatime,nodiratime.
I'll swap to noatime, thanks.
sysctl kernel | grep sched
kernel.sched_child_runs_first = 0
kernel.sched_rr_timeslice_ms = 25
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
I've read some advice about setting kernel.sched_autogroup_enabled=0
and kernel.sched_migration_cost_ns=5000000, but these parameters are not
recognized by my kernel. So I don't know what to do about that...
sched_migration_cost_ns would be called sched_migration_cost in your old 3.2 kernel, not sure why
sched_autogroup_enabled wouldn't be recognized though.
I've found in the config file that "CONFIG_SCHED_AUTOGROUP is not set"
in this kernel.
So I guess it is the same thing as if it were enabled=0, right?
I haven't found any parameter related to migration_cost in the config
file.
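Since the knob is named differently depending on the kernel (sched_migration_cost_ns versus sched_migration_cost, as noted above), a small sketch like the following can check which name, if any, the running kernel exposes. It assumes the usual mapping of kernel.* sysctls to files under /proc/sys/kernel; the function name `find_sched_knob` is made up for the example.

```python
import os

# Both spellings mentioned in this thread, newest first.
CANDIDATES = ("sched_migration_cost_ns", "sched_migration_cost")


def find_sched_knob(candidates=CANDIDATES, root="/proc/sys/kernel"):
    """Return the first candidate name that exists under root, or None."""
    for name in candidates:
        if os.path.exists(os.path.join(root, name)):
            return name
    return None


name = find_sched_knob()
if name is None:
    print("no migration cost sysctl is exposed by this kernel")
else:
    print(f"this kernel exposes kernel.{name}")
```

If neither name is found, the kernel does not expose the setting through sysctl at all, which matches the "not recognized" error described above.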
sysctl vm
vm.admin_reserve_kbytes = 8192
vm.block_dump = 0
vm.dirty_background_bytes = 8388608
vm.dirty_background_ratio = 0
vm.dirty_bytes = 67108864
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 0
vm.dirty_writeback_centisecs = 500
vm.drop_caches = 3
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 256 256 32
vm.max_map_count = 65530
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 65008
vm.min_slab_ratio = 5
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 4096
vm.nr_pdflush_threads = 0
vm.numa_zonelist_order = default
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 2
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.panic_on_oom = 0
vm.percpu_pagelist_fraction = 0
vm.scan_unevictable_pages = 0
vm.stat_interval = 1
vm.swappiness = 60
vm.user_reserve_kbytes = 131072
vm.vfs_cache_pressure = 100
vm.zone_reclaim_mode = 0
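For readability, the two writeback thresholds in the output above can be converted from bytes. With vm.dirty_ratio and vm.dirty_background_ratio both at 0, it is the *_bytes values that apply. A minimal sketch (the helper `to_mib` is only illustrative):

```python
def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / (1024 * 1024)


# Values copied from the sysctl vm output above.
settings = {
    "vm.dirty_background_bytes": 8388608,
    "vm.dirty_bytes": 67108864,
}

for name, value in settings.items():
    print(f"{name} = {value} bytes = {to_mib(value):.0f} MiB")
```

So background writeback starts at 8 MiB of dirty data, and writers are throttled at 64 MiB.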
select * from pg_stat_activity
I've got hundreds of entries for that when I'm in production, and I
can't paste them here for confidentiality reasons.
It's usually around 50 million queries per day (35% selects ; 55%
updates & 5% inserts).
lspci | grep -E 'RAID|SCSI|IDE|SATA'
00:1f.2 SATA controller: Intel Corporation C600/X79 series chipset
6-Port SATA AHCI Controller (rev 06)
02:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208
[Thunderbolt] (rev 05)
07:00.0 Serial Attached SCSI controller: Intel Corporation C602 chipset
4-Port SATA Storage Control Unit (rev 06)
Thanks
Thomas
On 18/05/2015 16:29, Glyn Astill wrote:
From: Koray Eyidoğan <korayey@xxxxxxxxx>
To: Thomas SIMON <tsimon@xxxxxxxxxxx>
Cc: pgsql-admin@xxxxxxxxxxxxxx
Sent: Monday, 18 May 2015, 14:51
Subject: Re: Performances issues with SSD volume ?
Hi Thomas,
3.2 kernel may be #1 cause of your I/O load problem:
http://www.databasesoup.com/2014/09/why-you-need-to-avoid-linux-kernel-32.html
https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4
Have a nice day.
Koray
Likely the 3.2 kernel isn't helping, but I think we need much more
information before jumping to conclusions.
You say you're IO bound, so some output from sar / iostat / dstat and
pg_stat_activity etc before and during the issue would be of use.
Also:
System memory size
SSD Model numbers and how many
Raid controller
Partition alignments and stripe sizes
Kernel options
Filesystem used and mount options
IO Scheduler
Postgresql version and configuration
Connection pool sizing
Perhaps you could show us the output of some of these:
fdisk -l
cat /etc/mtab
cat /sys/block/<ssd device>/queue/scheduler
sysctl kernel | grep sched
sysctl vm
select * from pg_stat_activity
select name, setting from pg_settings
lspci | grep -E 'RAID|SCSI|IDE|SATA'
--
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin