I have a large database, 1.3 TB, with quite a bit of write activity. The machine has 2 CPUs x 6 cores x 2 threads (2 x E5-2630 v2 @ 2.60GHz) and 4 x EVO Pro 2TB SSDs in a software RAID 1+0 configuration on a SATA 3 controller. The machine has a lot of memory, 384 GB, so it doesn't do a lot of reads, as most of the frequently accessed data is cached. Reads are between 5-10 MB/s on each SSD, about 20-30 MB/s in total for the RAID volume. Each SSD writes on average 30-40 MB/s, so the RAID writes to its 2 mirror pairs at 60-80 MB/s each, for 120-160 MB/s of total writes.

According to AnandTech benchmarks, those SSDs should be able to do around 90 MB/s of random reads and 250 MB/s of random writes at a 4KB block size, so in theory I shouldn't hit I/O limits even if all I/O were random and very small. I estimate the SSDs should be about 10% busy from the reads and 15% busy from the writes, so maybe 25% total. However, sar -d shows %util on those SSDs between 70% and 80% under the normal daily load. With SSDs you can't fully trust the %util reported by sar, because SSDs perform best at large queue sizes, and the average queue size reported by sar is between 5 and 10, so I'm not entirely sure whether to trust the %util values. Then again, a queue size of 8 or larger should be enough to keep most SSDs busy, and in that case I would expect %util to be 100% all the time.

Still, based on what sar is showing, I am concerned about hitting an I/O performance wall as the activity on that database increases in the future, and I was trying to determine the actual limits, especially for writes. While doing that, I started to investigate the checkpoint behavior.

The database is configured with (the planned changes are summarized right after the idle-morning checkpoint log below):
- the default checkpoint timeout of 5 minutes (not set explicitly). It's probably not long enough; I plan to increase it to something like 30 minutes or more.
- checkpoint_warning at the default of 30 seconds. I'm not sure if that is measured between the end of the previous checkpoint and the start of the next one, or between the start times (or end times) of two consecutive checkpoints.
- checkpoint_completion_target of 0.85, so it should spread the writes over 4 minutes 15 seconds of every 5-minute interval.
- max_wal_size = 16 GB; I plan to increase it to something like 96 GB.
- 96 GB of shared buffers. I may increase them to something like 144 GB.
- bgwriter with 400 max pages and the default 200 ms delay. That allows at most 400 x 5 x 8 kB pages per second, so about 16 MB/s from the bgwriter. It's probably too much and I should have left it at the default of 100 pages, or even lower. In fact I'm not even sure why I need the bgwriter process at all instead of relying on checkpoints for writes. In any case, even with 400 pages it should still leave some work for the checkpoint process.
- wal_buffers of 32 MB. I had it as high as 256 MB at some point, but it seems values larger than 16 MB are useless, so I will reduce it to 16 MB.

The log of the last checkpoints looks like this when the machine is idle in the morning:

2017-10-04 08:43:00.283 EDT [7831] LOG: checkpoint starting: time
2017-10-04 08:47:15.887 EDT [7831] LOG: checkpoint complete: wrote 19290 buffers (0.2%); 0 transaction log file(s) added, 4 removed, 0 recycled; write=254.787 s, sync=0.794 s, total=255.604 s; sync files=274, longest=0.033 s, average=0.002 s; distance=61864 kB, estimate=322398 kB

So it does take the 5-minute timeout and the 0.85 completion target into account correctly: 300 x 0.85 = 255 seconds, and it did indeed spread the writes over 255 seconds.
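To make the plan above concrete, these are the changes I'm considering, written out as the statements I'd run (or the equivalent postgresql.conf edits). None of them are applied yet, and the exact values are still open:

ALTER SYSTEM SET checkpoint_timeout = '30min';    -- from the default 5min, maybe even longer
ALTER SYSTEM SET max_wal_size = '96GB';           -- from 16GB
ALTER SYSTEM SET shared_buffers = '144GB';        -- from 96GB, still undecided; needs a restart
ALTER SYSTEM SET bgwriter_lru_maxpages = 100;     -- back from 400, or possibly lower
ALTER SYSTEM SET wal_buffers = '16MB';            -- from 32MB; needs a restart

checkpoint_completion_target would stay at 0.85 and bgwriter_delay at the default 200ms.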
However, later during the day, as the activity increases, it starts to look like this:

2017-10-04 08:48:00.932 EDT [7831] LOG: checkpoint starting: time
2017-10-04 08:55:27.179 EDT [7831] LOG: checkpoint complete: wrote 635481 buffers (5.1%); 0 transaction log file(s) added, 4 removed, 0 recycled; write=442.133 s, sync=4.003 s, total=446.246 s; sync files=445, longest=0.118 s, average=0.008 s; distance=5227434 kB, estimate=5227434 kB

First, a checkpoint finished in 446 seconds instead of 255 seconds. That is 635481 buffers in 446 seconds, or 1425 buffers/second, around 12 MB/second. I would say that's not a huge amount of I/O for this hardware. Then the next one starts immediately:

2017-10-04 08:55:27.180 EDT [7831] LOG: checkpoint starting: xlog time
2017-10-04 09:05:23.391 EDT [7831] LOG: checkpoint complete: wrote 2090424 buffers (16.6%); 0 transaction log file(s) added, 319 removed, 0 recycled; write=595.690 s, sync=0.468 s, total=596.211 s; sync files=454, longest=0.020 s, average=0.001 s; distance=11415997 kB, estimate=11415997 kB

This one takes even longer, 596 seconds: 2090424 buffers, about 3500 buffers/second, around 29 MB/second. That should still not be a big deal; it would mean writing at around 15 MB/s on each SSD just for checkpoints. So I'm not sure what is going on here. I could understand the checkpoint process being surprised by the sudden increase in writes at the beginning of the day, with the first 2-3 checkpoints taking more than the configured duration. But even the following checkpoints show the same pattern of taking more than 255 seconds, and sometimes more than 300 seconds:

2017-10-04 09:05:23.392 EDT [7831] LOG: checkpoint starting: xlog time
2017-10-04 09:12:54.799 EDT [7831] LOG: checkpoint complete: wrote 1342046 buffers (10.7%); 0 transaction log file(s) added, 697 removed, 0 recycled; write=450.554 s, sync=0.791 s, total=451.406 s; sync files=706, longest=0.014 s, average=0.001 s; distance=8478044 kB, estimate=11122202 kB
2017-10-04 09:12:54.801 EDT [7831] LOG: checkpoint starting: xlog time
2017-10-04 09:17:49.897 EDT [7831] LOG: checkpoint complete: wrote 1220559 buffers (9.7%); 0 transaction log file(s) added, 376 removed, 141 recycled; write=292.734 s, sync=0.677 s, total=295.096 s; sync files=690, longest=0.026 s, average=0.000 s; distance=8551073 kB, estimate=10865089 kB
2017-10-04 09:17:54.903 EDT [7831] LOG: checkpoint starting: time
2017-10-04 09:22:19.092 EDT [7831] LOG: checkpoint complete: wrote 864467 buffers (6.9%); 0 transaction log file(s) added, 213 removed, 309 recycled; write=258.673 s, sync=1.379 s, total=264.188 s; sync files=672, longest=0.034 s, average=0.002 s; distance=5986866 kB, estimate=10377266 kB
2017-10-04 09:22:54.127 EDT [7831] LOG: checkpoint starting: time
2017-10-04 09:27:22.678 EDT [7831] LOG: checkpoint complete: wrote 962442 buffers (7.6%); 0 transaction log file(s) added, 176 removed, 190 recycled; write=265.684 s, sync=0.603 s, total=268.550 s; sync files=689, longest=0.018 s, average=0.000 s; distance=6278971 kB, estimate=9967437 kB
2017-10-04 09:27:22.679 EDT [7831] LOG: checkpoint starting: xlog
2017-10-04 09:32:08.082 EDT [7831] LOG: checkpoint complete: wrote 1206510 buffers (9.6%); 0 transaction log file(s) added, 200 removed, 183 recycled; write=282.610 s, sync=0.622 s, total=285.402 s; sync files=683, longest=0.013 s, average=0.000 s; distance=7412886 kB, estimate=9711982 kB
2017-10-04 09:32:22.097 EDT [7831] LOG: checkpoint starting: time
2017-10-04 09:36:51.049 EDT [7831] LOG: checkpoint complete: wrote 977391 buffers (7.8%); 0 transaction log file(s) added, 218 removed, 234 recycled; write=264.946 s, sync=0.939 s, total=268.952 s; sync files=671, longest=0.032 s, average=0.001 s; distance=6576530 kB, estimate=9398437 kB
2017-10-04 09:37:09.023 EDT [7831] LOG: checkpoint starting: xlog
2017-10-04 09:41:41.421 EDT [7831] LOG: checkpoint complete: wrote 1051219 buffers (8.4%); 0 transaction log file(s) added, 245 removed, 157 recycled; write=269.147 s, sync=0.996 s, total=272.398 s; sync files=699, longest=0.026 s, average=0.001 s; distance=6699634 kB, estimate=9128556 kB
2017-10-04 09:41:41.423 EDT [7831] LOG: checkpoint starting: xlog
2017-10-04 09:46:36.460 EDT [7831] LOG: checkpoint complete: wrote 1238858 buffers (9.8%); 0 transaction log file(s) added, 260 removed, 149 recycled; write=292.544 s, sync=0.833 s, total=295.037 s; sync files=716, longest=0.033 s, average=0.001 s; distance=7551236 kB, estimate=8970824 kB
2017-10-04 09:46:36.462 EDT [7831] LOG: checkpoint starting: xlog
2017-10-04 09:51:03.839 EDT [7831] LOG: checkpoint complete: wrote 1038125 buffers (8.3%); 0 transaction log file(s) added, 177 removed, 283 recycled; write=263.307 s, sync=0.718 s, total=267.377 s; sync files=708, longest=0.027 s, average=0.001 s; distance=6799399 kB, estimate=8753682 kB
2017-10-04 09:51:36.871 EDT [7831] LOG: checkpoint starting: time
2017-10-04 09:56:30.606 EDT [7831] LOG: checkpoint complete: wrote 913761 buffers (7.3%); 0 transaction log file(s) added, 152 removed, 263 recycled; write=288.352 s, sync=0.933 s, total=293.735 s; sync files=697, longest=0.027 s, average=0.001 s; distance=5756934 kB, estimate=8454007 kB
2017-10-04 09:56:30.607 EDT [7831] LOG: checkpoint starting: xlog
2017-10-04 10:02:44.340 EDT [7831] LOG: checkpoint complete: wrote 1080696 buffers (8.6%); 0 transaction log file(s) added, 189 removed, 163 recycled; write=371.377 s, sync=0.569 s, total=373.732 s; sync files=712, longest=0.014 s, average=0.000 s; distance=6734628 kB, estimate=8282069 kB
2017-10-04 10:02:44.341 EDT [7831] LOG: checkpoint starting: xlog time
2017-10-04 10:10:13.903 EDT [7831] LOG: checkpoint complete: wrote 1200040 buffers (9.5%); 0 transaction log file(s) added, 360 removed, 51 recycled; write=448.221 s, sync=0.760 s, total=449.561 s; sync files=696, longest=0.021 s, average=0.001 s; distance=7387465 kB, estimate=8192609 kB
2017-10-04 10:10:13.905 EDT [7831] LOG: checkpoint starting: xlog time
2017-10-04 10:16:33.351 EDT [7831] LOG: checkpoint complete: wrote 1472230 buffers (11.7%); 0 transaction log file(s) added, 401 removed, 50 recycled; write=377.834 s, sync=0.979 s, total=379.445 s; sync files=700, longest=0.028 s, average=0.001 s; distance=8555349 kB, estimate=8555349 kB
2017-10-04 10:16:33.352 EDT [7831] LOG: checkpoint starting: xlog time
2017-10-04 10:21:58.810 EDT [7831] LOG: checkpoint complete: wrote 1250086 buffers (9.9%); 0 transaction log file(s) added, 378 removed, 144 recycled; write=322.737 s, sync=0.888 s, total=325.457 s; sync files=680, longest=0.031 s, average=0.001 s; distance=7402273 kB, estimate=8440041 kB
2017-10-04 10:21:58.811 EDT [7831] LOG: checkpoint starting: xlog time
2017-10-04 10:27:09.544 EDT [7831] LOG: checkpoint complete: wrote 1171682 buffers (9.3%); 0 transaction log file(s) added, 264 removed, 188 recycled; write=307.983 s, sync=0.796 s, total=310.732 s; sync files=688, longest=0.026 s, average=0.001 s; distance=7018008 kB, estimate=8297838 kB
2017-10-04 10:27:09.546 EDT [7831] LOG: checkpoint starting: xlog time
2017-10-04 10:34:16.507 EDT [7831] LOG: checkpoint complete: wrote 1088387 buffers (8.6%); 0 transaction log file(s) added, 343 removed, 85 recycled; write=425.445 s, sync=0.480 s, total=426.961 s; sync files=670, longest=0.012 s, average=0.000 s; distance=6695430 kB, estimate=8137597 kB

Then, as the activity on the database increases even more, checkpoints start to take more than 10 minutes:

2017-10-04 10:34:16.509 EDT [7831] LOG: checkpoint starting: xlog time
2017-10-04 10:45:29.233 EDT [7831] LOG: checkpoint complete: wrote 1445250 buffers (11.5%); 1 transaction log file(s) added, 409 removed, 0 recycled; write=671.755 s, sync=0.832 s, total=672.724 s; sync files=687, longest=0.039 s, average=0.001 s; distance=8688405 kB, estimate=8688405 kB
2017-10-04 10:45:29.235 EDT [7831] LOG: checkpoint starting: xlog time
2017-10-04 11:04:48.081 EDT [7831] LOG: checkpoint complete: wrote 1929827 buffers (15.3%); 1 transaction log file(s) added, 530 removed, 0 recycled; write=1156.147 s, sync=2.487 s, total=1158.846 s; sync files=704, longest=0.046 s, average=0.003 s; distance=12851319 kB, estimate=12851319 kB

Then more than 20 minutes:

2017-10-04 11:04:48.082 EDT [7831] LOG: checkpoint starting: xlog time
2017-10-04 11:25:55.806 EDT [7831] LOG: checkpoint complete: wrote 2695524 buffers (21.4%); 0 transaction log file(s) added, 784 removed, 0 recycled; write=1257.586 s, sync=10.070 s, total=1267.723 s; sync files=709, longest=0.117 s, average=0.014 s; distance=20710817 kB, estimate=20710817 kB
2017-10-04 11:25:55.807 EDT [7831] LOG: checkpoint starting: xlog time
2017-10-04 11:48:08.366 EDT [7831] LOG: checkpoint complete: wrote 2857107 buffers (22.7%); 0 transaction log file(s) added, 1264 removed, 0 recycled; write=1326.199 s, sync=6.233 s, total=1332.558 s; sync files=726, longest=0.076 s, average=0.008 s; distance=24439493 kB, estimate=24439493 kB
2017-10-04 11:48:08.368 EDT [7831] LOG: checkpoint starting: xlog time
2017-10-04 12:11:39.550 EDT [7831] LOG: checkpoint complete: wrote 3127276 buffers (24.9%); 0 transaction log file(s) added, 1492 removed, 0 recycled; write=1403.735 s, sync=7.293 s, total=1411.182 s; sync files=732, longest=0.071 s, average=0.009 s; distance=27967471 kB, estimate=27967471 kB
2017-10-04 12:11:39.552 EDT [7831] LOG: checkpoint starting: xlog time
2017-10-04 12:32:28.853 EDT [7831] LOG: checkpoint complete: wrote 2823207 buffers (22.4%); 0 transaction log file(s) added, 1707 removed, 0 recycled; write=1236.280 s, sync=12.828 s, total=1249.300 s; sync files=739, longest=0.133 s, average=0.017 s; distance=28321252 kB, estimate=28321252 kB
There are no "checkpoints are occurring too frequently" messages in the logs. Since each checkpoint starts as soon as the previous one ends, i.e. in far less than the 30 seconds the warning is set to, I would have expected warnings, but there are none. Probably the warning measures start-to-start, or end-to-end, not end-to-start. The lack of a warning confused me further and cost me a lot of time trying to figure this out, until I noticed the long durations.

In any case, I still don't fully understand what is happening here. It looks like I'm hitting some hardware limit, but it's not obvious which one. CPU usage is around 20% user and 4% system, and Linux also reports 10-15% iowait, so the machine looks somewhat stressed on the I/O side and less so on the CPU side, but again, I can't be sure.

The longest checkpoint today, at 11:48 AM, took 1411 seconds and wrote 3127276 buffers, so 2216 buffers/second. That's just 18 MB/second, or 9 MB/second on each SSD, from checkpoints. In fact, earlier in the day a shorter checkpoint wrote at 29 MB/second, so I think this larger one could have completed faster.

The sar disk I/O report looked like this during that 1411-second checkpoint:

11:55:01 AM      DEV       tps  rd_sec/s   wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
12:05:01 PM  dev8-16   2478.72  17412.49   64107.22     32.89      7.02      2.83      0.31     77.79
12:05:01 PM   dev8-0   2139.80  10546.02   64153.14     34.91      7.16      3.34      0.35     75.27
12:05:01 PM  dev8-32   2155.46  10881.22   64075.18     34.78      6.84      3.17      0.35     75.08
12:05:01 PM  dev8-48   2497.76  17701.13   64153.14     32.77      7.90      3.16      0.31     77.52
12:05:01 PM   dev9-0      0.00      0.00       0.00      0.00      0.00      0.00      0.00      0.00
12:05:01 PM   dev9-1   5873.69  60126.95  126219.54     31.73      0.00      0.00      0.00      0.00

dev9-1 is the RAID volume, and dev8-0, dev8-16, dev8-32, dev8-48 are the SSDs. So that's around 32 MB/s of writes and 7 MB/s of reads on each SSD. In theory the SSDs should easily handle that level of I/O, so I suspect something else is slowing the checkpoints down. Could there be some inefficiency or bug in the checkpoint process making it slower than it should be? Or have I configured something improperly that triggers this behavior?

Thanks,
Vlad
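P.S. To get a better idea of how the writes are split between the checkpointer, the bgwriter and the backends themselves, I've also started sampling pg_stat_bgwriter every few minutes, roughly like this (just a quick ad-hoc query; diffing two samples taken some minutes apart gives the rates):

SELECT now() AS sample_time,
       checkpoints_timed,       -- checkpoints started because of checkpoint_timeout
       checkpoints_req,         -- checkpoints requested because of WAL volume (or manual CHECKPOINT)
       buffers_checkpoint,      -- buffers written by the checkpointer
       buffers_clean,           -- buffers written by the bgwriter
       maxwritten_clean,        -- times the bgwriter stopped because it hit bgwriter_lru_maxpages
       buffers_backend,         -- buffers written directly by backends
       buffers_backend_fsync    -- fsyncs backends had to do themselves
FROM pg_stat_bgwriter;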