Jay Manni wrote:
The data in the stream is in the form of variable length records with clearly defined fields, so it can be stored in a database or in a file. The only caveat is that the rate of records coming in the stream could be several 1000 records a second.
There are a few limits to be concerned about here, some physical, some related to your application design. A few thousand records per second is possible with either the right software design or some hardware assist. The wall where it gets increasingly difficult to keep up is closer to 10K/second, presuming your records aren't particularly wide.

Some background: when you commit a record in PostgreSQL, by default that transaction doesn't complete until the data has been physically written to disk. If you take a typical 7200 RPM disk, that spins 120 times/second, meaning that even under the best possible conditions there can only be 120 commits per second to physical disk. However, each physical commit can contain more than one record. Here are the common ways to increase the number of records you can insert per second:

1) Batch up inserts. Turn off any auto-commit behavior in your client, insert a bunch of records, issue one COMMIT. Typical popular batch sizes are in the 100-1000 records/commit range. If individual records aren't very wide, you can easily get a huge speedup here. It's hard to estimate how much this will help in your case without knowing more about that width and the speed of your underlying disks; more on that below. (There's a rough sketch of this approach below, after the worked example.)

2) Have multiple clients committing at once. Typically I see this give at most about a 5X speedup, so on a slow disk with single-record commits you might hit 600/s instead of 120/s if you had 10 clients going at once.

3) Use a RAID controller with a battery-backed cache. This will hold multiple disk commits in its cache and dump them onto disk in larger chunks transparently, with only a small risk of corruption if there's an extended power outage longer than the battery lasts. Typically I'll see this increase the commit rate to the 1,000-10,000 commits/second range, again depending on underlying disk speed and row size. This approach really reduces the worst-case behavior disks can get into, which is where you keep seeking between two spots, writing small bits at each one.

4) Turn off synchronous_commit. This lets you adjust the rate at which records get committed into larger chunks without touching your application or hardware. It does introduce the possibility that you might lose some records if there's a crash in the middle of loading or changing things. Adjusting the commit period upwards makes this case look similar to (1): you're basically committing in larger chunks, the app just doesn't know it.

Basically, (2) alone is probably not enough to reach 1,000 per second. But (1) or (3) is, as is (4) if you can accept the potential data integrity issues if there's a crash. If your batches get bigger via any of these techniques, what should end up happening is that you push the bottleneck somewhere else, like disk write or seek speed. Which of those you'll run into depends on how interleaved these writes are with application reads, and on the total disk bandwidth.

To close, here's a quick example showing the sort of analysis you should be doing to estimate this better. Imagine you're trying to write 10,000 records/second and each record is 100 bytes wide. That works out to almost 1MB/s of steady disk writes. In the real world, a single cheap disk can't do much better than this if those writes involve heavy seeking around the disk. And that's what happens in a database, because at a minimum you need to write to both the main database area and the write-ahead log. If your records are 1,000 bytes wide instead, you might hit the upper limit of your disk's seeking capability at only 1,000 records/second.
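To make (1) concrete, here's a minimal sketch of what batched commits could look like from a Python client using psycopg2. The table name, columns, batch size, and connection string are all made up for illustration; adjust them to your actual schema and measure the result:

# Rough sketch of technique (1): accumulate rows, commit in batches.
# The table stream_records(field_a text, field_b integer) and the DSN
# below are hypothetical placeholders.
import psycopg2

BATCH_SIZE = 500   # somewhere in the typical 100-1000 records/commit range

def load_stream(records, dsn="dbname=mydb user=me"):
    """records: any iterable of (field_a, field_b) tuples."""
    conn = psycopg2.connect(dsn)   # psycopg2 leaves autocommit off by default
    try:
        cur = conn.cursor()
        batch = []
        for rec in records:
            batch.append(rec)
            if len(batch) >= BATCH_SIZE:
                cur.executemany(
                    "INSERT INTO stream_records (field_a, field_b) VALUES (%s, %s)",
                    batch)
                conn.commit()      # one physical disk flush per BATCH_SIZE rows
                batch = []
        if batch:                  # flush whatever is left at the end
            cur.executemany(
                "INSERT INTO stream_records (field_a, field_b) VALUES (%s, %s)",
                batch)
            conn.commit()
    finally:
        conn.close()

If you wanted to try (4) instead, the application code wouldn't change at all; you'd set synchronous_commit=off in postgresql.conf (or per session) and let the server do the grouping for you.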
An app that's just writing to a file, on the other hand, wouldn't necessarily need to seek elsewhere regularly. That means it's much more likely you'd hit >10MB/s on writes rather than the 1-2MB/s worst-case behavior when seeking. Of course, as you've already noted, you end up paying penalties on reporting instead if you do that. The best you can do here is to measure your application, estimate or simulate the larger volume, and then see what happens when you apply one or more of these techniques.

--
Greg Smith   greg@xxxxxxxxxxxxxxx   Baltimore, MD