Re: How do I track down a possible locking problem?

Herouth Maoz <herouth@xxxxxxxxxxxxx> · Tue, 18 Feb 2014 17:13:51 +0200

Is there a more appropriate place to ask this question? Or was my question unclear?

I dug some data, and it seems that whenever messages come at a rate of 75,000 per hour, they start picking delays of up to 10 minutes. If I go up to 100,000, delays pick up to about 20 minutes. And for 300,000 in one hour, I get delays of up to 3 hours or so.

Typically, during an hour in which 250,000 messages were processed, around 10% of them are delayed more than 20 minutes, and some for more than an hour.

Can anybody offer any insight? Do any of you get delays in inserts performed at this rate (250,000 per hour), meaning that the insert transaction takes more than 10 minutes? Is it a matter of fine-tuning the server? Note that at the same time there are other processes who also perform updates on the same tables, at about the same rate. So each of the tables gets a write about 500,000 times per hour. The table normally contains around 2-3 million records, and has 3 indexes.

Thank you,
Herouth

On 17/02/2014, at 18:45, Herouth Maoz wrote:

I have a production system using Postgresql 9.1.2.

The system basically receives messages, puts them in a queue, and then several parallel modules, each in its own thread, read from that queue, and perform two inserts, then release the message to the next queue for non-database-related processing.

Today, after we received complaints from a customer about delays, I noticed odd behavior in the system. About 2 percent of the messages were inserted into the tables more than an hour after they got into the system.

The queue never has more than 27,000 messages at the same time, and all together, the parallel modules process about 5000 or 6000 messages per minute. So basically, the delay for a single message should never be more than a few minutes. Even if one module gets stuck, another will grab the next message from the queue. I believe the only way for a message to be stuck for so long would be for it to be grabbed by a module, and then for the database write to be somehow delayed for a hour, although it's merely a simple insert performed with a prepared statement.

The database in production is very busy with millions of writes per hour. Could there be a situation in which a particular connection gets "starved" while other connections are able to run queries without noticeable delay?

How can I truck such locks down? Does anybody have any ideas other than starvation? The system lets me view statistics of how many messages were processed in each modules and the average latency. None of the four modules running has long average latency or low number of messages processes, so I don't think the issue is related to any particular thread in my (Java) system being slow or locked away by the others.

TIA,
Herouth