Running into a strange issue that just popped up on a few servers in my environment, and I was wondering if the community had any insight into what could be causing it.
First, a bit of background: I am running Postgres 10.11 on Windows (but have seen a similar issue on a server running 11.6).
Windows Version:
Major Minor Build Revision
----- ----- ----- --------
10 0 14393 0
I have the following query, which was running in ~2ms on average, suddenly jump to ~25ms on average. The query is called millions of times per day, and there were cases of it taking 20-30 seconds. Below is the EXPLAIN ANALYZE of one such example.
When this issue was occurring, the server was under some CPU pressure, but even so I would not expect the query to get as slow as shown below: we are using SSDs, and none of the Windows disk counters (IOPS, queue length) show any values that would be of concern.
explain (analyze,buffers) SELECT tabledata.uuid_id,tabledata.int_id,tabledata.timestamp_date,tabledata.int_otherid,tabledata.float_value,tabledata.int_otherid2,tabledata.int_otherid3,tabledata.int_rowver
FROM tabledata WHERE timestamp_date <= $1 AND int_otherid3 IN ($2,$3,$4,$5,$6,$7) AND tabledata.int_id=$8 ORDER BY timestamp_date DESC LIMIT 1
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.71..139.67 rows=1 width=84) (actual time=17719.076..17719.077 rows=1 loops=1)
Buffers: shared hit=12102 read=13259 written=111
-> Index Scan Backward using ix_tabledata_intid_timestampdate_intotherid3_intotherid2 on tabledata (cost=0.71..2112828.54 rows=15204 width=84) (actual time=17719.056..17719.057 rows=1 loops=1)
Index Cond: ((int_id = 8149) AND (timestamp_date <= '2020-02-24 03:05:00.013'::timestamp without time zone))
Filter: (int_otherid3 = ANY ('{3ad2b707-a068-42e8-b0f2-6c8570953760,4e1b1bfa-34e1-48df-8cf8-2b59caf076e2,00d394dd-c2f4-4f3a-a8d4-dc208dafa686,baa904a6-8302-4fa3-b8ae-8adce8fe4306,3c99d61b-21a1-42ea-92a8-3cc88d79f3f1,befe0f8b-5911-47b3-bfae-faa9f8b09d08}'::uuid[]))
Rows Removed by Filter: 91686
Buffers: shared hit=12102 read=13259 written=111
Planning time: 203.153 ms
Execution time: 17719.200 ms
(9 rows)
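(For reference, average and worst-case timings like those quoted above can be checked from pg_stat_statements, which is loaded on this server as the stacks below show. A rough sketch of such a check, using the Postgres 10 column names where timings are in milliseconds:

SELECT calls,
       round(mean_time::numeric, 2)  AS mean_ms,   -- per-call average
       round(max_time::numeric, 2)   AS max_ms,    -- worst single call
       round(total_time::numeric, 2) AS total_ms,
       left(query, 80)               AS query
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 20;
)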
If I look at pg_stat_activity while the query is running, all of the calls to this query have the same wait event.
wait_event - DataFileRead
wait_event_type - IO
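(A query along these lines shows this; the filter on the query text is just illustrative:

SELECT pid, state, wait_event_type, wait_event, left(query, 60) AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND query LIKE '%FROM tabledata%';
)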
We took a PerfView trace during the issue, and below is the call stack from a process running this query; two call paths are shown.
---------------------------------------------------------------
Name
ntoskrnl!ExpWaitForSpinLockExclusiveAndAcquire
+ ntoskrnl!MiChargeWsles
|+ ntoskrnl!MiObtainSystemCacheView
||+ ntoskrnl!MmMapViewInSystemCache
|| + ntoskrnl!CcGetVacbMiss
|| + ntoskrnl!CcGetVirtualAddress
|| + ntoskrnl!CcMapAndCopyFromCache
|| + ntoskrnl!CcCopyReadEx
|| + ntfs!NtfsCopyReadA
|| |+ fltmgr!FltpPerformFastIoCall
|| | + fltmgr!FltpPassThroughFastIo
|| | + fltmgr!FltpFastIoRead
|| | + ntoskrnl!NtReadFile
|| | + ntdll!NtReadFile
|| | + kernelbase!ReadFile
|| | + msvcr120!_read_nolock
|| | + msvcr120!_read
|| | + postgres!PathNameOpenFile
|| | + postgres!??mdclose
|| | + postgres!ScheduleBufferTagForWriteback
|| | + postgres!InitBufTable
|| | + postgres!??PrefetchBuffer
|| | |+ postgres!index_getnext_tid
|| | | + postgres!index_fetch_heap
|| | | + postgres!ExecIndexEvalRuntimeKeys
|| | | + postgres!ExecAssignScanProjectionInfoWithVarno
|| | | + postgres!tupledesc_match
|| | | + postgres!recompute_limits
|| | | + postgres!CheckValidRowMarkRel
|| | | + postgres!list_length
|| | | + pg_stat_statements!pgss_ExecutorRun
|| | | + postgres!PortalRunFetch
|| | | + postgres!PortalStart
|| | | + postgres!exec_bind_message
|| | | + postgres!PostgresMain
|| | | + postgres!BackendInitialize
|| | | + postgres!ClosePostmasterPorts
|| | | + postgres!main
|| | | + postgres!_onexit
|| | | + kernel32!BaseThreadInitThunk
|| | | + ntdll!RtlUserThreadStart
Name
ntoskrnl!ExpWaitForSpinLockExclusiveAndAcquire
+ ntoskrnl!MiChargeWsles
|+ ntoskrnl!MiReleaseSystemCacheView
| + ntoskrnl!MmUnmapViewInSystemCache
| + ntoskrnl!CcUnmapVacb
| + ntoskrnl!CcUnmapVacbArray
| + ntoskrnl!CcGetVirtualAddress
| + ntoskrnl!CcMapAndCopyFromCache
| + ntoskrnl!CcCopyReadEx
| + ntfs!NtfsCopyReadA
| |+ fltmgr!FltpPerformFastIoCall
| | + fltmgr!FltpPassThroughFastIo
| | + fltmgr!FltpFastIoRead
| | + ntoskrnl!NtReadFile
| | + ntdll!NtReadFile
| | |+ kernelbase!ReadFile
| | | + msvcr120!_read_nolock
| | | + msvcr120!_read
| | | + postgres!PathNameOpenFile
| | | + postgres!??mdclose
| | | + postgres!ScheduleBufferTagForWriteback
| | | + postgres!InitBufTable
| | | + postgres!??PrefetchBuffer
| | | |+ postgres!index_getnext_tid
| | | | + postgres!index_fetch_heap
| | | | + postgres!ExecIndexEvalRuntimeKeys
| | | | + postgres!ExecAssignScanProjectionInfoWithVarno
| | | | + postgres!tupledesc_match
| | | | + postgres!recompute_limits
| | | | + postgres!CheckValidRowMarkRel
| | | | + postgres!list_length
| | | | + pg_stat_statements!pgss_ExecutorRun
| | | | + postgres!PortalRunFetch
| | | | + postgres!PortalStart
| | | | + postgres!exec_bind_message
| | | | + postgres!PostgresMain
| | | | + postgres!BackendInitialize
| | | | + postgres!ClosePostmasterPorts
| | | | + postgres!main
| | | | + postgres!_onexit
| | | | + kernel32!BaseThreadInitThunk
| | | | + ntdll!RtlUserThreadStart
If I do a top-down view (i.e. starting from when the process started, where did we spend the most time?) I get:
Name
ROOT
+ Process64 postgres (16668) Args: "--forkbackend" "43216"
+ Thread (16672) CPU=9399ms
|+ ntdll!RtlUserThreadStart
||+ kernel32!BaseThreadInitThunk
|| + postgres!_onexit
|| + postgres!main
|| + postgres!ClosePostmasterPorts
|| + postgres!BackendInitialize
|| + postgres!PostgresMain
|| + postgres!exec_bind_message
|| + postgres!PortalStart
|| + postgres!PortalRunFetch
|| + pg_stat_statements!pgss_ExecutorRun
|| + postgres!list_length
|| + postgres!CheckValidRowMarkRel
|| + postgres!recompute_limits
|| + postgres!tupledesc_match
|| + postgres!ExecAssignScanProjectionInfoWithVarno
|| |+ postgres!ExecIndexEvalRuntimeKeys
|| ||+ postgres!index_fetch_heap
|| || + postgres!index_getnext_tid
|| || |+ postgres!??PrefetchBuffer
|| || ||+ postgres!InitBufTable
|| || |||+ postgres!ScheduleBufferTagForWriteback
|| || ||||+ postgres!??mdclose
|| || |||||+ postgres!PathNameOpenFile
|| || ||||||+ msvcr120!_read
|| || |||||| + msvcr120!_read_nolock
|| || |||||| |+ kernelbase!ReadFile
|| || |||||| ||+ ntdll!NtReadFile
|| || |||||| || + ntoskrnl!NtReadFile
|| || |||||| || |+ fltmgr!FltpFastIoRead
|| || |||||| || ||+ fltmgr!FltpPassThroughFastIo
|| || |||||| || |||+ fltmgr!FltpPerformFastIoCall
|| || |||||| || ||||+ ntfs!NtfsCopyReadA
|| || |||||| || |||| + ntoskrnl!CcCopyReadEx
|| || |||||| || |||| |+ ntoskrnl!CcMapAndCopyFromCache
|| || |||||| || |||| | + ntoskrnl!CcGetVirtualAddress
|| || |||||| || |||| | |+ ntoskrnl!CcUnmapVacbArray
|| || |||||| || |||| | ||+ ntoskrnl!CcUnmapVacb
|| || |||||| || |||| | |||+ ntoskrnl!MmUnmapViewInSystemCache
|| || |||||| || |||| | ||| + ntoskrnl!ExAcquireSpinLockExclusive
|| || |||||| || |||| | ||| |+ ntoskrnl!ExpWaitForSpinLockExclusiveAndAcquire
Also, from this same PerfView trace, the following looks to be the checkpointer process waiting on the same lock:
Name
ntoskrnl!ExpWaitForSpinLockExclusiveAndAcquire
+ ntoskrnl!ExAcquireSpinLockExclusive
+ ntoskrnl!MiAcquireProperVm
+ ntoskrnl!MiTrimSharedPageFromViews
+ ntoskrnl!MiTrimSection
+ ntoskrnl!MmTrimSection
+ ntoskrnl!CcCoherencyFlushAndPurgeCache
+ ntfs!NtfsFlushUserStream
+ ntfs!NtfsPerformOptimisticFlush
|+ ntfs!NtfsCommonFlushBuffers
| + ntfs!NtfsCommonFlushBuffersCallout
| + ntoskrnl!KeExpandKernelStackAndCalloutInternal
| + ntfs!NtfsCommonFlushBuffersOnNewStack
| + ntfs!NtfsFsdFlushBuffers
| + fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted
| + fltmgr!FltpDispatch
| + ntoskrnl!IopSynchronousServiceTail
| + ntoskrnl!NtFlushBuffersFileEx
| + ntoskrnl!NtFlushBuffersFile
| + ntdll!NtFlushBuffersFile
| |+ kernelbase!FlushFileBuffers
| | + msvcr120!_commit
| | + postgres!FileClose
| | + postgres!mdtruncate
| | + postgres!??ReleaseBuffer
| | + postgres!CreateCheckPoint
| | + postgres!CheckpointerMain
| | + postgres!AuxiliaryProcessMain
| | + postgres!MaxLivePostmasterChildren
| | + postgres!main
| | + postgres!_onexit
| | + kernel32!BaseThreadInitThunk
| | + ntdll!RtlUserThreadStart
To get by, we increased shared_buffers from 500MB to 50GB on this server (and to 10GB on another server), but in my opinion this is just masking the issue. I was wondering if anyone in the community has seen contention on this lock before, or has any other insights as to why we would suddenly run into this issue?
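(For completeness, the change itself is just the standard shared_buffers bump, e.g. via ALTER SYSTEM or by editing postgresql.conf, followed by a full restart since shared_buffers cannot be changed at runtime:

ALTER SYSTEM SET shared_buffers = '50GB';
-- requires a full server restart to take effect; a reload is not enough
)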
Ben Snaidero
Senior Database Specialist, Geotab
Direct: +1 (289) 230-7749
Toll-free: +1 (877) 436-8221
www.geotab.com