Re: indexes missing

Fabio Pardi <f.pardi@xxxxxxxxxxxx> · Fri, 4 May 2018 12:32:55 +0200

Hi Thomas,

if i recall correctly, index bloat can happen in various situations, a common one on BTREE indexes is when you have sparse deletions on the referred table and the index becomes unbalanced because the deletions are filled in with empty spaces.

As per your question. short answer is: no, but sometimes yes. 

Long answer touches several aspects of the database architecture. 

I'll try to make a soup of it, with the help of my notes in the hope i will make it clearer, and not more obscure... 

Every time you insert a record in the database, also extra internal Postgres records are written, in order to know 'when in time' the record was set and whom it is visible to. Those records are called xmin and xmax. They go together with another record called XID.

When you run a query, any query, you can only see all those records that have Xmin < XID > Xmax

Now, if your record is updated, then internally a new record is placed, with a newer Xmin, Xmax. Note that the old record still exists internally! It tells to Postgres: 'Hey, we have a new occurrence for that record. Do not use the old one'. During normal operations, the next vacuum will get rid of that old useless record.

But, what if a query was running and in the meanwhile you update the table it is running on? Here comes the 'but sometimes yes' part of my answer.

Imagine i m running a query about the number of alive citizens of London. Query starts at 13:00:00. I would expect to know the exact number of citizens at the time the query starts.

What If the query takes 2 minutes to run, and in the meanwhile we have 2 deceased? The already running query will return the number of alive citizens at 13:00:00 while the database will ingest the new records, create 2 new rows about the deceased citizens, with a new xmin. then go to the old rows about the 2 deceased, and update xmax to avoid visibility to the new queries. 

If it sounds complicated in words an example can maybe make it more clear. I m taking an example from the book i have on my shelf, PostgreSQL 9.0 High Performance (by Gregory Smith, a must read in my opinion)

create table old_records(id int PRIMARY KEY,  value varchar);

insert into old_records values (1 , 'abc'); 

Now, we can check xmin and xmax for the current record:

select *,xmin,xmax from old_records ;

 id | value | xmin  | xmax 
----+-------+-------+------
  1 | abc   | 73037 |    0
(1 row)

SELECT * FROM txid_current();
 txid_current 
--------------
        73038
(1 row)

------------
now open a new session:
-- session2 --

begin ;
BEGIN

SELECT * FROM txid_current();
 txid_current 
--------------
        73039
(1 row)

update old_records SET value='def'  where id=1 ;

select *,xmin,xmax from old_records ;
 id | value | xmin  | xmax
----+-------+-------+------
  1 | def   | 73039 |    0

---------------

while if you go back to session 1:

portavita=# select *,xmin,xmax from old_records ;
 id | value | xmin  | xmax  
----+-------+-------+-------
  1 | abc   | 73037 | 73039

---------------

This means that the database has 2 different occurrences of the record id=1 and those are available to 2 different sessions in 2 different ways.

Once you will run 'end;' in session 2, then the old record will disappear from the visibility, and Xmax will be reset
.

Next vacuum will then get rid of the old occurrences, not needed any longer.

Hope it helps.

regards,

fabio pardi

On 03/05/18 20:17, Thomas Poty wrote:
> Thanks Fabio for your useful advice. 
> About index bloat and bloat in general, if i am right it is due to mvcc. So my question is : is there any way to retrieve old data? 
> 
> Thank you 
> 
> Thomas 
> 
> Le lun. 23 avr. 2018 à 17:22, Fabio Pardi <f.pardi@xxxxxxxxxxxx <mailto:f.pardi@xxxxxxxxxxxx>> a écrit :
> 
>     Hi Thomas,
> 
>     I usually also take a look to 'Unused Indexes' and 'Index Bloat' too.
> 
> 
>     Unused indexes:
> 
>     after some time you use your database, you should be able to identify indexes that are never used.
> 
>     Is good to have in your database only indexes you use, given the impact of indexes into your db (space, time to write data where indexes are present), and remove the unused ones.
> 
> 
> 
>     Index bloat:
> 
>     https://wiki.postgresql.org/wiki/Index_Maintenance#Index_Bloat
> 
> 
> 
>     On the same wiki page you can read more about indexes maintenance, which might clarify some other doubt you have.
> 
> 
>     About your question on missing indexes, I m not sure what to answer.
>     I think it depends a lot on your setup and how data is queried, eg: if my memory serves me well, sequential scan can be chosen by the query plan in some cases, even where an index is present.
> 
>     A way I use to tackle slow queries is to periodically analyse statistics and to have slow queries reported in the logs (eg: queries slower than X ms)
> 
> 
>     regards,
> 
>     fabio pardi
> 
> 
> 
>     On 04/23/2018 12:22 PM, Thomas Poty wrote:
>     > hello World,
>     >
>     > Soon, Il will have to "monitor" indexes of a database in production...
>     > I know pg_stattuple may help with this job. I also have several query to help me :
>     >
>     >
>     > This query indicates if an index is invalid :
>     >
>     > select ir.relname    as indexname,
>     >    it.relname        as tablename,
>     >    n.nspname         as schemaname
>     > from pg_index i
>     >    join pg_class ir on ir.oid = i.indexrelid
>     >    join pg_class it on it.oid = i.inderelid
>     >    join pg_namespace n on n.oid = it.relnamespace
>     > where not i.indisvalid;
>     >
>     >
>     >
>     > This one indicates if there are duplicated indexes
>     >
>     >
>     >
>     > select il.table_name,
>     >     il.index_columns,
>     >     array_agg(il.index_name)    as implied_indexes_name
>     > from (
>     >         select
>     >             distinct(pct.relname,pci.relname,pi.indkey) as key,
>     >             pct.oid        as table_oid,
>     >             pct.relname    as table_name,
>     >             pci.relname    as index_name,
>     >             pi.indkey      as index_columns
>     >         from pg_index pi
>     >             join pg_class pci
>     >                 on pi.indexrelid=pci.oid
>     >             join pg_class pct
>     >                 on pi.indrelid=pct.oid
>     >             join pg_attribute pa
>     >                 on pa.attrelid=pct.oid
>     >         where pct.relkind='r'
>     >             and pa.attnum=any(pi.indkey)
>     >      ) il
>     > group by il.table_name, il.index_columns
>     > having count(*)>1;
>     >
>     >
>     > I have found this one but i am not sure if it is technically correct :
>     > - the table must be greater then 100 kB
>     > - the way of a "missing index" is calculated ( Can i have your opinion?)
>     >
>     > SELECT  relname             AS TableName,
>     >         seq_scan-idx_scan   AS TotalSeqScan,
>     >         CASE WHEN seq_scan-idx_scan > 0
>     >             THEN 'Missing Index Found'
>     >             ELSE 'Missing Index Not Found'
>     >         END                 AS MissingIndex,
>     >         pg_size_pretty(pg_relation_size(concat(schemaname,'.',relname)::regclass)) AS TableSize,
>     >         idx_scan            AS TotalIndexScan
>     > FROM pg_stat_all_tables
>     > WHERE schemaname !~'pg_catalog|pg_temp'
>     >     AND pg_relation_size(concat(schemaname,'.',relname)::regclass)>100000
>     > ORDER BY 2 DESC;
>     >
>     >
>     > Is there any others stuffs to keep an eye?
>     > Is there any remarks about my queries?
>     >
>     > Thanks a lot,
>     >
>     > Thomas
>