Best way to delete a large number of records from a big table


 



Hello list,

I'm trying to clean up a database with millions of records of useless-but-don't-remove-just-in-case data. All tables live in the public schema, so I've created a new schema "old_data" and I'm moving the data there. Several tables with around 20 million records each were cleaned up relatively quickly without special effort (no need to drop indexes or constraints). Those were easy because they are going to be emptied completely (I have to keep the tables themselves), so I only have to insert the data into old_data.new_table and then TRUNCATE ... CASCADE.
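
For illustration, the pattern I'm using for those tables is roughly the following (the table name is just a placeholder, not one of the real tables):

BEGIN;
-- copy the table structure (defaults included) into the archive schema
CREATE TABLE old_data.some_table (LIKE public.some_table INCLUDING DEFAULTS);
-- move all the data, then empty the original table and anything referencing it
INSERT INTO old_data.some_table SELECT * FROM public.some_table;
TRUNCATE TABLE public.some_table CASCADE;
COMMIT;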

But I'm also cleaning tables with 150 million records where I'm going to remove about 60% of the existing data, and after a few tests I'm not sure what the best approach is, as all of them seem to take roughly the same time to run. These tables are organised in groups of four: master, detail, master_history and detail_history. None of the tables have a primary key, a foreign key or any other constraint; there is only the sequence feeding what should be the PK column, but that column is not actually defined as a PK.

I've decided to delete from the last one in chunks (10 days of data per chunk, but it could be any other size), so I've created a function. I've tested it with indexes (on master_history for filtering the data and on detail_history for what would be the FK and PK), without indexes, and after analyzing the table, and no matter what I always end up with more or less the same execution time. I can afford the time it's taking to run, but I'd like to know if there is a better way to do this. I'm testing on version 9.2 BUT the production server is 8.4 (legacy application, supposedly on at least 9.2, but we recently discovered it is 8.4; an upgrade is planned, but not now). Config parameters are the defaults.

Table definition:

CREATE TABLE master (

  id serial NOT NULL,
  device_id int4 NOT NULL,
  col1 int4 NULL DEFAULT 0,
  data_date bpchar(17) NULL, -- field to filter data
  data_file_date bpchar(14) NULL
); -- 9 of 20 records to be removed

CREATE TABLE detail (
  id serial NOT NULL,
  parent_id int4 NOT NULL,
  col1 float8 NULL,
  col2 int4 NOT NULL
); -- 2304 of 5120 records to be removed

CREATE TABLE master_history (
  id serial NOT NULL,
  device_id int4 NOT NULL,
  col1 int4 NULL DEFAULT 0,
  data_date bpchar(17) NULL, -- field to filter data
  data_file_date bpchar(14) NULL
);  --355687 of 586999 records to be removed

CREATE TABLE detail_history (
  id serial NOT NULL,
  parent_id int4 NOT NULL,
  col1 float8 NULL,
  col2 int4 NOT NULL
); -- 91055872 of 150271744 records to be removed
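
For the tests mentioned above, the indexes I tried are along these lines (index names and exact column choices are just my illustration of what was described, not definitions copied from the real schema):

-- index on master_history to filter by date range
CREATE INDEX master_history_data_date_idx ON master_history (data_date);
-- indexes on detail_history for what would be the FK and PK columns
CREATE INDEX detail_history_parent_id_idx ON detail_history (parent_id);
CREATE INDEX detail_history_id_idx ON detail_history (id);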


And the function:

CREATE or replace FUNCTION delete_test() RETURNS integer AS $$
DECLARE
    _begin_date date;
    _end_date date := '2019-08-01';
    _begin_exec timestamp := clock_timestamp();
    _end_exec timestamp ;
    _begin_exec_partial timestamp;
    _end_exec_partial timestamp;
    _time double precision;
    _num_regs integer;
BEGIN
    for _begin_date in (select '2018-05-01'::date + s.a * '10 days'::interval from (select generate_series(0,1000) as a) as s)
    loop
        if (_begin_date > _end_date) then
            raise log 'STOP!!!!!';
            exit;
        end if;
        raise log 'Date %', _begin_date;
        _begin_exec_partial := clock_timestamp();
        delete from public.detail_history t1
          where exists
            (select 1 from public.master_history t2
              where t2.id = t1.parent_id
                and t2.data_date >= rpad(to_char(_begin_date, 'YYYYMMDD'), 17, '0')
                and t2.data_date < rpad(to_char((_begin_date + interval '10 days'), 'YYYYMMDD'), 17, '0'));
        GET DIAGNOSTICS _num_regs = ROW_COUNT;
        _end_exec_partial := clock_timestamp();
        _time := 1000 * ( extract(epoch from _end_exec_partial) - extract(epoch from _begin_exec_partial) );
        raise log 'Records removed % in % ms', _num_regs, _time;

    end loop;

    _end_exec := clock_timestamp();
    _time := 1000 * ( extract(epoch from _end_exec) - extract(epoch from _begin_exec) );
    raise log 'Total time: %', _time;
    return 0;
END;
$$ LANGUAGE plpgsql;
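
A simple way to run it and see the RAISE LOG output directly in the session (rather than only in the server log) is to lower client_min_messages first:

SET client_min_messages TO log;
SELECT delete_test();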


Delete execution plan in 8.4 is:

test_eka=# explain delete from public.detail_hist t1
test_eka-#   where exists
test_eka-#     (select 1 from public.master_hist t2
test_eka(#       where t2.id = t1.parent_id
test_eka(#         and t2.data_date >= '20180501000000000000000'
test_eka(#         and t2.data_date < '20190101000000000000000');
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=33431.46..5890182.88 rows=156649104 width=6)
   Hash Cond: (t1.parent_id = t2.id)
   ->  Seq Scan on detail_hist t1  (cost=0.00..2564256.04 rows=156649104 width=10)
   ->  Hash  (cost=30922.13..30922.13 rows=152906 width=4)
         ->  Unique  (cost=30157.60..30922.13 rows=152906 width=4)
               ->  Sort  (cost=30157.60..30539.87 rows=152906 width=4)
                     Sort Key: t2.id
                      ->  Seq Scan on master_hist t2  (cost=0.00..14897.65 rows=152906 width=4)
                            Filter: ((data_date >= '20180501000000000000000'::bpchar) AND (data_date < '20190101000000000000000'::bpchar))


After creating the PK and FK (with an index on the FK column):

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=26678.41..5883424.77 rows=156648960 width=6)
   Hash Cond: (t1.id_param_espec_este = t2.id_param_espec_este_historico)
   ->  Seq Scan on param_espec_este_datos_historico_tbl t1 (cost=0.00..2564254.60 rows=156648960 width=10)
   ->  Hash  (cost=24169.09..24169.09 rows=152906 width=4)
         ->  Unique  (cost=23404.56..24169.09 rows=152906 width=4)
               ->  Sort  (cost=23404.56..23786.82 rows=152906 width=4)
                     Sort Key: t2.id_param_espec_este_historico
                      ->  Index Scan using fecha_gps_pe_este_hist_idx on param_espec_este_historico_tbl t2  (cost=0.00..8144.60 rows=152906 width=4)
                            Index Cond: ((fecha_gps_parametros_espectrales >= '20180501000000000000000'::bpchar) AND (fecha_gps_parametros_espectrales < '20190101000000000000000'::bpchar))


Any ideas are welcome.

Kind regards,

Ekaterina.






