Back to articles list
- 4 minutes read

How to delete old records from a database while retaining some history

If the amount of data in your temporary table is growing very fast, you may want to expunge old unnecessary data from your database.



The scenario is like this: there is a table backup_data which contains backup data for various resources. The amount of data in the table is growing very fast, so you want to delete old unnecessary data. However, you still want to keep a little bit of historical data. The business rules for keeping the backups for every resource are a variation of the grandfather-father-son algorithm:

  • keep all backups from the last 24 hours,
  • from the week before the last 24 hours keep the latest hourly data,
  • from the month (30 days) before this week keep the latest daily data,
  • keep the latest weekly data from before the last month.

The picture shows the rules a little better.

Grandfather-father-son algorithm

A tempting solution might be to do all this programmatically, from your code. I will show you a pure SQL solution, which uses some Postgres SQL additions.

There will be three stored procedures: delete_backup_data_week, delete_backup_data_month and delete_backup_data_antique, respectively deleting versions dating from last week, last month, and older than last month. The procedures have to be invoked periodically using cron or a similar solution.

The function delete_backup_data_week looks like this:

CREATE OR REPLACE FUNCTION delete_backup_data_week(resourceid integer) 
RETURNS TABLE(backup_data_id int4) AS 
$$ 
BEGIN 
  RETURN QUERY 
    delete from backup_data ed 
      where resource_id = $1 
      and creation_date 
        between (current_timestamp - interval '8 day')
        and (current_timestamp - interval '24 hour') 
      and id not in 
        (select id 
         from (select distinct on (creation_date_hour)
                 id, creation_date, 
                 date_trunc('hour', creation_date) creation_date_hour 
               from backup_data 
               where resource_id = $1 
               and creation_date 
                 between (current_timestamp - interval '8 day')
                 and (current_timestamp - interval '24 hour') 
               order by creation_date_hour, creation_date desc)
               as first_in_hour_versions) 
      returning id; 
END 
$$ LANGUAGE 'plpgsql'; 

The procedure deletes old unnecessary records for a particular resource. The crucial part of the procedure is this subquery, selecting the latest version in every hour.

select distinct on (creation_date_hour) id, creation_date, date_trunc('hour', creation_date) creation_date_hour 
from backup_data 
where resource_id = $1 
  and creation_date 
    between (current_timestamp - interval '8 day')
    and (current_timestamp - interval '24 hour') 
order by creation_date_hour, creation_date desc)
as first_in_hour_versions;

We select the data associated with the chosen resource and created in the week preceding the last 24 hours (I have already marvelled how nice PostgreSQL date arithmetic is).

...
where resource_id = $1  
  and creation_date 
    between (current_timestamp - interval '8 day') 
    and (current_timestamp - interval '24 hour') 
...

We truncate the creation_date to the hour.

select ... date_trunc('hour', creation_date) creation_date_hour ...

We order the records by the truncated creation_date and then, among versions with the same truncated hour, by creation_date descending: the newest records are first.

...
order by creation_date_hour, creation_date desc)
as first_in_hour_versions

Finally, we select the first row in each group with the same creation_date_hour.

select distinct on (creation_date_hour)

Each part of this query is important. It’s especially important to order the records by creation_date descending. If we forget the ordering, the select distinct on (creation_date_hour) selects an unpredictable row.

The delete query is straightforward. Delete rows between the chosen dates, with ids NOT in first_in_hour_versions. Don’t forget to filter the rows by the creation date again, the inner query filtering is not working in the outer DELETE query.

delete from backup_data 
where resource_id = $1 
  and creation_date 
    between (current_timestamp - interval '8 day')
    and (current_timestamp - interval '24 hour') 
  and id not in (...) as first_in_hour_versions) 
returning id;

I use the RETURNING clause, because I want to log the ids of deleted rows in the code invoking the procedure. The delete query is wrapped in a stored procedure.

I think the complete solution is very nice. I haven’t expected such an elegant solution when I started working on the problem. In fact, I’m still surprised that a pure SQL solution is possible, let alone pretty. What do you think about the code?

Appendix: the remaining procedures

For completion, I also show the functions delete_backup_data_month and delete_backup_data_antique. In the procedure delete_backup_data_antique I truncate the creation to the week, another very nice Postgres feature I already talked about.

CREATE OR REPLACE FUNCTION delete_backup_data_month(resourceid integer) 
RETURNS TABLE(backup_data_id int4) AS 
$$ 
BEGIN 
  RETURN QUERY 
    delete from backup_data
    where resource_id = $1 
      and creation_date 
        between (current_timestamp - interval '38 day')
        and (current_timestamp - interval '8 day') 
      and id not in 
        (select id 
         from (select distinct on (creation_date_day)
                 id, creation_date, date_trunc('day', creation_date) creation_date_day 
               from backup_data 
               where resource_id = $1 
                 and creation_date 
                   between (current_timestamp - interval '38 day')
                   and (current_timestamp - interval '8 day') 
               order by creation_date_day, creation_date desc)
               as first_in_day_versions) 
     returning id; 
END 
$$ LANGUAGE 'plpgsql';

CREATE OR REPLACE FUNCTION delete_backup_data_antique(resourceid integer) 
RETURNS TABLE(backup_data_id int4) AS 
$$ 
BEGIN 
  RETURN QUERY 
    delete from backup_data 
    where resource_id = $1 
      and creation_date < (current_timestamp - interval '38 day') 
      and id not in 
       (select id 
        from (select distinct on (creation_date_week) 
                id, creation_date, date_trunc('week', creation_date) creation_date_week 
              from backup_data
              where resource_id = $1 
                and creation_date < (current_timestamp - interval '38 day') 
              order by creation_date_week, creation_date desc)
              as first_in_week_versions) 
    returning id; 
END 
$$ LANGUAGE 'plpgsql'; 

go to top