Updating Rows in Postgres: Alternative Approach to Avoid Duplicated Lines in Aggregated Updates

Postgres: Updating rows based on aggregated value without returning duplicated lines

In this article, we will explore a common use case for updating rows in a PostgreSQL table based on an aggregated value. The scenario involves selecting rows from the same hour, locking them, and then updating another column while setting a specific value.

Understanding the Problem

To start with, let’s break down the problem at hand. We have a table some_table containing columns like timestamp, id, and others. Our goal is to update the started column in these rows by setting it to the current timestamp (now()) while ensuring that only one group of rows is updated per hour.

Initial Solution

The initial solution provided in the question uses a subquery with aggregation to achieve this. Here’s a breakdown of how it works:

with subq as (SELECT some_column, array_agg(timestamp) as timestamps, array_agg(id) as ids, hour FROM (
    SELECT *, date_trunc('hour', timestamp) as hour from some_table FOR UPDATE
) as inner_table GROUP BY some_column, hour LIMIT 1)
UPDATE some_table SET started = now() FROM subq WHERE some_table.id = any(subq.ids) 
RETURNING subq.some_column, subq.timestamps;

The query works as follows:

  • It first selects the some_column, aggregated timestamps (array_agg(timestamp)), and aggregated ids (array_agg(id)) for each group of rows from the same hour. The aggregation is done over these groups.
  • Then it updates the started column in the original table by setting it to the current timestamp (now()) for each row that exists in the ids array returned by the subquery.

Issue with Initial Solution

However, upon executing this query, we notice an issue - the RETURNING clause returns one line per updated line. Since we’re only referencing subq in this clause, these lines appear to be duplicated.

This behavior occurs because PostgreSQL treats each row in the result set returned by a subquery as a separate entity, even if they share the same values (i.e., belong to the same group). When you use an aggregation function (array_agg) on these groups and then update based on this aggregated column (ids), the RETURNING clause returns one row for each value in the aggregated array. This is why we see duplicated lines when executing the query.

Alternative Solution

The alternative solution provided wraps the update inside another subquery, returning only the original subquery instead of using a RETURNING clause:

with subq as (SELECT some_column, array_agg(timestamp) as timestamps, array_agg(id) as ids, hour FROM (
    SELECT *, date_trunc('hour', timestamp) as hour from some_table FOR UPDATE
) as inner_table GROUP BY some_column, hour LIMIT 1),
update_subq as (
    UPDATE some_table SET started = now() FROM subq WHERE some_table.id = any(subq.ids) 
)
SELECT * FROM subq;

Here’s a breakdown of how this alternative solution works:

  • The outer query (with update_subq...) is similar to the initial solution. It first creates a subquery (subq) that aggregates rows by hour, just like before.
  • The inner query (update_subq) executes the same update operation as in the initial solution but without using a RETURNING clause.

Why Does This Work?

This alternative solution works because it doesn’t rely on the RETURNING clause to retrieve data from the updated rows. Instead, it explicitly selects all columns (*) from the subq subquery at the end of the query plan.

By doing so, PostgreSQL is able to determine that there’s only one group being returned (because we’ve already limited it by hour in the outer query), allowing us to avoid duplicated lines when executing this alternative solution.

Best Practices and Alternative Approaches

While both solutions work, the approach in the alternative solution provides more control over the data retrieval process.

  • Using SELECT * FROM subq at the end ensures that only one group is returned, eliminating the need for aggregation and limiting it to a single row per hour.
  • This approach might be preferable when you want finer-grained control over which columns are returned from the update.

However, if you’re comfortable with aggregating data within your query (for instance, if you want to calculate additional statistics), both solutions will work just fine.

Conclusion

In this article, we explored a scenario where we needed to update rows in PostgreSQL based on an aggregated value while avoiding duplicate returned lines. We examined two approaches: the initial solution that uses RETURNING and another alternative approach that wraps the update operation within a subquery.

We learned about how PostgreSQL handles aggregations with aggregation functions like array_agg, leading to duplicated results when used in conjunction with RETURNING.


Last modified on 2024-05-23