Slow Query with Derived Table in FROM Clause: A Deep Dive
In this article, we look at a common SQL performance problem with derived tables in the FROM clause. The example query defines its intermediate results as common table expressions (CTEs), which the optimizer handles much like derived tables: each is a named subquery whose result is joined in the FROM clause of the outer query. We explore why such a query can be slow, walk through its execution plan, and discuss ways to optimize it.
Background
Derived tables simplify complex queries by breaking them into smaller, more manageable pieces. The cost is that the optimizer cannot always merge a derived table into the outer query. When it cannot, typically because the derived table uses DISTINCT, GROUP BY, HAVING, or LIMIT, it first materializes the result into an internal temporary table, and that table has no persistent indexes for the joins that follow.
The rest of this article works through one such query and discusses how to optimize it.
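To make the terminology concrete, the sketch below writes the same intermediate result twice, once as a derived table and once as a CTE. It reuses the locations table and its type column from the query discussed in this article; the filter itself is only an illustration. On servers that support WITH (MySQL 8.0+ and MariaDB 10.2+), the two forms are generally planned the same way.

```sql
-- Derived table: the subquery sits directly in the FROM clause.
SELECT d.id
FROM (
    SELECT id
    FROM locations
    WHERE type = 'V'
) AS d;

-- CTE: the same subquery, named up front with WITH.
WITH d AS (
    SELECT id
    FROM locations
    WHERE type = 'V'
)
SELECT d.id
FROM d;
```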
The Problem
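The original query finds the locations within roughly 6 km of a given city. A bounding box of 6000 m on each side of the origin narrows the candidates, and the HAVING clause then keeps those whose computed distance is under 7000 m: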
WITH location_distances (loc_id, distance) AS (
    SELECT dest.id AS loc_id,
           ROUND(1000 * 6371.03 * 2 * ASIN(SQRT(
               POWER(SIN((orig.latitude - ABS(dest.latitude)) * PI()/180 / 2), 2)
               + COS(orig.latitude * PI()/180) * COS(ABS(dest.latitude) * PI()/180)
               * POWER(SIN((orig.longitude - dest.longitude) * PI()/180 / 2), 2)
           ))) AS distance
    FROM locations orig,
         locations dest
    WHERE orig.id = 14861
      AND (dest.type='V' OR dest.type='A')
      AND dest.latitude BETWEEN orig.latitude - (6000 / 1000 / 111.045) AND orig.latitude + (6000 / 1000 / 111.045)
      AND dest.longitude BETWEEN orig.longitude - (6000 / 1000 / (111.045 * COS(RADIANS(orig.latitude)))) AND orig.longitude + (6000 / 1000 / (111.045 * COS(RADIANS(orig.latitude))))
    HAVING distance < 7000
),
location_distances_hierarchy (parent_loc_id, loc_id, distance) AS (
    SELECT DISTINCT ll.location_id, ld.loc_id, ld.distance
    FROM locations_locations ll,
         location_distances ld
    WHERE ll.rel_loc_id = ld.loc_id
)
SELECT c.id AS cours_id,
       GROUP_CONCAT(DISTINCT CONCAT(cl.type,'-',cl.home,'-',cl.e,'-',ldh.distance,'-',ldh.loc_id)) AS all_loc
FROM cours2 c
JOIN cours_locations cl ON cl.domain = '4' AND c.id = cl.cours_id
JOIN location_distances_hierarchy ldh ON cl.location_id = ldh.parent_loc_id
WHERE c.active_today = '1'
  AND c.subject_id = 404
GROUP BY c.id;
The query defines two CTEs, location_distances and location_distances_hierarchy. The first computes the great-circle (haversine) distance from the origin location to each nearby location. The second maps each of those locations to its parent locations through the locations_locations table.
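The BETWEEN conditions in location_distances act as a bounding-box prefilter: one degree of latitude is about 111.045 km, so 6 km is roughly 0.054 degrees, and a degree of longitude shrinks by the cosine of the latitude. The sketch below evaluates both half-widths for a hypothetical origin latitude of 48.85 degrees, a value chosen only for illustration, to show how narrow the search window is before the exact distance is computed.

```sql
-- Assumption: 48.85 is a made-up origin latitude, not a value from the article.
SET @origin_lat = 48.85;
SET @radius_km  = 6000 / 1000;  -- same 6 km radius as the query

SELECT @radius_km / 111.045                               AS lat_half_width_deg,
       @radius_km / (111.045 * COS(RADIANS(@origin_lat))) AS lon_half_width_deg;
```

Because the box is wider than the circle it encloses, the HAVING clause is still needed to discard candidates near the corners.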
Performance Issue
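Running EXPLAIN on the query shows that the CTE location_distances_hierarchy is materialized rather than merged into the outer query. The plan below appears to come from a fuller version of the query, since it also lists tables aliased ss and caf that the simplified listing above omits: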
|id|select_type|table|type|possible_keys|key|key_len|ref|rows|Extra|
|1|PRIMARY|ss|ref|PRIMARY,subject_id|PRIMARY|4|const|1|Using index; Using temporary; Using filesort|
|1|PRIMARY|c|ref|ID,cours2_active_today_id,cours2_active_today_domain_display_home_update_id,cours2_active_today_subject_id_id,cours2_active_today_subject_id_domain_lang_id,cours2_active_today_subject_id_lang_publish_end_id,cours2_active_today_lang_priv_loc_MOVE_subject_id_id,cours2_active_today_lang_subject_id_id,cours2_active_today_lang_priv_loc_ADR_subject_id_id,cours2_active_today_lang_priv_loc_WEBCAM_subject_id_id,cours2_subject_id_lang_id_index|cours2_active_today_subject_id_domain_lang_id|5|const,trouver-un-cours.ss.rel_subject_id|30|Using where; Using index|
|1|PRIMARY|caf|eq_ref|PRIMARY|PRIMARY|26|trouver-un-cours.c.id|1|""|
|1|PRIMARY|cl|ref|PRIMARY,domain,location_id,cours_locations_domain_home_cours_id_index|PRIMARY|27|trouver-un-cours.c.id,const|2|Using where; Using index|
|1|PRIMARY|<derived3>|ref|key0|key0|4|trouver-un-cours.cl.location_id|10|Using where|
|3|DERIVED|<derived2>|ALL|||||1174|Using temporary|
|3|DERIVED|ll|ref|rel_loc_id|rel_loc_id|4|ld.loc_id|7|""|
|2|DERIVED|orig|const|PRIMARY|PRIMARY|4|const|1|""|
|2|DERIVED|dest|range|locations_type_D_id,locations_type_longitude_latitude|locations_type_longitude_latitude|9||1174|Using index condition; Using where|
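The rows marked DERIVED show where the work goes. Here <derived2> is location_distances and <derived3> is location_distances_hierarchy:
- Building <derived3> starts with a full scan (type ALL) of the roughly 1174 rows estimated for <derived2> and needs a temporary table (Using temporary) to apply DISTINCT.
- The outer query joins cl to <derived3> through key0, an index the optimizer generates on the materialized result at run time, so the whole derived table has to be built before the first lookup.
- The rel_loc_id index on locations_locations is used (row ll, type ref), so the join inside the CTE is not the bottleneck.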
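Two quick checks help when reading a plan like this one. The statements below are standard MySQL and MariaDB commands; the only assumption is that they run against the same schema as the query above. The first shows whether derived-table merging is enabled, and the second lists the indexes available on locations_locations.

```sql
-- Look for derived_merge=on (and, on MariaDB, derived_with_keys=on)
-- in the returned comma-separated list of flags.
SELECT @@optimizer_switch;

-- Lists the indexes on the join table, including the one on rel_loc_id.
SHOW INDEX FROM locations_locations;
```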
Solutions
To optimize the performance of this query, we can implement the following solutions:
Solution 1: Using Materialized Views
MySQL and MariaDB have no native materialized views, so the closest equivalent is an ordinary table that stores the precomputed result of location_distances_hierarchy. A CTE exists only inside the statement that defines it, so the CREATE TABLE statement has to contain the distance subquery itself:
CREATE TABLE location_distances_hierarchy_materialized AS
SELECT DISTINCT ll.location_id AS parent_loc_id, ld.loc_id, ld.distance
FROM locations_locations ll
JOIN (
    SELECT dest.id AS loc_id,
           ROUND(1000 * 6371.03 * 2 * ASIN(SQRT(
               POWER(SIN((orig.latitude - ABS(dest.latitude)) * PI()/180 / 2), 2)
               + COS(orig.latitude * PI()/180) * COS(ABS(dest.latitude) * PI()/180)
               * POWER(SIN((orig.longitude - dest.longitude) * PI()/180 / 2), 2)
           ))) AS distance
    FROM locations orig,
         locations dest
    WHERE orig.id = 14861
      AND (dest.type='V' OR dest.type='A')
      AND dest.latitude BETWEEN orig.latitude - (6000 / 1000 / 111.045) AND orig.latitude + (6000 / 1000 / 111.045)
      AND dest.longitude BETWEEN orig.longitude - (6000 / 1000 / (111.045 * COS(RADIANS(orig.latitude)))) AND orig.longitude + (6000 / 1000 / (111.045 * COS(RADIANS(orig.latitude))))
    HAVING distance < 7000
) ld ON ll.rel_loc_id = ld.loc_id;
CREATE TABLE ... AS SELECT already fills the table, so no separate INSERT is needed. The table holds results for origin 14861 only and has to be rebuilt whenever the origin or the location data changes.
Then, we can modify the original query to read from the precomputed table:
SELECT c.id AS cours_id,
       GROUP_CONCAT(DISTINCT CONCAT(cl.type,'-',cl.home,'-',cl.e,'-',ldh.distance,'-',ldh.loc_id)) AS all_loc
FROM cours2 c
JOIN cours_locations cl ON cl.domain = '4' AND c.id = cl.cours_id
JOIN location_distances_hierarchy_materialized ldh ON cl.location_id = ldh.parent_loc_id
WHERE c.active_today = '1'
  AND c.subject_id = 404
GROUP BY c.id;
This replaces the temporary result set that was built on every execution with a table that is computed once.
Solution 2: Using Indexed Views
Indexed views are a SQL Server feature. MySQL and MariaDB cannot create an index on a view, and a view defined with DISTINCT does not accept INSERT, so the equivalent step here is to index the precomputed table from Solution 1 on its join column:
CREATE INDEX idx_ldh_parent_loc_id ON location_distances_hierarchy_materialized (parent_loc_id);
The final query is the same as in Solution 1. With the index in place, the join on ldh.parent_loc_id can be resolved by index lookups instead of a scan of the precomputed table.
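A precomputed table goes stale when the location data changes. One common way to refresh it without exposing a half-filled table is to build a copy and swap it in with RENAME TABLE, which MySQL and MariaDB execute atomically. The sketch below assumes that location_distances_hierarchy_materialized already exists; the _new and _old table names are hypothetical.

```sql
-- 1. Create an empty copy with the same columns and indexes.
CREATE TABLE location_distances_hierarchy_new
    LIKE location_distances_hierarchy_materialized;

-- 2. Fill the copy by re-running the SELECT that originally populated the
--    table, as INSERT INTO location_distances_hierarchy_new SELECT ...
--    (left out here to avoid repeating the distance query).

-- 3. Swap the tables in one atomic step, then drop the stale copy.
RENAME TABLE location_distances_hierarchy_materialized TO location_distances_hierarchy_old,
             location_distances_hierarchy_new          TO location_distances_hierarchy_materialized;
DROP TABLE location_distances_hierarchy_old;
```

Queries that run during the refresh keep reading the old data until the rename completes.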
Solution 3: Using Derived Tables with Joins
A third option keeps everything in a single statement. The query below is the original with one change: location_distances_hierarchy no longer uses DISTINCT. The CTE is still joined to cours2 and cours_locations as before:
WITH location_distances (loc_id, distance) AS (
    SELECT dest.id AS loc_id,
           ROUND(1000 * 6371.03 * 2 * ASIN(SQRT(
               POWER(SIN((orig.latitude - ABS(dest.latitude)) * PI()/180 / 2), 2)
               + COS(orig.latitude * PI()/180) * COS(ABS(dest.latitude) * PI()/180)
               * POWER(SIN((orig.longitude - dest.longitude) * PI()/180 / 2), 2)
           ))) AS distance
    FROM locations orig,
         locations dest
    WHERE orig.id = 14861
      AND (dest.type='V' OR dest.type='A')
      AND dest.latitude BETWEEN orig.latitude - (6000 / 1000 / 111.045) AND orig.latitude + (6000 / 1000 / 111.045)
      AND dest.longitude BETWEEN orig.longitude - (6000 / 1000 / (111.045 * COS(RADIANS(orig.latitude)))) AND orig.longitude + (6000 / 1000 / (111.045 * COS(RADIANS(orig.latitude))))
    HAVING distance < 7000
),
location_distances_hierarchy (parent_loc_id, loc_id, distance) AS (
    SELECT ll.location_id, ld.loc_id, ld.distance
    FROM locations_locations ll,
         location_distances ld
    WHERE ll.rel_loc_id = ld.loc_id
)
SELECT c.id AS cours_id,
       GROUP_CONCAT(DISTINCT CONCAT(cl.type,'-',cl.home,'-',cl.e,'-',ldh.distance,'-',ldh.loc_id)) AS all_loc
FROM cours2 c
JOIN cours_locations cl ON cl.domain = '4' AND c.id = cl.cours_id
JOIN location_distances_hierarchy ldh ON cl.location_id = ldh.parent_loc_id
WHERE c.active_today = '1'
  AND c.subject_id = 404
GROUP BY c.id;
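The outer query already removes duplicates with GROUP_CONCAT(DISTINCT ...), so dropping DISTINCT from the CTE does not change the result. It matters for the plan because a CTE that uses DISTINCT cannot be merged into the outer query, whereas one without it usually can. Merging lets the optimizer join locations_locations directly instead of building a temporary table for location_distances_hierarchy. Unlike the first two solutions, this approach needs no extra table to maintain.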
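One way to check how DISTINCT affects merging is to compare the plans of two reduced queries. This is a sketch, assuming derived_merge is enabled in optimizer_switch and using only the cours_locations and locations_locations tables from the article; the reduced queries are illustrations, not the production query.

```sql
-- With DISTINCT: the derived table cannot be merged, so the plan is
-- expected to contain a <derived2> row for the materialized result.
EXPLAIN
SELECT cl.cours_id
FROM cours_locations cl
JOIN (
    SELECT DISTINCT ll.location_id
    FROM locations_locations ll
) d ON cl.location_id = d.location_id;

-- Without DISTINCT: the derived table can be merged, so the plan is
-- expected to list locations_locations (ll) directly.
EXPLAIN
SELECT cl.cours_id
FROM cours_locations cl
JOIN (
    SELECT ll.location_id
    FROM locations_locations ll
) d ON cl.location_id = d.location_id;
```

The two queries can return different row counts, since only the first removes duplicates. The comparison is about the shape of the plan, not the result.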
Conclusion
In conclusion, this article has discussed a common SQL performance issue with derived tables in the FROM clause: a CTE that the optimizer has to materialize into a temporary table before it can join it. Precomputing the result into a regular table, indexing that table on its join column, or rewriting the CTE so that it can be merged into the outer query can each improve performance. Compare the EXPLAIN output before and after any change to confirm the effect.
Example Use Case
Suppose we maintain a course directory, as the table names in the query suggest, and want to list the courses offered near a city that a visitor selects.
To solve this problem, we define two CTEs: location_distances and location_distances_hierarchy. The first computes the distance from the selected city to each nearby location, while the second maps those locations to their parents through locations_locations.
We then join the result with cours2 and cours_locations to retrieve the matching courses. If the same origin is queried often, precomputing and indexing the result as in Solutions 1 and 2 avoids repeating the distance calculation on every request.
Code
Here are the statements for the solutions described above in one place:
-- Solution 1: precompute the hierarchy into a regular table.
-- CREATE TABLE ... AS SELECT fills the table, so no separate INSERT is needed.
CREATE TABLE location_distances_hierarchy_materialized AS
SELECT DISTINCT ll.location_id AS parent_loc_id, ld.loc_id, ld.distance
FROM locations_locations ll
JOIN (
    SELECT dest.id AS loc_id,
           ROUND(1000 * 6371.03 * 2 * ASIN(SQRT(
               POWER(SIN((orig.latitude - ABS(dest.latitude)) * PI()/180 / 2), 2)
               + COS(orig.latitude * PI()/180) * COS(ABS(dest.latitude) * PI()/180)
               * POWER(SIN((orig.longitude - dest.longitude) * PI()/180 / 2), 2)
           ))) AS distance
    FROM locations orig,
         locations dest
    WHERE orig.id = 14861
      AND (dest.type='V' OR dest.type='A')
      AND dest.latitude BETWEEN orig.latitude - (6000 / 1000 / 111.045) AND orig.latitude + (6000 / 1000 / 111.045)
      AND dest.longitude BETWEEN orig.longitude - (6000 / 1000 / (111.045 * COS(RADIANS(orig.latitude)))) AND orig.longitude + (6000 / 1000 / (111.045 * COS(RADIANS(orig.latitude))))
    HAVING distance < 7000
) ld ON ll.rel_loc_id = ld.loc_id;
-- Solution 2: MySQL and MariaDB cannot index a view, so the index goes on
-- the precomputed table's join column instead.
CREATE INDEX idx_ldh_parent_loc_id ON location_distances_hierarchy_materialized (parent_loc_id);
-- Rewrite the query using derived table with joins
WITH location_distances (loc_id, distance) AS (
    SELECT dest.id AS loc_id,
           ROUND(1000 * 6371.03 * 2 * ASIN(SQRT(
               POWER(SIN((orig.latitude - ABS(dest.latitude)) * PI()/180 / 2), 2)
               + COS(orig.latitude * PI()/180) * COS(ABS(dest.latitude) * PI()/180)
               * POWER(SIN((orig.longitude - dest.longitude) * PI()/180 / 2), 2)
           ))) AS distance
    FROM locations orig,
         locations dest
    WHERE orig.id = 14861
      AND (dest.type='V' OR dest.type='A')
      AND dest.latitude BETWEEN orig.latitude - (6000 / 1000 / 111.045) AND orig.latitude + (6000 / 1000 / 111.045)
      AND dest.longitude BETWEEN orig.longitude - (6000 / 1000 / (111.045 * COS(RADIANS(orig.latitude)))) AND orig.longitude + (6000 / 1000 / (111.045 * COS(RADIANS(orig.latitude))))
    HAVING distance < 7000
),
location_distances_hierarchy (parent_loc_id, loc_id, distance) AS (
    SELECT ll.location_id, ld.loc_id, ld.distance
    FROM locations_locations ll,
         location_distances ld
    WHERE ll.rel_loc_id = ld.loc_id
)
SELECT c.id AS cours_id,
       GROUP_CONCAT(DISTINCT CONCAT(cl.type,'-',cl.home,'-',cl.e,'-',ldh.distance,'-',ldh.loc_id)) AS all_loc
FROM cours2 c
JOIN cours_locations cl ON cl.domain = '4' AND c.id = cl.cours_id
JOIN location_distances_hierarchy ldh ON cl.location_id = ldh.parent_loc_id
WHERE c.active_today = '1'
  AND c.subject_id = 404
GROUP BY c.id;
Note that the code above is just an example and may need to be modified to fit your specific use case.
Last modified on 2023-06-09