Recently we ran into an example of a ‘simple’ query that was doing an immense amount of work. There were a number of reasons behind this, but perhaps the most telling part was the difference between the EXPLAIN (which we routinely run on all system queries) and the EXPLAIN ANALYZE (which we don’t routinely run).

                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Limit  (cost=939.79..939.84 rows=20 width=291)
   ->  Sort  (cost=939.79..941.34 rows=620 width=291)
         Sort Key: t.firstname
         ->  Nested Loop  (cost=5.65..923.30 rows=620 width=291)
               ->  Nested Loop  (cost=5.22..722.71 rows=57 width=4)
                     ->  Index Scan using locations_pkey on locations l2  (cost=0.29..15.43 rows=2 width=1226)
                           Index Cond: (id = ANY ('{189,1}'::integer[]))
                     ->  Bitmap Heap Scan on locations l1  (cost=4.93..353.35 rows=29 width=36)
                           Recheck Cond: (l2.geom ~ centroid)
                           Filter: _st_contains(l2.geom, centroid)
                           ->  Bitmap Index Scan on location_centroid_gist  (cost=0.00..4.92 rows=86 width=0)
                                 Index Cond: (l2.geom ~ centroid)
               ->  Index Scan using users_locationid on users t  (cost=0.42..2.89 rows=63 width=291)
                     Index Cond: (locationid = l1.id)
                     Filter: (id <> '-1'::integer)
(15 rows)

Well, that doesn’t look too bad (apart from the text wrapping!)
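
For reference, the SQL itself isn’t shown above, but reconstructed from the plan’s Recheck and Filter conditions it presumably looked much like the modified query further down, with ST_Contains doing the spatial join (a sketch, so the exact production query may have differed):

EXPLAIN
SELECT t.id, t.firstname, t.surname  -- plus the rest of the user columns shown in the full query below
FROM backoffice.users t
JOIN locations l1 ON l1.id = t.locationid
JOIN locations l2 ON l2.id IN (189,1) AND ST_Contains(l2.geom, l1.centroid)
WHERE t.id != -1
ORDER BY t.firstname ASC
LIMIT 20;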

This query was effectively pulling every user record (there are over one million users), joined to the big location polygon if the user’s location fell inside it. The polygon was the national region for the system (location.id = 1 is a disaggregated 3.2 MB multipolygon with 3,800 islands and a density of just 0.19), and the query was trying to sort all million rows in memory… only to return the first 20 ordered by firstname. That is a substantial job with a very high memory cost. Because it was taking over 2 minutes, we could see that online users were hitting run again, and because (in this case) it was being erroneously pushed onto the master database*, we were exceeding its connection and memory limits (we typically run most read queries through an elastic load balancer into a set of autoscaling replicas).

My first question would be… why do we want to do this query at all? Select all the users from any area, but only return the first 20 ordered by name? I’m guessing it would be very useful for small areas, but if the area selected could be the whole country – over a million users – it is much less useful. So in fact the first optimisation might be to ask: “Do we need the capacity to do this at all, or should we limit it from being run against the national (and even first-level subnational) polygon sets?”

But this is not really a resource-constrained environment, so why so slow? Two main learnings:

1) Inappropriate use of indexes

When we use a function like st_contains or st_intersects, the database has to check every point against every polygon to determine whether the point is contained within the polygon. If the polygon is complex, this can be sped up with a bounding box check: the computer compares the ‘extent’ of the polygon (the rectangle of its minimum and maximum coordinates) with the point we are testing (this bounding box is built when we add the index to the column). If you look at this image of the bounding box for Yogyakarta, the red line marks the absolute maximum area a point could be in if it were in the province. (Even if a point is in this box, it might not be in the province.) Because the bounding box test is just a comparison of maximum and minimum values, it is very, very fast.

[Image: Large-Poly-small-boundary – the Yogyakarta province polygon with its bounding box outlined in red]

However, if a point is not in this box, we can exclude it immediately; only the points that were inside the box need to be tested to see whether they are actually inside the polygon. For a small area the bounding box excludes the vast majority of users straight away, and of those left inside the box roughly 50% turn out to be inside the polygon and 50% outside – but either way the number of points that needs the complex test ( st_contains() ) is small, so the whole query feels quite fast. The GiST index on the geometry column is what makes the cheap bounding box test possible.
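
The index named in both plans is location_centroid_gist, so it was presumably created with something like the statement below (a sketch; the real definition may qualify the schema). The comments show how PostGIS splits ST_Contains into the cheap and expensive halves visible in the first plan.

CREATE INDEX location_centroid_gist ON locations USING gist (centroid);

-- With that index, ST_Contains(l2.geom, l1.centroid) is planned as two steps:
--   l2.geom ~ l1.centroid                    -- cheap bounding box test, answered from the GiST index
--   AND _ST_Contains(l2.geom, l1.centroid)   -- exact, expensive test, run only on rows that pass the box test
-- which is exactly the Recheck Cond / Filter pair in the first plan.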

So what is the problem with using it for the whole country? It’s not a problem in itself, but look at this image of the bounding box for the whole of said country:

80% of the bounding box is just water, and the box contains nearly 100% of the users (I am just outside!). So now we have to test nearly every one of our 1,000,000 users against each of the 3,800 island polygons, as discussed above, and that is … slow.
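
That water figure lines up with the “density of just 0.19” mentioned earlier – the ratio of polygon area to bounding box area – which you can check directly (a quick sanity query, assuming the national polygon really is locations.id = 1):

SELECT ST_Area(geom) / ST_Area(ST_Envelope(geom)) AS fill_ratio
FROM locations
WHERE id = 1;
-- roughly 0.19 here, i.e. about 80% of the bounding box is not covered by the polygon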

So… can we improve the query? Yes, if you are happy to take a risk.

Given that we are really just testing for users in the country, if we are happy to simply select users inside the bounding box – using the && (bounding box overlap) operator – and skip the accurate check that they are actually on land, we can use:

=# EXPLAIN ANALYZE
-# SELECT t.id, t.firstname, t.surname, t.phone, t.email, t.userlanguage,
-# t.password, t.pin, t.locationid, t.area, t.infraid, t.groupid,
-# t.createdby, t.createdon, t.modifiedby, t.modifiedon, t.del,
-# t.picture, t.time_expired, t.msgid, t.deleted_email, t.centroid_proj,
-# t.restclient_id
-# FROM backoffice.users t
-# JOIN locations l1 ON l1.id = t.locationid
-# JOIN locations l2 ON l2.id IN (189,1) AND l2.geom && l1.centroid
-# WHERE t.id != -1
-# ORDER BY t.firstname asc
-# LIMIT 20;
                                                                        QUERY PLAN                                                                         
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1043.47..1043.52 rows=20 width=291) (actual time=2830.328..2830.332 rows=20 loops=1)
   ->  Sort  (cost=1043.47..1050.23 rows=2703 width=291) (actual time=2830.327..2830.331 rows=20 loops=1)
         Sort Key: t.firstname
         Sort Method: top-N heapsort  Memory: 34kB
         ->  Nested Loop  (cost=5.07..971.55 rows=2703 width=291) (actual time=32.132..1841.069 rows=928410 loops=1)
               ->  Nested Loop  (cost=4.64..95.30 rows=249 width=4) (actual time=32.058..306.467 rows=84756 loops=1)
                     ->  Index Scan using locations_pkey on locations l2  (cost=0.29..15.43 rows=2 width=1226) (actual time=0.008..0.014 rows=2 loops=1)
                           Index Cond: (id = ANY ('{189,1}'::integer[]))
                     ->  Bitmap Heap Scan on locations l1  (cost=4.35..39.85 rows=9 width=36) (actual time=16.114..146.073 rows=42378 loops=2)
                           Recheck Cond: (l2.geom && centroid)
                           Heap Blocks: exact=17458
                           ->  Bitmap Index Scan on location_centroid_gist  (cost=0.00..4.35 rows=9 width=0) (actual time=8.981..8.981 rows=42378 loops=2)
                                 Index Cond: (l2.geom && centroid)
               ->  Index Scan using users_locationid on users t  (cost=0.42..2.89 rows=63 width=291) (actual time=0.003..0.015 rows=11 loops=84756)
                     Index Cond: (locationid = l1.id)
                     Filter: (id <> '-1'::integer)
 Planning time: 30.603 ms
 Execution time: 2831.324 ms
(18 rows)

Of course, if there were a user in a neighbouring country who happened to fall within the box, they might also be included, but we think that is unlikely.
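
If you want to quantify that risk once, offline rather than in the hot path, you can count the users who pass the bounding box test but fail the exact test (a one-off sketch against the same tables as above):

SELECT count(*)
FROM backoffice.users t
JOIN locations l1 ON l1.id = t.locationid
JOIN locations l2 ON l2.id = 1                   -- the national polygon
WHERE l2.geom && l1.centroid                     -- passes the cheap bounding box test
  AND NOT ST_Contains(l2.geom, l1.centroid);     -- …but is not actually inside the polygon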

2) Memory usage and Disk swapping

This is a bit different, but when we join all three tables, some rows (where the user’s locationid = 1) become very long: as well as the user data they also carry the location data, which includes the polygon we know is 3.2 MB. Sorting in memory is usually quick, but all of this almost certainly exceeds the memory available to the sort, so the database has to sort on disk instead: it reads a chunk of the data, sorts it, saves it back to disk, repeats this with the next chunk, and then merges the sorted chunks, moving rows between them where it finds them out of order. Sorting on disk is 100 to 10,000 times slower than sorting in memory. And if the query is running more than once at the same time, there is less memory available for each copy, so it has to do more disk swapping… and so it gets slower still.
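
EXPLAIN ANALYZE tells you which case you are in: look at the Sort Method line. A minimal way to experiment (work_mem is allocated per sort operation, so raise it with care; the column names here just mirror the query above):

SHOW work_mem;            -- the per-sort memory budget, typically 4MB by default
SET work_mem = '64MB';    -- session-only change, purely for testing

EXPLAIN ANALYZE
SELECT id, firstname FROM backoffice.users ORDER BY firstname LIMIT 20;
-- 'Sort Method: top-N heapsort  Memory: ...kB'  -> the sort fitted in memory (as in the plan above)
-- 'Sort Method: external merge  Disk: ...kB'    -> the sort spilled to disk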

PostgreSQL is very powerful and our servers are powerful, but this database is now dealing with an enormous amount of data, so we need to be aware of how much data each query is using, think about optimising all query plans – and then test them.