About Me

My photo
I am currently working as a DB Consultant with Ashnik. We are into Open Source and Cloud based enterprise class solutions. Prior to this I was working as a DBA @ Misys India (Banking Services). Working on IBM DB2, Oracle Microsoft SQL Server. Selected as IBM Champion for Data Management 2011. I like bit of photography, cycling, poetry and I just love gadgets!!

Wednesday, August 7, 2013

Things you didn't know about UPDATE statement

I am taking a deviation from the series of posts I am doing for Recovery Concepts and Implementation.
I have already posted two in that series. This one has to come, cause I was recently discussing with some developers and they were having tough time dealing with updates to data based on/using data from some other source table.
I  realized, people generally do not know all the capabilities of an Update statement. The form they know is:

UPDATE TABLE_NAME
set COLUMN_NAME= << expression >> [, COLUMN_NAME= << expression >> ]
where << boolean expression >>
Because they are not aware of the much more flexible syntax they end up exercising all the flexibility they get with expression and boolean expression. We will see an example where a complicated expression is used to achieve something which is fairly simple.

Before I rush you into any technical details, let me explain the table structures to be used in this example.
Table "enterprisedb.update_target"
      Column       |          Type          | Modifiers
-------------------+------------------------+-----------
 column_id         | integer                | not null
 columne_to_update | character varying(100) |
Indexes:
    "update_target_pkey" PRIMARY KEY, btree (column_id)

Table "enterprisedb.update_source"
      Column       |          Type          | Modifiers
-------------------+------------------------+-----------
 column_id         | integer                | not null
 source_for_update | character varying(100) |
Indexes:
    "update_source_pkey" PRIMARY KEY, btree (column_id)

count_update_target | count_update_source
---------------------+---------------------
             1310720 |               10240
(1 row)

edb=# select * from update_source t1 where not exists (select 1 from update_target t2 where t2.column_id=t1.column_id);
 column_id |        source_for_update
-----------+----------------------------------
   2621436 | this is an new value for 2621437
   2621438 | this is an new value for 2621439
   2621440 | this is an new value for 2621441
   

edb=# select count(*) from update_target where columne_to_update is null;
 count
-------
     0
(1 row)
Now suppose I want to update all those rows, in update_target which has a matching ID (column_id) in update_source. For this simple table (where I have to update only one column) the statement may be as simple as below. 

update update_target t1 set columne_to_update =(select source_for_update from update_source t2 where t1.column_id=t2.column_id);

This could be a bit more complex, with multiple select statements (one for each column to be updated). 
Now let's see what the plan looks like for this simple case:

edb=# explain analyze update update_target t1 set columne_to_update =(select source_for_update from update_source t2 where t1.column_id=t2.column_id);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on update_target t1  (cost=0.00..4392654.13 rows=1341439 width=14) (actual time=18187.897..18187.897 rows=0 loops=1)
   ->  Seq Scan on update_target t1  (cost=0.00..4392654.13 rows=1341439 width=14) (actual time=0.058..5259.089 rows=1310720 loops=1)
         SubPlan 1
           ->  Index Scan using update_source_pkey on update_source t2  (cost=0.00..3.27 rows=1 width=30) (actual time=0.002..0.002 rows=0 loops=1310720)
                 Index Cond: (t1.column_id = column_id)
 Total runtime: 18188.005 ms

Keep in mind that the complexity of the plan would had been double if there were two columns to be updated. 

So let me add an additional column here:
edb=# alter table update_target add column column_2_for_update integer;
ALTER TABLE
edb=# alter table update_source add column column_2_source integer;
ALTER TABLE
edb=# update update_source set column_2_source=column_id;
UPDATE 10240

Do you notice an additional Subplan in the plan below?

edb=# explain analyze update update_target t1
	set columne_to_update =(select source_for_update from update_source t2
			where t1.column_id=t2.column_id),
	column_2_for_update=(select column_2_source from update_source t2
			  where t1.column_id=t2.column_id);

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on update_target t1  (cost=0.00..8573070.33 rows=1310720 width=10) (actual time=22256.890..22256.890 rows=0 loops=1)
   ->  Seq Scan on update_target t1  (cost=0.00..8573070.33 rows=1310720 width=10) (actual time=0.096..9104.759 rows=1310720 loops=1)
         SubPlan 1
           ->  Index Scan using update_source_pkey on update_source t2  (cost=0.00..3.27 rows=1 width=30) (actual time=0.002..0.002 rows=0 loops=1310720)
                 Index Cond: (t1.column_id = column_id)
         SubPlan 2
           ->  Index Scan using update_source_pkey on update_source t2  (cost=0.00..3.27 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=1310720)
                 Index Cond: (t1.column_id = column_id)
 Total runtime: 22256.999 ms
(9 rows)

That's a difference of 22%.
So let's fire the query and see its effect.

edb=# update update_target t1
	set columne_to_update =(select source_for_update 
                                     from update_source t2
		              	     where t1.column_id=t2.column_id),
	column_2_for_update=(select column_2_source 
                                     from update_source t2
		              	     where t1.column_id=t2.column_id);

But what has happened to those rows in update_target which were not able to find a match in update_source? Well those would have got themselves updated to null. Let's confirm

edb# select count(*) from update_target where columne_to_update is null;
 
   count
 ---------
 1300483
 (1 row)
 

Actually the above UPDATE statement is supposed to update everything at update_target by fetching data from update_source. If something is not found then it is unknown i.e. NULL. So we need to restrict the rows which needs to be updated, with a filter which says "update only those rows for which a match exists in update_source". So the query will look something like this:

update update_target t1 
set columne_to_update =(select source_for_update from update_source t2 
	where t1.column_id=t2.column_id),
column_2_for_update=(select column_2_source from update_source t2 
	where t1.column_id=t2.column_id)
where exists (select 1 from update_source t3 where t3.column_id=t1.column_id); 

Let's see how this one performs.

edb=#  explain analyze update update_target t1 
	set columne_to_update =(select source_for_update from update_source t2 
		where t1.column_id=t2.column_id),
	column_2_for_update=(select column_2_source from update_source t2 
		where t1.column_id=t2.column_id)
	where exists (select 1 from update_source t3 where t3.column_id=t1.column_id);
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on update_target t1  (cost=14461.94..81679.34 rows=10240 width=16) (actual time=1454.999..1454.999 rows=0 loops=1)
   ->  Hash Join  (cost=14461.94..81679.34 rows=10240 width=16) (actual time=1179.335..1331.664 rows=10237 loops=1)
         Hash Cond: (t3.column_id = t1.column_id)
         ->  Seq Scan on update_source t3  (cost=0.00..172.02 rows=10240 width=10) (actual time=0.013..7.162 rows=10240 loops=1)
         ->  Hash  (cost=11054.07..11054.07 rows=1310720 width=10) (actual time=1178.699..1178.699 rows=1310720 loops=1)
               Buckets: 131072  Batches: 1  Memory Usage: 53760kB
               ->  Seq Scan on update_target t1  (cost=0.00..11054.07 rows=1310720 width=10) (actual time=0.011..710.234 rows=1310720 loops=1)
         SubPlan 1
           ->  Index Scan using update_source_pkey on update_source t2  (cost=0.00..3.27 rows=1 width=30) (actual time=0.004..0.004 rows=1 loops=10237)
                 Index Cond: (t1.column_id = column_id)
         SubPlan 2
           ->  Index Scan using update_source_pkey on update_source t2  (cost=0.00..3.27 rows=1 width=4) (actual time=0.003..0.004 rows=1 loops=10237)
                 Index Cond: (t1.column_id = column_id)
 Total runtime: 1468.033 ms
(14 rows)

The plan would be bit simpler if we had to update just one column.

edb=#  explain analyze update update_target t1 
	set columne_to_update =(select source_for_update from update_source t2 
		where t1.column_id=t2.column_id)
	where exists (select 1 from update_source t3 where t3.column_id=t1.column_id);
QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------
--
 Update on update_target t1  (cost=20556.94..54328.97 rows=10240 width=20) (actual time=1222.157..1222.157 rows=0 loops=1)
   ->  Hash Join  (cost=20556.94..54328.97 rows=10240 width=20) (actual time=1002.328..1105.557 rows=10237 loops=1)
         Hash Cond: (t3.column_id = t1.column_id)
         ->  Seq Scan on update_source t3  (cost=0.00..172.02 rows=10240 width=10) (actual time=0.013..7.244 rows=10240 loops=1)
         ->  Hash  (cost=17149.07..17149.07 rows=1310720 width=14) (actual time=1001.109..1001.109 rows=1310720 loops=1)
               Buckets: 131072  Batches: 1  Memory Usage: 53800kB
               ->  Seq Scan on update_target t1  (cost=0.00..17149.07 rows=1310720 width=14) (actual time=0.011..521.482 rows=1310720 loops=1)
         SubPlan 1
           ->  Index Scan using update_source_pkey on update_source t2  (cost=0.00..3.27 rows=1 width=30) (actual time=0.004..0.004 rows=1 loops=10237
)
                 Index Cond: (t1.column_id = column_id)
 Total runtime: 1237.864 ms
(11 rows)
The difference in response time is of around 12.5%, just because of the additional column.

Let's check if this update goes as per our expectation.

edb=# update update_target t1 
	set columne_to_update =(select source_for_update from update_source t2 
		where t1.column_id=t2.column_id),
	column_2_for_update=(select column_2_source from update_source t2 
		where t1.column_id=t2.column_id)
	where exists (select 1 from update_source t3 where t3.column_id=t1.column_id);
 UPDATE 10237
 
edb=# select count(*) from update_target where columne_to_update is null;
  count
 -------
      0
 (1 row)
Very few people might know that in PostgreSQL the actual syntax for Update command is as below:

[ WITH [ RECURSIVE ] with_query [, ...] ]
UPDATE [ ONLY ] table_name [ * ] [ [ AS ] alias ]
    SET { column_name = { expression | DEFAULT } |
          ( column_name [, ...] ) = ( { expression | DEFAULT } [, ...] ) } [, ...]
    [ FROM from_list ]
    [ WHERE condition | WHERE CURRENT OF cursor_name ]
    [ RETURNING * | output_expression [ [ AS ] output_name ] [, ...] ]
The highlighted part of this syntax is partcularly important. It helps you update records beased on join. So for our requirement a much better and cleaner code is as below:
update update_target t1 
	set columne_to_update=t2.source_for_update ,
	t1.column_2_for_update=t2.column_2_source
	from update_source t2 where t1.column_id=t2.column_id;
	
or 

update update_target t1 
	set (columne_to_update,t1.column_2_for_update)=(t2.source_for_update ,t2.column_2_source)
	from update_source t2 where t1.column_id=t2.column_id;
The plan for above option is
								QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Update on update_target t1  (cost=14688.35..14950.99 rows=10240 width=50) (actual time=1095.327..1095.327 rows=0 loops=1)
   ->  Hash Join  (cost=14688.35..14950.99 rows=10240 width=50) (actual time=945.722..973.844 rows=10237 loops=1)
         Hash Cond: (t2.column_id = t1.column_id)
         ->  Seq Scan on update_source t2  (cost=0.00..172.02 rows=10240 width=44) (actual time=0.012..7.936 rows=10240 loops=1)
         ->  Hash  (cost=11227.12..11227.12 rows=1331239 width=10) (actual time=944.189..944.189 rows=1310720 loops=1)
               Buckets: 262144  Batches: 1  Memory Usage: 53760kB
               ->  Seq Scan on update_target t1  (cost=0.00..11227.12 rows=1331239 width=10) (actual time=0.009..490.419 rows=1310720 loops=1)
 Total runtime: 1111.271 ms
(8 rows)
Let's see how much of an overhead was caused cause of multiple columns. We test it by generating a plan for update of only one column.

edb=# explain analyze update update_target t1 set columne_to_update=t2.source_for_update from update_source t2 where t1.column_id=t2.column_id;
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Update on update_target t1  (cost=15719.74..15982.38 rows=10240 width=46) (actual time=1124.990..1124.990 rows=0 loops=1)
   ->  Hash Join  (cost=15719.74..15982.38 rows=10240 width=46) (actual time=977.541..1005.278 rows=10237 loops=1)
         Hash Cond: (t2.column_id = t1.column_id)
         ->  Seq Scan on update_source t2  (cost=0.00..172.02 rows=10240 width=40) (actual time=0.236..7.509 rows=10240 loops=1)
         ->  Hash  (cost=12015.47..12015.47 rows=1424717 width=10) (actual time=976.187..976.187 rows=1310720 loops=1)
               Buckets: 262144  Batches: 1  Memory Usage: 53760kB
               ->  Seq Scan on update_target t1  (cost=0.00..12015.47 rows=1424717 width=10) (actual time=0.012..524.587 rows=1310720 loops=1)
 Total runtime: 1141.219 ms
(8 rows)
Surprisingly there is hardly any difference in response time and cost as well has not sufferred much. So I can be assured that even if I have to update a third column, the plan will remain more or less the same.
There are some RDBMS implementations which has a dedicated statement (MERGE) for achieving UPSERT (Update/Insert based on some criteria), via that they support something similar to what I have explained here.
Now, I know there will be people talking and commenting about lack of UPSERT in PostgreSQL. Well, I don't quite agree to that. Do you know you can return the details from an UPDATE query and PostgreSQL v9.1 (and onwards) allows you to capitalize that returned in data to insert/delete records. So, it makes it a bit more powerful than the MERGE query. Below is an example.
If you remember there were three rows which did not exist at update_target but were present in update_source (if not then just scroll up)
-- Using RETURNING to achieve an UPSERT/MERGE
edb=# with updated as (update update_target t1 set columne_to_update=t2.source_for_update from update_source t2 where t1.column_id=t2.column_id
	returning t1.column_id)
	insert into update_target
	select * from update_source t_src where t_src.column_id not in
	(select column_id from updated);
INSERT 0 3
I hope now before writing a complex subquery to update your data, you will first try to explore the possibility of using FROM clause. Well I will expect happy users of PostgreSQL to stop complaining about absence of UPSERT feature.
PS: The flexibility explained above is not just limited to UPDATE, it is also available to you when you are doing the deletion. There is a USING clause in DELETE statement.
A detailed documentation for UPDATE and DELETE commands can be found on PostgreSQL website.