robbat2: (Default)
robbat2 ([personal profile] robbat2) wrote2007-06-12 03:41 pm

PostgreSQL Vs. MySQL for INSERT IGNORE + REPLACE - stored procedures, savepoints and beyond

So lately I've doing a variety of development that uses PostgreSQL. It's come a long way since I last used it a couple of years ago, esp. in the realm of performance. The auto-complete in the psql tool is also a very nice touch, I can get it confused occasionally, but the auto-complete in the similar mysql tool is far less advanced.

There is however one thing that still bugs me, and it's turned up in plenty of other places. This is the issue of INSERT IGNORE/REPLACE that MySQL supports, but PostgreSQL doesn't. The widely documented way to handle these is to used a stored procedure instead, but that has a major shortcoming: it can only insert a single value at a time. This is a major performance limitation, and also suffers concurrency issues. INSERT IGNORE and REPLACE handle multi-valued inserted much more gracefully.

INSERT IGNORE is defined as inserting a data tuple, and if the primary key already exists, that singular tuple is not applied to the table. REPLACE is defined as insert a data tuple, and if the primary key already exists, using the remaining data in the tuple to perform an UPDATE statement.

If anybody has got other ideas that make handling REPLACE and INSERT IGNORE in PostgreSQL easier, I'd love to hear them.

MySQL version

Below is an example:

-- Create our table first.
CREATE TABLE t ( 
 i1 INT, 
 i2 INT, 
 s VARCHAR(64), 
 PRIMARY KEY(i1,i2) );
-- Insert some data
INSERT INTO t (i1,i2,s) VALUES
 (1,1,'foo'),
 (1,2,'bar'),
 (2,1,'zap');
-- INSERT IGNORE example
INSERT IGNORE INTO t (i1,i2,s) VALUES
 (1,1,'skipped 1'),
 (2,2,'yes'),
 (2,1,'skipped 2'),
 (2,3,'yes again');
-- REPLACE example
REPLACE INTO t (i1,i2,s) VALUES
 (1,1,'replaced 1'),
 (1,3,'totally new'),
 (1,2,'replaced 2');

That's enough example for now. The first logical way to solve this would seem to be BEGIN; INSERT INTO t (...); INSERT INTO (...); COMMIT; and ignore the duplicate errors, but unfortunately cause the transaction to rollback.

Stored procedures

The classical stored procedure approach to implementing the REPLACE equivalently, is below. If you wanted to implement INSERT IGNORE instead, you'd drop the UPDATE line.

CREATE OR REPLACE FUNCTION replace_into(INT,INT,VARCHAR(64)) RETURNS bool AS $$
DECLARE tmprow RECORD;
f_i1 ALIAS FOR $1;
f_i2 ALIAS FOR $2;
f_s ALIAS FOR $3;

BEGIN
SELECT INTO tmprow * FROM t WHERE i1=f_i1 AND i2=f_i2;
IF FOUND THEN 
 UPDATE t SET s=f_s WHERE i1=f_i1 AND i2=f_i2;
 RETURN 'f';
ELSE
 INSERT INTO t (i1,i2,s) VALUES (f_i1,f_i2,f_s);
 RETURN 't';
END IF;
END;
$$ LANGUAGE plpgsql;

-- now we use it as follows:
SELECT replace_into(1,1,'replaced 1');

There are a couple of problems with this. If you want to insert multiple rows, this means that you have multiple transactions to perform - this may be a definite performance concern - N pairs of (SELECT, UPDATE or INSERT) must by definition perform worse than a single large INSERT that just leaves out a few of it's values. You also need to write the stored procedure for every table where you need to use INSERT IGNORE or REPLACE, which just sucks. Additionally, because multiple transactions are used, there is the danger of concurrency and race issues depending on the application that uses your data. If you have to ensure a set of rows is in the database, and one of which already exists, but the next 2 must only exist together, you are in serious trouble - PostgreSQL doesn't support nested transactions until version 8.0.

Save-points

If you are lucky enough to use PostgreSQL newer than 8.0, you get nested transactions, and, the UPDATE documentation contains this gem using them:

BEGIN;
-- other operations
SAVEPOINT sp1;
INSERT INTO wines VALUES('Chateau Lafite 2003', '24');
-- Assume the above fails because of a unique key violation,
-- so now we issue these commands:
ROLLBACK TO sp1;
UPDATE wines SET stock = stock + 24 WHERE winename = 'Chateau Lafite 2003';
-- continue with other operations, and eventually
COMMIT;

Using SAVEPOINT would make life easier, but we still have the performance issue of not being able to use a multi-row INSERT. It does have the advantage that you don't need to write custom stored procedures for each table, but you're still stuck with writing custom code for each REPLACE. SAVEPOINT works wonderful as an INSERT IGNORE statement from a programmatic point of view:

Firstly define your prepared INSERT statement to take a single row, with the classic ? in the values tuple. Start a transaction with a BEGIN, then for each row you wish to insert, take a save-point, run the prepared statement with the values, if you hit a key violation, issue rollback to the save-point, otherwise take the next save-point and continue.

So now we have INSERT IGNORE covered in a programmatic sense. You can implement the previous paragraph as a function taking the single-row insert, along with an array of arrays.

We are still stuck on the matter of the REPLACE statement however. Without introspecting our database (a-l´-Ruby-On-Rails), we cannot automatically identify the columns that form the unique or primary key(s). The ultimate objective here is to find a way that the beyond the initial keyword of REPLACE, the query can be handled in a transparent and efficient manner. Introspection would allow automatic creation of the SAVEPOINT+UPDATE above, but that might not be the most efficent way of doing it.

Temporary tables?

If we consider introspecting the database and the REPLACE query, there is a potential that we may gain performance. The following snippet is probably not valid PostgreSQL, but merely serves to illustrate the idea.

-- Create a temporary table.
-- MySQL would do this in memory, but I don't know about PostgreSQL.
CREATE TEMPORARY TABLE t_temp LIKE temp;
-- Now using the exact original query, we can insert all of those values 
-- directly into the temporary table, as there will not be any conflicts.
INSERT INTO t_temp (i1,i2,s) VALUES (...row1...),(...row2...);
-- Now comes the magic
-- First of all we handle the rows that do conflict.
-- If you were doing INSERT IGNORE, you'd leave this line out.
-- We need to know the keys that comprise the unique/primary key.
UPDATE t SET (i1,i2,s) FROM (SELECT t_temp.* FROM t_temp JOIN t USING (i1,i2));
-- Now we handle the distinct rows that don't conflict
INSERT INTO t SELECT t_temp.* FROM t_temp WHERE (i1,i2) NOT IN (SELECT i1,i2 FROM t);

The above requires that the query can be decomposed to identify the primary/unique keys, including those that might not be explicitly specified such as a SERIAL column. Depending on the size of the insert and the existing table, having good query optimizer/planner performance will be critically important.

Insert ignore.

(Anonymous) 2007-06-25 10:10 am (UTC)(link)
For quite simple tables it is also possible to use a subquery to limit what we want to insert. Since insert ignore does not insert rows violating any constraints, this should be more or less functionally equivalent.

Example:

create table foo ( id integer not null, dt integer not null, constraint foo_pkey primary key(id, dt));
create table bar ( id integer not null, dt integer not null, constraint bar_pkey primary key(id, dt));

insert into foo values(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2);
insert into bar values(1, 1), (1, 2), (3, 1), (3, 2);

insert into bar select id, dt from foo where (id, dt) not in (select id, dt from bar);

Re: Insert ignore.

[identity profile] robbat2.livejournal.com 2007-06-25 09:37 pm (UTC)(link)
This is simply a variant of what I discussed in the "Temporary table" section.

Been searching for this as well

[identity profile] lotso.livejournal.com 2007-08-14 08:51 am (UTC)(link)
Seems like I'm out of luck? This will greatly hamper anything which I want to do with the data.

Re: Been searching for this as well

(Anonymous) 2007-08-14 09:03 am (UTC)(link)
yeah, SOL still.

[identity profile] gemmawasyl.livejournal.com 2008-07-13 03:17 pm (UTC)(link)
If I wanted to store my data on the user's drive or on a server, though, I'm pretty-much out of luck.

Re: this may be an alternative

(Anonymous) 2007-08-14 05:04 pm (UTC)(link)
This is a nicer shortcut than the stored procedures, but it has two downsides (which were mentioned in the original post):
1. You need to declare the RULE for every table seperately.
2. The RULE converts _every_ INSERT to a REPLACE (or an ignore). You cannot easily flip between normal inserts, insert-ignore and replace on the same table.

Re: this may be an alternative

(Anonymous) 2008-06-16 08:48 pm (UTC)(link)
how about using Exception?

http://www.postgresql.org/docs/8.0/static/plpgsql-control-structures.html

search for "EXCEPTION WHEN unique_violation THEN"

Re: this may be an alternative

[identity profile] robbat2.livejournal.com 2008-06-16 08:58 pm (UTC)(link)
Mind converting the entire MySQL example? I don't see how it can preserve the performance in the case of multiple-row inserts.

INSERT IGNORE solution

[identity profile] pornel.net (from livejournal.com) 2008-08-05 04:09 pm (UTC)(link)
CREATE OR REPLACE RULE "insert_ignore" AS ON INSERT TO table WHERE EXISTS(SELECT true FROM table WHERE key = NEW.key) DO INSTEAD NOTHING;

This should be able preserve performance of multi-row INSERT IGNORE queries, however I haven't verified that, because my old version of Postgres doesn't support multi-row inserts at all :D

Re: INSERT IGNORE solution

[identity profile] robbat2.livejournal.com 2008-08-05 04:45 pm (UTC)(link)
Thanks, thats an interesting one I hadn't see. Pity you have to do it for every single table, but it's a good start.

Re: INSERT IGNORE solution

[identity profile] wasd42.livejournal.com 2008-08-19 07:22 am (UTC)(link)
That's great, but it makes _every_ query to be INSERT IGNORE, is it not?

Re: INSERT IGNORE solution

[identity profile] robbat2.livejournal.com 2008-08-19 05:04 pm (UTC)(link)
Argh, yeah, you'd be right there. I think the original commentor missed that.

Re: INSERT IGNORE solution

(Anonymous) 2009-01-21 01:42 am (UTC)(link)
In postgres:

select set_myObject(src.f1, src.f2, src.f3)
FROM source_table src;

where set_myObject performs INSERT/IGNORE functionality and returns void (or a scalar value, say the key) and f1, f2, and f3 are fields from your select.

Postgres is much more powerful than you realize.

Re: INSERT IGNORE solution

(Anonymous) 2009-01-29 02:00 am (UTC)(link)
1) in which case you would need to fill in a source table first.
2) in which case you would have to create separate set_myObject() for each table you want this functionality on.

I totally agree.

[identity profile] https://me.yahoo.com/theevancarroll#14e15 (from livejournal.com) 2009-03-20 05:17 pm (UTC)(link)
This is definitely something Postgres needs. Pg is *really* lacking in the data-loading department.

For more information on the COPY, and its drawbacks, check out the wiki.
http://wiki.postgresql.org/wiki/COPY

Re: I totally agree.

[identity profile] robbat2.livejournal.com 2009-03-20 07:57 pm (UTC)(link)
COPY doesn't help either. See that I want to be able to insert multiple values with INSERT IGNORE _or_ REPLACE, on the same table, depending on the workload.

(Anonymous) 2010-01-28 06:09 am (UTC)(link)
Don't blame PostgreSQL. INSERT IGNORE is simply not standard SQL, it's a MySQL eccentricity.

insert into (blah blah) WHERE NOT EXISTS...

[identity profile] robbat2.livejournal.com 2010-01-28 06:43 am (UTC)(link)
Please identify yourself if you're going to comment.

"Using an INSERT .. SELECT .. FROM DUAL ... WHERE NOT EXISTS" is subject to the same limitation as the stored procedure example.
- special variant of every INSERT query.
- no multiple rows in a single insert

It also doesn't give us the ability to do REPLACE.

(Anonymous) 2010-03-16 09:14 pm (UTC)(link)
Using values function to build a table on the fly appears to work:

create table foo(
bid int primary key,
rid int);

insert into foo (bid, rid) values (1,2);

insert into foo (bid, rid) values (2,3);

insert into foo
select bid, rid
from (values (1, 2),(2,3),(3,4)) as li(bid, rid)
where (bid, rid) not in (select bid, rid from cms.test);

[identity profile] robbat2.livejournal.com 2010-03-16 10:32 pm (UTC)(link)
That's a variant of the temporary table approach, but one that's quite interesting, as the table is now implicit.

(Anonymous) 2010-07-15 06:36 pm (UTC)(link)
http://people.planetpostgresql.org/dfetter/index.php?/archives/48-Adding-Only-New-Rows-INSERT-IGNORE,-Done-Right.html

[identity profile] jefnik.livejournal.com 2012-07-21 02:59 am (UTC)(link)
Kudos to Anonymous above, for the simple INSERT IGNORE solution.

In the same vein, here is an alternate Postgres solution for REPLACE, using neither rules nor stored procedures nor any explicitly created temporary tables, and which supports multiple insert records:




Note that this "REPLACE" alternative differs slightly from the MySQL description in that the whole set of DELETEs is done in bulk at once, before any INSERT. This has the advantage that it avoids any possible collision on secondary "unique" constraints involving records in the set meant to be updated. For example, say that there is a non-primary-key unique-constraint on field "uf", and the REPLACE intends to swap the values of "uf" in the first two records. If the deleting/inserting is done one record at a time, then the change to the first record will fail, because once completed, it will leave both of the two first records with the same "uf" value. If, however, the DELETEs are done in bulk before the inserts, then both records #1 and #2 will be gone before any insert takes place, so the collision will not occur.
Edited 2012-07-21 03:06 (UTC)