robbat2 ([personal profile] robbat2) wrote, 2007-06-12 03:41 pm

PostgreSQL Vs. MySQL for INSERT IGNORE + REPLACE - stored procedures, savepoints and beyond

So lately I've been doing a variety of development that uses PostgreSQL. It's come a long way since I last used it a couple of years ago, especially in the realm of performance. The auto-completion in the psql tool is also a very nice touch; I can get it confused occasionally, but the auto-completion in the similar mysql tool is far less advanced.

There is however one thing that still bugs me, and it's turned up in plenty of other places: the INSERT IGNORE/REPLACE support that MySQL has but PostgreSQL doesn't. The widely documented way to handle these is to use a stored procedure instead, but that has a major shortcoming: it can only insert a single row at a time. This is a major performance limitation, and it also suffers from concurrency issues. INSERT IGNORE and REPLACE handle multi-valued inserts much more gracefully.

INSERT IGNORE is defined as inserting a data tuple, and if the primary key already exists, that singular tuple is simply not applied to the table. REPLACE is defined as inserting a data tuple, and if the primary key already exists, using the remaining data in the tuple to perform an UPDATE instead.

If anybody has got other ideas that make handling REPLACE and INSERT IGNORE in PostgreSQL easier, I'd love to hear them.

MySQL version

Below is an example:

-- Create our table first.
CREATE TABLE t ( 
 i1 INT, 
 i2 INT, 
 s VARCHAR(64), 
 PRIMARY KEY(i1,i2) );
-- Insert some data
INSERT INTO t (i1,i2,s) VALUES
 (1,1,'foo'),
 (1,2,'bar'),
 (2,1,'zap');
-- INSERT IGNORE example
INSERT IGNORE INTO t (i1,i2,s) VALUES
 (1,1,'skipped 1'),
 (2,2,'yes'),
 (2,1,'skipped 2'),
 (2,3,'yes again');
-- REPLACE example
REPLACE INTO t (i1,i2,s) VALUES
 (1,1,'replaced 1'),
 (1,3,'totally new'),
 (1,2,'replaced 2');
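
For reference, here is what t ends up containing after all three statements (hand-worked from the rules above, so treat it as illustrative):

-- SELECT * FROM t ORDER BY i1,i2;
--  i1 | i2 | s
-- ----+----+-------------
--   1 |  1 | replaced 1
--   1 |  2 | replaced 2
--   1 |  3 | totally new
--   2 |  1 | zap
--   2 |  2 | yes
--   2 |  3 | yes again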

That's enough example for now. The first logical way to solve this in PostgreSQL would seem to be BEGIN; INSERT INTO t (...); INSERT INTO t (...); COMMIT; and simply ignore the duplicate-key errors, but unfortunately any such error causes the whole transaction to roll back, as the sketch below shows.
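
A minimal sketch of the failure, assuming the table t and the first three rows from the example above have already been loaded into PostgreSQL:

BEGIN;
INSERT INTO t (i1,i2,s) VALUES (2,2,'yes');
-- The next statement fails with a duplicate key error on (1,1)...
INSERT INTO t (i1,i2,s) VALUES (1,1,'skipped 1');
-- ...and every statement after it is refused with
-- "current transaction is aborted, commands ignored until end of transaction block",
INSERT INTO t (i1,i2,s) VALUES (2,3,'yes again');
-- so this COMMIT actually performs a ROLLBACK, and even (2,2,'yes') is lost.
COMMIT;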

Stored procedures

The classic stored-procedure approach to implementing the REPLACE equivalent is below. If you wanted to implement INSERT IGNORE instead, you'd drop the UPDATE line.

CREATE OR REPLACE FUNCTION replace_into(INT,INT,VARCHAR(64)) RETURNS bool AS $$
DECLARE
 tmprow RECORD;
 f_i1 ALIAS FOR $1;
 f_i2 ALIAS FOR $2;
 f_s ALIAS FOR $3;
BEGIN
 -- Does a row with this primary key already exist?
 SELECT INTO tmprow * FROM t WHERE i1=f_i1 AND i2=f_i2;
 IF FOUND THEN
  -- Yes: update the non-key column; return false to signal "updated, not inserted".
  UPDATE t SET s=f_s WHERE i1=f_i1 AND i2=f_i2;
  RETURN 'f';
 ELSE
  -- No: insert the new row; return true to signal "inserted".
  INSERT INTO t (i1,i2,s) VALUES (f_i1,f_i2,f_s);
  RETURN 't';
 END IF;
END;
$$ LANGUAGE plpgsql;

-- now we use it as follows:
SELECT replace_into(1,1,'replaced 1');

There are a couple of problems with this. If you want to insert multiple rows, you have multiple transactions to perform, which is a definite performance concern: N pairs of (SELECT, then UPDATE or INSERT) must by definition perform worse than a single large INSERT that just leaves out a few of its values. You also need to write the stored procedure for every table where you need INSERT IGNORE or REPLACE, which just sucks. Additionally, because multiple transactions are used, there is the danger of concurrency and race issues depending on the application that uses your data. If you have to ensure that a set of rows ends up in the database, where one row may already exist but the next two must only exist together, you are in serious trouble: PostgreSQL didn't support nested transactions until version 8.0.
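
For instance, replaying the three-row REPLACE from the MySQL example through this function takes three separate calls, each paying for its own SELECT followed by an UPDATE or INSERT:

SELECT replace_into(1,1,'replaced 1');
SELECT replace_into(1,3,'totally new');
SELECT replace_into(1,2,'replaced 2');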

Save-points

If you are lucky enough to be using PostgreSQL 8.0 or newer, you get nested transactions in the form of save-points, and the UPDATE documentation contains this gem using them:

BEGIN;
-- other operations
SAVEPOINT sp1;
INSERT INTO wines VALUES('Chateau Lafite 2003', '24');
-- Assume the above fails because of a unique key violation,
-- so now we issue these commands:
ROLLBACK TO sp1;
UPDATE wines SET stock = stock + 24 WHERE winename = 'Chateau Lafite 2003';
-- continue with other operations, and eventually
COMMIT;

Using SAVEPOINT would make life easier, but we still have the performance issue of not being able to use a multi-row INSERT. It does have the advantage that you don't need to write a custom stored procedure for each table, but you're still stuck writing custom code for each REPLACE. SAVEPOINT works wonderfully as an INSERT IGNORE substitute from a programmatic point of view:

First, define your prepared INSERT statement to take a single row, with the classic ? placeholders in the values tuple. Start a transaction with BEGIN, then for each row you wish to insert: take a savepoint, run the prepared statement with that row's values, and if you hit a key violation, roll back to the savepoint; otherwise take the next savepoint and continue, as sketched below.
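
Roughly, the statement stream the client ends up sending for the INSERT IGNORE example above looks like this (the savepoint names and the roll-back decisions live in your application code, not in SQL):

BEGIN;
SAVEPOINT row1;
INSERT INTO t (i1,i2,s) VALUES (1,1,'skipped 1');
-- duplicate key error, so:
ROLLBACK TO row1;
SAVEPOINT row2;
INSERT INTO t (i1,i2,s) VALUES (2,2,'yes');
SAVEPOINT row3;
INSERT INTO t (i1,i2,s) VALUES (2,1,'skipped 2');
-- duplicate key error, so:
ROLLBACK TO row3;
SAVEPOINT row4;
INSERT INTO t (i1,i2,s) VALUES (2,3,'yes again');
COMMIT;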

So now we have INSERT IGNORE covered in a programmatic sense; you could implement the previous paragraph as a generic helper function taking the single-row INSERT statement along with an array of value arrays.
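
As an aside, on 8.0 or newer the same trick is also available inside PL/pgSQL itself, because a BEGIN ... EXCEPTION block runs under an implicit savepoint. A hypothetical per-row insert_ignore_row() helper for our table t (with the same one-function-per-table drawback as before) could look like:

CREATE OR REPLACE FUNCTION insert_ignore_row(INT,INT,VARCHAR(64)) RETURNS bool AS $$
BEGIN
 INSERT INTO t (i1,i2,s) VALUES ($1,$2,$3);
 RETURN 't';
EXCEPTION WHEN unique_violation THEN
 -- The failed INSERT is undone via the block's implicit savepoint;
 -- the surrounding transaction carries on untouched.
 RETURN 'f';
END;
$$ LANGUAGE plpgsql;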

We are still stuck on the matter of the REPLACE statement, however. Without introspecting our database (à la Ruby on Rails), we cannot automatically identify the columns that form the unique or primary key(s). The ultimate objective here is to find a way that, beyond the initial REPLACE keyword, the query can be handled in a transparent and efficient manner. Introspection would allow automatic creation of the SAVEPOINT+UPDATE above, but that might not be the most efficient way of doing it.
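
For what it's worth, the key columns can be dug out of the information_schema; something along these lines (a sketch only, ignoring schema qualification) should return the primary-key columns of t:

SELECT kcu.column_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON kcu.constraint_name = tc.constraint_name
WHERE tc.table_name = 't'
  AND tc.constraint_type = 'PRIMARY KEY'
ORDER BY kcu.ordinal_position;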

Temporary tables?

If we consider introspecting the database and the REPLACE query, there is the potential for a performance gain. The following snippet may not be exactly valid PostgreSQL, but it serves to illustrate the idea.

-- Create a temporary table.
-- MySQL would do this in memory, but I don't know about PostgreSQL.
CREATE TEMPORARY TABLE t_temp (LIKE t);
-- Now using the exact original query, we can insert all of those values 
-- directly into the temporary table, as there will not be any conflicts.
INSERT INTO t_temp (i1,i2,s) VALUES (...row1...),(...row2...);
-- Now comes the magic
-- First of all we handle the rows that do conflict.
-- If you were doing INSERT IGNORE, you'd leave this line out.
-- We need to know the keys that comprise the unique/primary key.
UPDATE t SET s = t_temp.s FROM t_temp WHERE t.i1 = t_temp.i1 AND t.i2 = t_temp.i2;
-- Now we handle the distinct rows that don't conflict
INSERT INTO t SELECT t_temp.* FROM t_temp WHERE (i1,i2) NOT IN (SELECT i1,i2 FROM t);

The above requires that the query can be decomposed to identify the primary/unique keys, including those that might not be explicitly specified, such as a SERIAL column. Depending on the size of the insert and of the existing table, good query optimizer/planner performance will be critically important.

