wtorek, 7 kwietnia 2015

Fuzzystrmatch: Why is difference() slow and how to make it faster.

PostgreSQL manual lists this example for the difference function:
SELECT * FROM s WHERE difference(s.nm, 'john') > 2;
The equivalent and faster way to do it is:
SELECT * FROM s WHERE (4 - levenshtein(soundex(s.nm),soundex('john'),99,99,1)) > 2; 
The test data:
CREATE TABLE fuzzy_test AS 
  SELECT md5(i::text) AS str
  FROM generate_series(1,1000000) AS i;
The test:
postgres=# \timing
Timing is on.
postgres=# SELECT Count(Difference(str,md5('1'))) FROM fuzzy_test;
  count
---------
 1000000
(1 row)


Time: 3160,953 ms
postgres=# SELECT Count(Levenshtein(Soundex(str),Soundex(Md5('1')),99,99,1)) FROM fuzzy_test;
  count
---------
 1000000
(1 row)


Time: 2077,191 ms
Why? Because using difference() pg has to calculate the soundex for 'john' on each row and in the other query it's evaluated only once, if we wouldn't use a static parameter difference() would be faster.