Dynamic Programming with Strings

In this post, I'm going to discuss the "Edit Distance" problem, along with a specific case of it - the "Longest Common Subsequence" problem. 

Edit Distance

The problem: Given two strings "a" and "b" - find the smallest number of edits we need to transform "a" into "b." The possible types of edits are "insertion," "deletion," and "replacement." Assume all the edits have the same cost of 1. 

The approach: We can break this problem down into prefix or suffix subproblems. (It doesn't matter which one we pick - the approaches are equivalent.) If we think in terms of prefixes, we can call E(i, j) the minimum cost of transforming the first i characters of "a" into the first j characters of "b." There are four ways we can reduce E(i, j) into smaller subproblems... 

1) If the ith character of "a" and the jth character of "b" are already equal, then we don't have to do anything! Then, E(i, j) is reduced to E(i-1, j-1) without any cost. 

2) We can delete the ith character of "a" and transform the first i-1 characters of "a" into the first j characters of "b." Then, E(i, j) is reduced to 1 + E(i-1, j)

3) We can transform the first i characters of "a" into the first j-1 characters of "b" and insert the jth character of "b." Then, E(i, j) is reduced to E(i, j-1) + 1

4) Finally, we can transform the first i-1 characters of "a" into the first j-1 characters of "b" and replace the ith character of "a" with the jth character of "b." Then, E(i, j) is reduced to E(i-1, j-1) + 1.


This approach is O(mn), where m and n are the respective lengths of the two strings. Written in code, this solution looks like... 

static int[][] tab; 

public static int minEditDistance(String a, String b) {
for(int i = 0; i <= a.length(); i++) {
for(int j = 0; j <= b.length(); j++) {
if(i == 0) tab[i][j] = j; //delete all j characters of b
else if(j == 0) tab[i][j] = i; //delete all i characters of a
else if(a.charAt(i-1) == b.charAt(j-1)) tab[i][j] = tab[i-1][j-1];
else tab[i][j] = 1 + Math.min(Math.min(tab[i-1][j], tab[i][j-1]), tab[i-1][j-1]); 
}
}
return tab[a.length()][b.length()]; 
}

Longest Common Subsequence 

The LCS problem is a special case of the Edit Distance problem where the insertion cost and deletion cost are 1, and the replacement cost is 0 if the two characters are equal and infinity if the two characters are different. 

The problem: Given two strings "a" and "b", find the length of their longest common sequence. 

The approach: Call L(i, j) the length of the LCS of the first i characters in "a" and the first j characters in "b." There are two cases:

1) If the ith character of "a" and the jth character of "b" are equal, then L(i, j) is reduced to 1 + L(i-1, j-1).

2) If they are different, then L(i,j) is reduced to the maximum of L(i-1, j) and L(i, j-1)

My code for this approach is below. Note that if i or j is 0, then the LCS is 0 (there are no common characters because there are no characters in the first place.) 

public static int lcs(String a, String b) {
for(int i = 1; i <= a.length(); i++) {
for(int j = 1; j <= b.length(); j++) {
if(a.charAt(i-1) == b.charAt(j-1)) {
tab[i][j] = 1 + tab[i-1][j-1];
}
else tab[i][j] = Math.max(tab[i][j-1], tab[i-1][j]);
}
}
return tab[a.length()][b.length()]; 

All code on my Github - try running it yourself!

Source(s): MIT OCW, Clemson

Popular posts from this blog

Space Race: a simple Rust game built on the Bevy framework

Building A Toy Language Interpreter in Go

Building a Toy Language Compiler in Go