module Text::Levenshtein

Public Instance Methods

distance(str1, str2) click to toggle source

Calculate the Levenshtein distance between two strings str1 and str2.

In Ruby 1.8, str1 and str2 should be ASCII, UTF-8, or a one-byte-per character encoding such as ISO-8859-*. They will be treated as UTF-8 if $KCODE is set appropriately (i.e. 'u'). Otherwise, the comparison will be performed byte-by-byte. There is no specific support for Shift-JIS or EUC strings.

In Ruby 1.9+, the strings will be processed as UTF-8.

When using Unicode text, be aware that this algorithm does not perform normalisation. If there is a possibility of different normalised forms being used, normalisation should be performed beforehand.

# File lib/text/levenshtein.rb, line 32
def distance(str1, str2)
  prepare =
    if "ruby".respond_to?(:encoding)
      lambda { |str| str.encode(Encoding::UTF_8).unpack("U*") }
    else
      rule = $KCODE.match(/^U/) ? "U*" : "C*"
      lambda { |str| str.unpack(rule) }
    end

  s, t = [str1, str2].map(&prepare)
  n = s.length
  m = t.length
  return m if n.zero?
  return n if m.zero?

  d = (0..m).to_a
  x = nil

  n.times do |i|
    e = i + 1
    m.times do |j|
      cost = (s[i] == t[j]) ? 0 : 1
      x = [
        d[j+1] + 1, # insertion
        e + 1,      # deletion
        d[j] + cost # substitution
      ].min
      d[j] = e
      e = x
    end
    d[m] = x
  end

  return x
end