Function std.numeric.gapWeightedSimilarity
The so-called "all-lengths gap-weighted string kernel" computes a
similarity measure between
and s
based on all of their
common subsequences of all lengths. Gapped subsequences are also
included.
t
To understand what
computes,
consider first the case gapWeightedSimilarity
(s
, t
, lambda
)
and the strings lambda
= 1
and s
=
["Hello", "brave", "new", "world"]
. In that case, t
= ["Hello", "new",
"world"]
counts the
following matches:
gapWeightedSimilarity
- three matches of length 1, namely
"Hello"
,"new"
, and"world"
; - three matches of length 2, namely (
"Hello", "new"
), ("Hello", "world"
), and ("new", "world"
); - one match of length 3, namely (
"Hello", "new", "world"
).
The call
simply counts all of
these matches and adds them up, returning 7.
gapWeightedSimilarity
(s
, t
, 1)
string[] s = ["Hello", "brave", "new", "world"]; string[] t = ["Hello", "new", "world"]; assert(gapWeightedSimilarity(s, t, 1) == 7);
Note how the gaps in matching are simply ignored, for example ("Hello", "new"
) is deemed as good a match as ("new",
"world"
). This may be too permissive for some applications. To
eliminate gapped matches entirely, use
:
lambda
= 0
string[] s = ["Hello", "brave", "new", "world"]; string[] t = ["Hello", "new", "world"]; assert(gapWeightedSimilarity(s, t, 0) == 4);
The call above eliminated the gapped matches ("Hello", "new"
),
("Hello", "world"
), and ("Hello", "new", "world"
) from the
tally. That leaves only 4 matches.
The most interesting case is when gapped matches still participate in
the result, but not as strongly as ungapped matches. The result will
be a smooth, fine-grained similarity measure between the input
strings. This is where values of
between 0 and 1 enter
into play: gapped matches are exponentially penalized with the
number of gaps with base lambda
. This means that an ungapped
match adds 1 to the return value; a match with one gap in either
string adds lambda
to the return value; ...; a match with a total
of lambda
n
gaps in both strings adds pow(
to the return
value. In the example above, we have 4 matches without gaps, 2 matches
with one gap, and 1 match with three gaps. The latter match is (lambda
, n)"Hello", "world"
), which has two gaps in the first string and one gap
in the second string, totaling to three gaps. Summing these up we get
4 + 2 *
.
lambda
+ pow(lambda
, 3)
string[] s = ["Hello", "brave", "new", "world"]; string[] t = ["Hello", "new", "world"]; assert(gapWeightedSimilarity(s, t, 0.5) == 4 + 0.5 * 2 + 0.125);
is useful wherever a smooth similarity
measure between sequences allowing for approximate matches is
needed. The examples above are given with words, but any sequences
with elements comparable for equality are allowed, e.g. characters or
numbers. gapWeightedSimilarity
uses a highly optimized dynamic
programming implementation that needs gapWeightedSimilarity
16 * min(s.length,
t.length)
extra bytes of memory and Ο(s.length * t.length
) time
to complete.
Prototype
F gapWeightedSimilarity(alias comp, R1, R2, F)( R1 s, R2 t, F lambda ) if (isRandomAccessRange!R1 && hasLength!R1 && isRandomAccessRange!R2 && hasLength!R2);
Authors
Andrei Alexandrescu, Don Clugston, Robert Jacques