Motivation
ClosestSubstring is the window-selection version of the closest-string problem and is not currently in the catalog.
It is useful because it models motif discovery: each input string may contain a length-ell occurrence of a hidden pattern, and the goal is to find a center pattern minimizing the worst Hamming distance to one selected window in each string.
Associated rule:
Definition
Name: ClosestSubstring
Reference: Ming Li, Bin Ma, and Lusheng Wang, "On the closest string and substring problems," Journal of the ACM 49(2):157-171, 2002. https://doi.org/10.1145/506147.506150
Let Sigma = {0, ..., q-1} be a finite alphabet. The input is:
- strings
s_1, ..., s_n over Sigma, not necessarily of equal length,
- a substring length
ell, with ell <= |s_i| for every input string.
A solution consists of:
- a center string
c in Sigma^ell,
- for each input string
s_i, a start position p_i selecting the length-ell substring s_i[p_i .. p_i + ell).
The objective is to minimize:
max_i d_H(c, s_i[p_i .. p_i + ell)).
So this is a minimization problem.
Variables
Let q = alphabet_size, ell = substring_length, and W_i = |s_i| - ell + 1 be the number of length-ell windows in string s_i.
- Center variables:
ell variables, each in {0, ..., q-1}
- Window variables: one start-position variable per input string, where variable
p_i has domain {0, ..., W_i - 1}
A configuration is syntactically feasible if the center symbols are in the alphabet and each selected start position is valid. Its objective value is the maximum Hamming distance between the center and the selected windows.
Schema (data type)
Type name: ClosestSubstring
| Field |
Mathematical type |
Description |
alphabet_size |
positive integer |
Size q of the finite alphabet |
strings |
list of strings over {0,...,q-1} |
Input strings s_1,...,s_n |
substring_length |
nonnegative integer |
Window length ell |
Suggested size fields:
alphabet_size
num_strings
substring_length
total_length = sum_i |s_i|
total_num_windows = sum_i (|s_i| - substring_length + 1)
Input validation should require:
alphabet_size > 0 unless all relevant lengths are zero,
substring_length <= |s_i| for every string,
- every symbol is
< alphabet_size.
Complexity
The direct exact baseline enumerates all center strings and all window choices.
This gives:
O(alphabet_size^substring_length * product_i(|s_i| - substring_length + 1) * num_strings * substring_length).
A registry complexity expression can use:
alphabet_size ^ substring_length * num_window_choice_product.
The implementation should expose num_window_choice_product or use a conservative expression derived from total_length if product-valued size fields are not desired.
Extra Remark
ClosestString is the special case where all input strings have length ell; then each string has exactly one possible window.
Reduction Rule Crossref
How to solve
Example Instance
Use binary alphabet Sigma = {0,1}, substring length ell = 3, and input strings:
| String |
Value |
Length-3 windows |
s_1 |
00011 |
000, 001, 011 |
s_2 |
10100 |
101, 010, 100 |
s_3 |
11001 |
110, 100, 001 |
Expected Outcome
An optimal center string is:
c = 010.
Choose windows:
- from
s_1, choose 000, distance 1,
- from
s_2, choose 010, distance 0,
- from
s_3, choose 110, distance 1.
So the objective value is 1.
The optimum is not 0, because the three sets of length-3 windows have empty intersection:
s_1 windows: {000, 001, 011}
s_2 windows: {101, 010, 100}
s_3 windows: {110, 100, 001}
Therefore no length-3 center occurs exactly as a substring in all three strings.
BibTeX
@article{LiMaWang2002ClosestStringSubstring,
author = {Ming Li and Bin Ma and Lusheng Wang},
title = {On the Closest String and Substring Problems},
journal = {Journal of the ACM},
volume = {49},
number = {2},
pages = {157--171},
year = {2002},
doi = {10.1145/506147.506150},
url = {https://doi.org/10.1145/506147.506150}
}
Motivation
ClosestSubstringis the window-selection version of the closest-string problem and is not currently in the catalog.It is useful because it models motif discovery: each input string may contain a length-
elloccurrence of a hidden pattern, and the goal is to find a center pattern minimizing the worst Hamming distance to one selected window in each string.Associated rule:
Definition
Name:
ClosestSubstringReference: Ming Li, Bin Ma, and Lusheng Wang, "On the closest string and substring problems," Journal of the ACM 49(2):157-171, 2002. https://doi.org/10.1145/506147.506150
Let
Sigma = {0, ..., q-1}be a finite alphabet. The input is:s_1, ..., s_noverSigma, not necessarily of equal length,ell, withell <= |s_i|for every input string.A solution consists of:
c in Sigma^ell,s_i, a start positionp_iselecting the length-ellsubstrings_i[p_i .. p_i + ell).The objective is to minimize:
max_i d_H(c, s_i[p_i .. p_i + ell)).So this is a minimization problem.
Variables
Let
q = alphabet_size,ell = substring_length, andW_i = |s_i| - ell + 1be the number of length-ellwindows in strings_i.ellvariables, each in{0, ..., q-1}p_ihas domain{0, ..., W_i - 1}A configuration is syntactically feasible if the center symbols are in the alphabet and each selected start position is valid. Its objective value is the maximum Hamming distance between the center and the selected windows.
Schema (data type)
Type name:
ClosestSubstringalphabet_sizeqof the finite alphabetstrings{0,...,q-1}s_1,...,s_nsubstring_lengthellSuggested size fields:
alphabet_sizenum_stringssubstring_lengthtotal_length = sum_i |s_i|total_num_windows = sum_i (|s_i| - substring_length + 1)Input validation should require:
alphabet_size > 0unless all relevant lengths are zero,substring_length <= |s_i|for every string,< alphabet_size.Complexity
The direct exact baseline enumerates all center strings and all window choices.
This gives:
O(alphabet_size^substring_length * product_i(|s_i| - substring_length + 1) * num_strings * substring_length).A registry complexity expression can use:
alphabet_size ^ substring_length * num_window_choice_product.The implementation should expose
num_window_choice_productor use a conservative expression derived fromtotal_lengthif product-valued size fields are not desired.Extra Remark
ClosestStringis the special case where all input strings have lengthell; then each string has exactly one possible window.Reduction Rule Crossref
How to solve
ILP/i32via [Rule] ClosestSubstring to ILP #1035.Example Instance
Use binary alphabet
Sigma = {0,1}, substring lengthell = 3, and input strings:s_100011000,001,011s_210100101,010,100s_311001110,100,001Expected Outcome
An optimal center string is:
c = 010.Choose windows:
s_1, choose000, distance1,s_2, choose010, distance0,s_3, choose110, distance1.So the objective value is
1.The optimum is not
0, because the three sets of length-3 windows have empty intersection:s_1windows:{000, 001, 011}s_2windows:{101, 010, 100}s_3windows:{110, 100, 001}Therefore no length-3 center occurs exactly as a substring in all three strings.
BibTeX