Skip to content

[Model] ClosestSubstring #1033

@zhangjieqingithub

Description

@zhangjieqingithub

Motivation

ClosestSubstring is the window-selection version of the closest-string problem and is not currently in the catalog.

It is useful because it models motif discovery: each input string may contain a length-ell occurrence of a hidden pattern, and the goal is to find a center pattern minimizing the worst Hamming distance to one selected window in each string.

Associated rule:

Definition

Name: ClosestSubstring
Reference: Ming Li, Bin Ma, and Lusheng Wang, "On the closest string and substring problems," Journal of the ACM 49(2):157-171, 2002. https://doi.org/10.1145/506147.506150

Let Sigma = {0, ..., q-1} be a finite alphabet. The input is:

  • strings s_1, ..., s_n over Sigma, not necessarily of equal length,
  • a substring length ell, with ell <= |s_i| for every input string.

A solution consists of:

  • a center string c in Sigma^ell,
  • for each input string s_i, a start position p_i selecting the length-ell substring s_i[p_i .. p_i + ell).

The objective is to minimize:

max_i d_H(c, s_i[p_i .. p_i + ell)).

So this is a minimization problem.

Variables

Let q = alphabet_size, ell = substring_length, and W_i = |s_i| - ell + 1 be the number of length-ell windows in string s_i.

  • Center variables: ell variables, each in {0, ..., q-1}
  • Window variables: one start-position variable per input string, where variable p_i has domain {0, ..., W_i - 1}

A configuration is syntactically feasible if the center symbols are in the alphabet and each selected start position is valid. Its objective value is the maximum Hamming distance between the center and the selected windows.

Schema (data type)

Type name: ClosestSubstring

Field Mathematical type Description
alphabet_size positive integer Size q of the finite alphabet
strings list of strings over {0,...,q-1} Input strings s_1,...,s_n
substring_length nonnegative integer Window length ell

Suggested size fields:

  • alphabet_size
  • num_strings
  • substring_length
  • total_length = sum_i |s_i|
  • total_num_windows = sum_i (|s_i| - substring_length + 1)

Input validation should require:

  • alphabet_size > 0 unless all relevant lengths are zero,
  • substring_length <= |s_i| for every string,
  • every symbol is < alphabet_size.

Complexity

The direct exact baseline enumerates all center strings and all window choices.

This gives:

O(alphabet_size^substring_length * product_i(|s_i| - substring_length + 1) * num_strings * substring_length).

A registry complexity expression can use:

alphabet_size ^ substring_length * num_window_choice_product.

The implementation should expose num_window_choice_product or use a conservative expression derived from total_length if product-valued size fields are not desired.

Extra Remark

ClosestString is the special case where all input strings have length ell; then each string has exactly one possible window.

Reduction Rule Crossref

How to solve

Example Instance

Use binary alphabet Sigma = {0,1}, substring length ell = 3, and input strings:

String Value Length-3 windows
s_1 00011 000, 001, 011
s_2 10100 101, 010, 100
s_3 11001 110, 100, 001

Expected Outcome

An optimal center string is:

c = 010.

Choose windows:

  • from s_1, choose 000, distance 1,
  • from s_2, choose 010, distance 0,
  • from s_3, choose 110, distance 1.

So the objective value is 1.

The optimum is not 0, because the three sets of length-3 windows have empty intersection:

  • s_1 windows: {000, 001, 011}
  • s_2 windows: {101, 010, 100}
  • s_3 windows: {110, 100, 001}

Therefore no length-3 center occurs exactly as a substring in all three strings.

BibTeX

@article{LiMaWang2002ClosestStringSubstring,
  author  = {Ming Li and Bin Ma and Lusheng Wang},
  title   = {On the Closest String and Substring Problems},
  journal = {Journal of the ACM},
  volume  = {49},
  number  = {2},
  pages   = {157--171},
  year    = {2002},
  doi     = {10.1145/506147.506150},
  url     = {https://doi.org/10.1145/506147.506150}
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    modelA model problem to be implemented.

    Type

    No type

    Projects

    Status

    No status

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions