pyProCT (Python 3 fork)

pyProCT is a clustering framework designed to analyze large ensembles of protein conformations, with a strong focus on protein–protein docking and structural similarity clustering.

This repository is a Python 3 compatible fork of the original pyProCT project, preserving its original philosophy while updating the codebase to work with modern Python, NumPy, SciPy, and Cython.

Migration and validation details are tracked in docs/PYTHON3_MIGRATION.md.


1. What is pyProCT?

pyProCT is a modular framework that:

  • Computes distance matrices between structures (typically L-RMSD)
  • Applies multiple clustering algorithms
  • Evaluates cluster quality using different metrics
  • Selects the best clustering automatically
  • Generates postprocessing outputs (clusters, representatives, statistics)

Originally designed for docking decoy analysis, it remains well suited to that purpose.


2. Important differences from the original pyProCT

โš ๏ธ This fork is not a drop-in replacement of the original repository.

Key differences:

  • โœ… Python 3.9+ compatible

  • โŒ External pyRMSD is no longer a mandatory dependency; this fork includes a local compatibility wrapper for the API used by pyProCT

  • ๐Ÿ”ง Scheduler, analysis pipeline and postprocessing loader were fixed

  • ๐Ÿง  Cython extensions were updated and recompiled:

    • DBSCAN
    • Spectral clustering
  • ๐Ÿงฎ NumPy deprecations fixed:

    • np.float โ†’ float / np.float64
    • np.int โ†’ int / np.int64
  • ๐Ÿ“ SciPy API updated:

    • eigvals โ†’ subset_by_index
  • ๐Ÿ“„ JSON schemas slightly clarified (parameter names matter)

The goal of this fork is functionality and reproducibility, not feature expansion.
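The NumPy and SciPy updates are mechanical renames; a short sketch of the new spellings (illustrative values, not code from this repository):

```python
import numpy as np
from scipy.linalg import eigh

# NumPy: the removed aliases map onto builtins or explicit dtypes.
x = np.array([1, 2, 3], dtype=np.float64)  # formerly dtype=np.float
n = int(x.sum())                           # formerly np.int(x.sum())

# SciPy: eigh's deprecated eigvals= index range became subset_by_index=.
a = np.diag([3.0, 1.0, 2.0])
w, _ = eigh(a, subset_by_index=[0, 1])     # the two smallest eigenpairs
print(w)  # [1. 2.]
```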


3. Installation (Python 3)

Clone the repository first:

git clone https://github.com/pyDock/pyProCT.git
cd pyProCT

Installing pyProCT only

This environment targets pyProCT alone. It pins the packages validated in the standalone Python 3 migration environment; some scientific packages are installed from PyPI wheels to reproduce those exact versions.

conda env create -f environment-pyproct.yml
conda activate pyproct

Install pyProCT in editable mode and run the test suite. The editable install builds the DBSCAN, spectral and metric Cython extensions.

python -m pip install -e .
python -m unittest discover pyproct -p 'Test*.py'

PYTHONNOUSERSITE=1 is also defined in the YAML to avoid importing packages from ~/.local.

Expected result:

Ran 232 tests
OK (skipped=32)

Installing the combined PyDock4 + pyProCT environment

This environment preserves the PyDock4 scientific pins while adding the dependencies needed by pyProCT.

conda env create -f environment-pydock4-pyproct.yml
conda activate pydock4-pyproct

Install pyProCT in editable mode and run the test suite. The editable install builds the DBSCAN, spectral and metric Cython extensions.

python -m pip install -e .
python -m unittest discover pyproct -p 'Test*.py'

The combined environment keeps these PyDock4 pins:

numpy=1.23.5
scipy=1.15.2
pandas=1.5.3
biopython=1.85
cython=3.1.2
setuptools=59.8.0

fastcluster=1.2.6 is pinned intentionally to avoid NumPy 2 builds that are incompatible with the PyDock4 numpy=1.23.5 pin. The combined environment was validated with the pyProCT test suite and validation/bidimensional.

Expected result:

Ran 232 tests
OK (skipped=32)

Run validation/bidimensional only from a temporary copy, not from the original validation folder. See docs/PYTHON3_MIGRATION.md for the full validation workflow and baseline values.


4. Quick start

python -m pyproct.main config.json

Where config.json defines:

  • input structures
  • clustering algorithms
  • evaluation criteria
  • postprocessing actions
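As an illustrative sketch only, such a file can combine the fragments shown in the sections below. The top-level layout and key names here are assumptions for illustration; check them against the example configurations shipped with pyProCT before use:

```json
{
  "data": {
    "files": ["decoys.pdb"],
    "matrix": { "method": "rmsd" }
  },
  "clustering": {
    "algorithms": {
      "gromos": { "parameters": [{ "cutoff": 6.0 }] }
    },
    "evaluation": {
      "evaluation_criteria": {
        "criteria_0": { "Silhouette": { "action": ">", "weight": 1 } }
      },
      "maximum_noise": 30,
      "minimum_cluster_size": 1
    }
  },
  "postprocess": {
    "representatives": {},
    "clusters": {}
  }
}
```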

5. Distance matrices

pyProCT typically works with condensed distance matrices (as in SciPy).

In docking applications, distances usually represent L-RMSD (Å).

Typical observed ranges:

min ≈ 0.7 Å
median ≈ 50 Å
p95 ≈ 80 Å
max ≈ 85 Å

This scale is important when choosing clustering parameters.
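A condensed matrix stores only the upper triangle of the symmetric pairwise matrix, row by row. A small illustration with scipy.spatial.distance.squareform and hypothetical RMSD values:

```python
import numpy as np
from scipy.spatial.distance import squareform

# Symmetric 4x4 L-RMSD matrix (Å) between four hypothetical structures.
rmsd = np.array([
    [0.0, 2.1, 7.5, 8.0],
    [2.1, 0.0, 6.9, 7.8],
    [7.5, 6.9, 0.0, 1.4],
    [8.0, 7.8, 1.4, 0.0],
])

# Square -> condensed: the n*(n-1)/2 upper-triangle entries, row by row.
condensed = squareform(rmsd)
print(condensed)  # [2.1 7.5 8.  6.9 7.8 1.4]

# Condensed -> square round-trips exactly.
assert np.array_equal(squareform(condensed), rmsd)
```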


6. Clustering algorithms

✅ Supported and tested algorithms

Algorithm     Status       Notes
gromos        ✅ Stable     Recommended for docking
dbscan        ✅ Stable     Parameter sensitive
kmedoids      ✅ Stable     Requires K
hierarchical  ✅ Stable     Cutoff critical
spectral      ✅ Stable     Computationally expensive
random        ⚠️ Baseline   For comparison only

6.1 GROMOS (recommended)

"gromos": {
  "parameters": [
    { "cutoff": 4.0 },
    { "cutoff": 6.0 },
    { "cutoff": 8.0 }
  ]
}
  • cutoff = maximum RMSD (Å) to consider two structures neighbors

  • Typical values:

    • 2–4 Å: very strict
    • 6–8 Å: flexible docking
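The GROMOS scheme (Daura et al.) repeatedly takes the structure with the most neighbors within the cutoff as a cluster center, removes that cluster, and repeats. A minimal sketch on a square distance matrix, for intuition only (this is not pyProCT's implementation):

```python
import numpy as np

def gromos_sketch(dist, cutoff):
    """GROMOS-style clustering on a square distance matrix.

    Returns a list of (center_index, member_indices) tuples.
    """
    remaining = list(range(dist.shape[0]))
    clusters = []
    while remaining:
        # Neighbor counts within the cutoff, among unclustered structures.
        sub = dist[np.ix_(remaining, remaining)]
        counts = (sub <= cutoff).sum(axis=1)
        center = remaining[int(counts.argmax())]
        members = [j for j in remaining if dist[center, j] <= cutoff]
        clusters.append((center, members))
        remaining = [j for j in remaining if j not in members]
    return clusters

# Two tight pairs, far from each other (hypothetical RMSDs in Å).
dist = np.array([
    [0.0, 2.1, 7.5, 8.0],
    [2.1, 0.0, 6.9, 7.8],
    [7.5, 6.9, 0.0, 1.4],
    [8.0, 7.8, 1.4, 0.0],
])
print(gromos_sketch(dist, 4.0))  # [(0, [0, 1]), (2, [2, 3])]
```

At cutoff 4.0 the two pairs separate; at cutoff 10.0 everything merges into a single cluster, which is why the cutoff scan in the JSON above spans several values.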

6.2 DBSCAN

"dbscan": {
  "parameters": [
    { "eps": 10.0, "minpts": 2 },
    { "eps": 15.0, "minpts": 2 },
    { "eps": 20.0, "minpts": 2 }
  ]
}

Interpretation (important):

  • eps = maximum neighbor distance in RMSD (Å)
  • minpts = minimum number of neighbors to form a cluster

If eps is too small → 0 clusters. If eps is too large → fewer, larger clusters.
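To build intuition for the eps / minpts interplay, here is a toy DBSCAN over a precomputed distance matrix (an illustrative re-implementation, not pyProCT's Cython code):

```python
import numpy as np

def dbscan_sketch(dist, eps, minpts):
    """Toy DBSCAN on a square distance matrix.

    Returns one label per structure: -1 for noise, 0..k-1 for clusters.
    Note: a point counts as its own neighbor (dist[i, i] == 0 <= eps).
    """
    n = dist.shape[0]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if (dist[i] <= eps).sum() < minpts:
            continue  # not a core point; stays noise unless claimed later
        labels[i] = cluster
        queue = list(np.flatnonzero(dist[i] <= eps))
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
            if not visited[j]:
                visited[j] = True
                if (dist[j] <= eps).sum() >= minpts:  # expand only from cores
                    queue.extend(np.flatnonzero(dist[j] <= eps))
        cluster += 1
    return labels

dist = np.array([
    [0.0, 2.1, 7.5, 8.0],
    [2.1, 0.0, 6.9, 7.8],
    [7.5, 6.9, 0.0, 1.4],
    [8.0, 7.8, 1.4, 0.0],
])
print(dbscan_sketch(dist, eps=3.0, minpts=2))  # [0 0 1 1]
print(dbscan_sketch(dist, eps=0.5, minpts=2))  # all noise: [-1 -1 -1 -1]
```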


6.3 K-Medoids

"kmedoids": {
  "parameters": [
    { "k": 5 },
    { "k": 10 },
    { "k": 20 }
  ]
}
  • Requires knowing approximately how many clusters you expect
  • Very stable algorithm

6.4 Hierarchical clustering

"hierarchical": {
  "parameters": [
    { "method": "average", "cutoff": 6.0 },
    { "method": "average", "cutoff": 8.0 },
    { "method": "average", "cutoff": 10.0 }
  ]
}

Notes:

  • average is usually better than complete for RMSD
  • Very sensitive to cutoff
  • Can generate many singletons if cutoff is small
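The cutoff sensitivity is easy to reproduce with SciPy's own hierarchical clustering (hypothetical RMSD values; pyProCT wires this up internally):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Two tight pairs of structures, ~7-8 Å apart (hypothetical L-RMSDs).
rmsd = np.array([
    [0.0, 2.1, 7.5, 8.0],
    [2.1, 0.0, 6.9, 7.8],
    [7.5, 6.9, 0.0, 1.4],
    [8.0, 7.8, 1.4, 0.0],
])
Z = linkage(squareform(rmsd), method="average")

for cutoff in (1.0, 6.0, 10.0):
    labels = fcluster(Z, t=cutoff, criterion="distance")
    print(cutoff, labels, "->", labels.max(), "clusters")
```

Cutoff 1.0 leaves every structure a singleton, 6.0 recovers the two pairs, and 10.0 collapses everything into one cluster.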

6.5 Spectral clustering

"spectral": {
  "parameters": [
    { "max_clusters": 10 },
    { "max_clusters": 20 }
  ],
  "force_sparse": false
}
  • More expensive than other methods
  • Useful for non-convex cluster shapes
  • Requires well-scaled distance matrices

6.6 Random (baseline)

"random": {
  "parameters": [
    { "num_of_clusters": 2 },
    { "num_of_clusters": 5 }
  ]
}
  • Not a real clustering algorithm
  • Useful as a baseline for evaluation metrics

7. Evaluation criteria

For docking applications, Silhouette and Cohesion are the most informative.

Example:

"evaluation": {
  "evaluation_criteria": {
    "criteria_0": {
      "Silhouette": {
        "action": ">",
        "weight": 1
      }
    }
  },
  "maximum_noise": 30,
  "minimum_cluster_size": 1
}

Notes:

  • Silhouette can be NaN for 1-cluster solutions (this is expected)
  • Some algorithms may generate valid clusterings that are later rejected by evaluation filters
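The single-cluster NaN is inherent to the metric, not a bug: silhouette needs a "nearest other cluster" to compare against. A hand-rolled sketch (illustration only, not pyProCT's implementation) makes this visible:

```python
import numpy as np

def silhouette_sketch(dist, labels):
    """Mean silhouette score from a square distance matrix."""
    labels = np.asarray(labels)
    uniq = set(labels.tolist())
    if len(uniq) < 2:
        return float("nan")  # no "other cluster" to compare against
    scores = []
    for i, li in enumerate(labels):
        same = np.flatnonzero(labels == li)
        same = same[same != i]
        if same.size == 0:
            scores.append(0.0)  # common convention for singletons
            continue
        a = dist[i, same].mean()              # mean intra-cluster distance
        b = min(dist[i, labels == l].mean()   # nearest other cluster
                for l in uniq if l != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

dist = np.array([
    [0.0, 2.1, 7.5, 8.0],
    [2.1, 0.0, 6.9, 7.8],
    [7.5, 6.9, 0.0, 1.4],
    [8.0, 7.8, 1.4, 0.0],
])
print(silhouette_sketch(dist, [0, 0, 1, 1]))  # ~0.77: well-separated pairs
print(silhouette_sketch(dist, [0, 0, 0, 0]))  # nan: one-cluster solution
```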

8. Postprocessing actions (KEYWORD list)

Valid postprocessing actions:

KEYWORD            Description
representatives    representative structures
clusters           PDB files per cluster
cluster_stats      per-cluster statistics
rmsf               RMSF per cluster
centers_and_trace  cluster centers and trajectories
compression        redundancy elimination

โš ๏ธ Note: pdb_clusters was replaced by clusters.


9. Known limitations

  • DBSCAN may legitimately return zero clusters for some parameters
  • Hierarchical clustering can generate many singletons
  • Spectral clustering is sensitive to matrix scaling
  • Random clustering is not meaningful scientifically
  • Not all "Improductive clustering search" messages indicate a bug

10. Recommended workflow for docking

  1. Start with GROMOS
  2. Add DBSCAN with increasing eps
  3. Use Silhouette as main selection criterion
  4. Inspect cluster representatives visually
  5. Use hierarchical only for exploratory analysis

11. Citation

Original pyProCT paper:

If you use pyProCT or any of its parts, including its documentation, in a scientific article, please cite:
J. Chem. Theory Comput., 2014, 10 (8), pp 3236–3243

This fork provides Python 3 compatibility and maintenance fixes, but does not change the scientific methodology.
