pyProCT is a clustering framework designed to analyze large ensembles of protein conformations, with a strong focus on protein-protein docking and structural similarity clustering.
This repository is a Python 3 compatible fork of the original pyProCT project, preserving its original philosophy while updating the codebase to work with modern Python, NumPy, SciPy, and Cython.
Migration and validation details are tracked in docs/PYTHON3_MIGRATION.md.
pyProCT is a modular framework that:
- Computes distance matrices between structures (typically L-RMSD)
- Applies multiple clustering algorithms
- Evaluates cluster quality using different metrics
- Selects the best clustering automatically
- Generates postprocessing outputs (clusters, representatives, statistics)
Originally designed for docking decoy analysis, it is still very well suited for that purpose.
Key differences:
- Python 3.9+ compatible
- External `pyRMSD` is no longer a mandatory dependency; this fork includes a local compatibility wrapper for the API used by pyProCT
- Scheduler, analysis pipeline and postprocessing loader were fixed
- Cython extensions were updated and recompiled:
  - DBSCAN
  - Spectral clustering
- NumPy deprecations fixed:
  - `np.float` → `float` / `np.float64`
  - `np.int` → `int` / `np.int64`
- SciPy API updated: `eigvals` → `subset_by_index`
- JSON schemas slightly clarified (parameter names matter)

The goal of this fork is functionality and reproducibility, not feature expansion.

Clone the repository first:

    git clone https://github.com/pyDock/pyProCT.git
    cd pyProCT

This environment targets pyProCT alone. It pins the packages validated in the standalone Python 3 migration environment; some scientific packages are installed from PyPI wheels to reproduce those exact versions.

    conda env create -f environment-pyproct.yml
    conda activate pyproct

Install pyProCT in editable mode and run the test suite. The editable install builds the DBSCAN, spectral and metric Cython extensions.

    python -m pip install -e .
    python -m unittest discover pyproct -p 'Test*.py'

`PYTHONNOUSERSITE=1` is also defined in the YAML to avoid importing packages from `~/.local`.

Expected result:

    Ran 232 tests
    OK (skipped=32)

This environment preserves the PyDock4 scientific pins while adding the dependencies needed by pyProCT.

    conda env create -f environment-pydock4-pyproct.yml
    conda activate pydock4-pyproct

Install pyProCT in editable mode and run the test suite. The editable install builds the DBSCAN, spectral and metric Cython extensions.

    python -m pip install -e .
    python -m unittest discover pyproct -p 'Test*.py'

The combined environment keeps these PyDock4 pins:

    numpy=1.23.5
    scipy=1.15.2
    pandas=1.5.3
    biopython=1.85
    cython=3.1.2
    setuptools=59.8.0

`fastcluster=1.2.6` is pinned intentionally to avoid NumPy 2 builds that are incompatible with the PyDock4 `numpy=1.23.5` pin. The combined environment was validated with the pyProCT test suite and `validation/bidimensional`.

Expected result:

    Ran 232 tests
    OK (skipped=32)

Run validation/bidimensional only from a temporary copy, not from the original
validation folder. See docs/PYTHON3_MIGRATION.md
for the full validation workflow and baseline values.

    python -m pyproct.main config.json

Where `config.json` defines:
- input structures
- clustering algorithms
- evaluation criteria
- postprocessing actions
pyProCT typically works with condensed distance matrices (as in SciPy).
In docking applications, distances usually represent L-RMSD (Å).

Typical observed ranges:
- min ≈ 0.7 Å
- median ≈ 50 Å
- p95 ≈ 80 Å
- max ≈ 85 Å

This scale is important when choosing clustering parameters.
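
Before choosing parameters, it can help to print the same kind of summary for your own matrix. The following is a minimal sketch and not part of pyProCT: it assumes the distances are already available as a plain Python list in SciPy's condensed order (upper triangle, row by row), and the helper names `condensed_index` and `distance_summary` are hypothetical.

```python
import statistics


def condensed_index(i, j, n):
    """Position of the pair (i, j) in a condensed matrix of n elements.

    Follows the upper-triangle, row-by-row order used by
    scipy.spatial.distance.squareform.
    """
    if i == j:
        raise ValueError("the diagonal is not stored in condensed form")
    if i > j:
        i, j = j, i
    return n * i - i * (i + 1) // 2 + (j - i - 1)


def distance_summary(condensed):
    """Return min, median, 95th percentile and max of the distances."""
    ordered = sorted(condensed)
    # quantiles(n=20) yields 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(ordered, n=20, method="inclusive")[-1]
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p95": p95,
        "max": ordered[-1],
    }


# Toy data: 4 structures -> 4 * 3 / 2 = 6 pairwise distances (in Å).
toy = [1.0, 40.0, 42.0, 41.0, 43.0, 2.0]
summary = distance_summary(toy)
print(summary)
print(toy[condensed_index(2, 3, 4)])  # distance between structures 2 and 3
```

If the median is around 50 Å, as in the ranges above, a cutoff of a few Å only groups near-identical poses, which is usually what a docking analysis wants.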

| Algorithm | Status | Notes |
|---|---|---|
| gromos | Stable | Recommended for docking |
| dbscan | Stable | Parameter sensitive |
| kmedoids | Stable | Requires K |
| hierarchical | Stable | Cutoff critical |
| spectral | Stable | Computationally expensive |
| random | | For comparison only |

Example `gromos` parameters:

    "gromos": {
      "parameters": [
        { "cutoff": 4.0 },
        { "cutoff": 6.0 },
        { "cutoff": 8.0 }
      ]
    }

- `cutoff` = maximum RMSD (Å) to consider two structures neighbors
- Typical values:
  - 2-4 Å: very strict
  - 6-8 Å: flexible docking

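
To make the role of `cutoff` concrete, here is a minimal sketch of a GROMOS-style greedy procedure on a small square distance matrix. It is an illustration, not pyProCT's implementation: the function name `gromos_sketch`, the toy matrix, and the choice of `<=` for the neighbor test are assumptions.

```python
def gromos_sketch(dist, cutoff):
    """Greedy GROMOS-style clustering on a square distance matrix.

    Repeatedly picks the structure with the most neighbors within
    `cutoff`, makes it a cluster center, and removes the center and
    its neighbors from the pool.
    """
    remaining = set(range(len(dist)))
    clusters = []
    while remaining:
        best_center, best_members = None, []
        for i in sorted(remaining):
            # The structure itself is always included (distance 0).
            members = [j for j in sorted(remaining) if dist[i][j] <= cutoff]
            if len(members) > len(best_members):
                best_center, best_members = i, members
        clusters.append((best_center, best_members))
        remaining -= set(best_members)
    return clusters


# Toy matrix (Å): structures 0-2 are similar, 3 and 4 are 5 Å apart.
dist = [
    [0.0, 2.0, 3.0, 50.0, 50.0],
    [2.0, 0.0, 2.5, 50.0, 50.0],
    [3.0, 2.5, 0.0, 50.0, 50.0],
    [50.0, 50.0, 50.0, 0.0, 5.0],
    [50.0, 50.0, 50.0, 5.0, 0.0],
]

for cutoff in (4.0, 6.0):
    print(cutoff, gromos_sketch(dist, cutoff))
```

With the stricter cutoff, structures 3 and 4 stay in separate clusters; raising it to 6 Å merges them, which is the trade-off the typical values above describe.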
"dbscan": {
"parameters": [
{ "eps": 10.0, "minpts": 2 },
{ "eps": 15.0, "minpts": 2 },
{ "eps": 20.0, "minpts": 2 }
]
}Interpretation (important):
eps= maximum RMSD distance (ร )minpts= minimum number of neighbors to form a cluster
If eps is too small โ 0 clusters
If eps is large โ fewer, larger clusters
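
The effect of `eps` can be reproduced with a minimal DBSCAN sketch on a toy matrix. This is not the Cython implementation shipped with pyProCT: the function name `dbscan_sketch` is hypothetical, the toy distances come from 1-D coordinates rather than real RMSD values, and it assumes a point does not count as its own neighbor.

```python
def dbscan_sketch(dist, eps, minpts):
    """Minimal DBSCAN on a square distance matrix.

    Returns (clusters, noise). A point is a core point when at least
    `minpts` other points lie within `eps` (assumption: the point
    itself is not counted).
    """
    n = len(dist)
    neighbors = [
        [j for j in range(n) if j != i and dist[i][j] <= eps] for i in range(n)
    ]
    labels = [None] * n  # None = unvisited, -1 = noise, >= 0 = cluster id
    cluster_id = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < minpts:
            labels[i] = -1
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # former noise becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= minpts:
                queue.extend(neighbors[j])  # expand only through core points
    clusters = [[] for _ in range(cluster_id + 1)]
    noise = []
    for i, label in enumerate(labels):
        (noise if label == -1 else clusters[label]).append(i)
    return clusters, noise


# Toy distances built from 1-D coordinates (a stand-in for RMSD values).
coords = [0, 8, 16, 60, 68, 120]
dist = [[abs(a - b) for b in coords] for a in coords]

for eps in (5, 10, 60):
    clusters, noise = dbscan_sketch(dist, eps, minpts=1)
    print(eps, clusters, noise)
```

On this toy data the three `eps` values give zero, two and one clusters respectively, matching the interpretation above.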
"kmedoids": {
"parameters": [
{ "k": 5 },
{ "k": 10 },
{ "k": 20 }
]
}- Requires knowing approximately how many clusters you expect
- Very stable algorithm
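
As a rough illustration of what `k` controls, the sketch below alternates between assigning elements to their closest medoid and recomputing each medoid. It is a simplified variant written for this README, not pyProCT's k-medoids: the function names are hypothetical and the seeding (the first `k` elements) is a deliberately naive assumption.

```python
def assign(dist, medoids):
    """Assign every element to its closest medoid."""
    clusters = {m: [] for m in medoids}
    for i in range(len(dist)):
        closest = min(medoids, key=lambda m: dist[i][m])
        clusters[closest].append(i)
    return clusters


def medoid_of(dist, members):
    """Member with the smallest summed distance to the other members."""
    return min(members, key=lambda c: sum(dist[c][j] for j in members))


def kmedoids_sketch(dist, k, max_iter=100):
    """Alternate assignment and medoid update until the medoids are stable."""
    medoids = list(range(k))  # naive, deterministic seeding (assumption)
    for _ in range(max_iter):
        clusters = assign(dist, medoids)
        updated = [medoid_of(dist, clusters[m]) for m in medoids]
        if updated == medoids:
            break
        medoids = updated
    return assign(dist, medoids)


# Toy distances built from 1-D coordinates (a stand-in for RMSD values).
coords = [0, 8, 16, 60, 68, 120]
dist = [[abs(a - b) for b in coords] for a in coords]

print(kmedoids_sketch(dist, k=2))
```

The result maps each medoid to its members. Unlike the cutoff-based methods, every element ends up in some cluster, so an isolated pose (element 5 here) is absorbed by the nearest group.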
"hierarchical": {
"parameters": [
{ "method": "average", "cutoff": 6.0 },
{ "method": "average", "cutoff": 8.0 },
{ "method": "average", "cutoff": 10.0 }
]
}Notes:
averageis usually better thancompletefor RMSD- Very sensitive to
cutoff - Can generate many singletons if cutoff is small
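
The sensitivity to `cutoff` is easy to see with a small agglomerative sketch using average linkage. It is a plain-Python illustration under stated assumptions, not the routine pyProCT calls: the function name `average_linkage_sketch` is hypothetical and merging stops as soon as the closest pair of clusters is farther apart than `cutoff`.

```python
def average_linkage_sketch(dist, cutoff):
    """Agglomerative clustering with average linkage.

    Merges the two closest clusters until the smallest average
    inter-cluster distance exceeds `cutoff`.
    """
    clusters = [[i] for i in range(len(dist))]

    def average_distance(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        pairs = [
            (average_distance(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        ]
        best, x, y = min(pairs)
        if best > cutoff:
            break
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(clusters)


# Toy distances built from 1-D coordinates (a stand-in for RMSD values).
coords = [0, 8, 16, 60, 68, 120]
dist = [[abs(a - b) for b in coords] for a in coords]

for cutoff in (5, 10, 15):
    print(cutoff, average_linkage_sketch(dist, cutoff))
```

On this toy data a cutoff of 5 leaves six singletons, while 10 and 15 give progressively larger groups, so scanning a few cutoffs, as in the configuration above, is usually worthwhile.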
"spectral": {
"parameters": [
{ "max_clusters": 10 },
{ "max_clusters": 20 }
],
"force_sparse": false
}- More expensive than other methods
- Useful for non-convex cluster shapes
- Requires well-scaled distance matrices
"random": {
"parameters": [
{ "num_of_clusters": 2 },
{ "num_of_clusters": 5 }
]
}- Not a real clustering algorithm
- Useful as a baseline for evaluation metrics
For docking applications, Silhouette and Cohesion are the most informative.

Example:

    "evaluation": {
      "evaluation_criteria": {
        "criteria_0": {
          "Silhouette": {
            "action": ">",
            "weight": 1
          }
        }
      },
      "maximum_noise": 30,
      "minimum_cluster_size": 1
    }

Notes:
- Silhouette can be `NaN` for 1-cluster solutions (this is expected)
- Some algorithms may generate valid clusterings that are later rejected by evaluation filters

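
The `NaN` case follows directly from how the silhouette is defined: each element is compared with the nearest other cluster, and with a single cluster there is none. The sketch below computes the mean silhouette from a square distance matrix. It is an illustration, not pyProCT's metric code: the name `silhouette_sketch` is hypothetical, and scoring singletons as 0 is a common convention assumed here.

```python
import math


def silhouette_sketch(dist, clusters):
    """Mean silhouette coefficient for a partition of a distance matrix.

    Returns NaN when there are fewer than two clusters, because the
    "nearest other cluster" term is undefined in that case.
    """
    if len(clusters) < 2:
        return math.nan
    scores = []
    for own in clusters:
        for i in own:
            if len(own) == 1:
                scores.append(0.0)  # common convention for singletons
                continue
            # a: mean distance to the rest of the element's own cluster
            a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
            # b: mean distance to the closest other cluster
            b = min(
                sum(dist[i][j] for j in other) / len(other)
                for other in clusters
                if other is not own
            )
            denom = max(a, b)
            scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy distances built from 1-D coordinates (a stand-in for RMSD values).
coords = [0, 8, 16, 60, 68, 120]
dist = [[abs(a - b) for b in coords] for a in coords]

print(silhouette_sketch(dist, [[0, 1, 2], [3, 4], [5]]))   # compact groups
print(silhouette_sketch(dist, [[0, 3], [1, 4], [2, 5]]))   # mixed groups
print(silhouette_sketch(dist, [[0, 1, 2, 3, 4, 5]]))       # single cluster
```

Higher values indicate tighter, better separated clusters, which is why the example criterion uses `"action": ">"`.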
Valid postprocessing actions:
| KEYWORD | Description |
|---|---|
| representatives | representative structures |
| clusters | PDB files per cluster |
| cluster_stats | per-cluster statistics |
| rmsf | RMSF per cluster |
| centers_and_trace | cluster centers and trajectories |
| compression | redundancy elimination |

`pdb_clusters` was replaced by `clusters`.
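
Older configuration files can be updated mechanically. The sketch below renames the old keyword in a postprocessing section. It assumes that section is a JSON object keyed by the keywords in the table above, which this README does not show explicitly, and the helper name `migrate_postprocess` is hypothetical.

```python
import copy

# Old keyword -> new keyword, as stated in this README.
RENAMED_ACTIONS = {"pdb_clusters": "clusters"}


def migrate_postprocess(section):
    """Return a copy of a postprocessing section with old keywords renamed."""
    migrated = {}
    for keyword, options in section.items():
        new_keyword = RENAMED_ACTIONS.get(keyword, keyword)
        if new_keyword in migrated:
            raise ValueError(f"duplicate postprocessing action: {new_keyword}")
        migrated[new_keyword] = copy.deepcopy(options)
    return migrated


# Hypothetical old-style section; the empty option objects are placeholders.
old = {"representatives": {}, "pdb_clusters": {}}
print(migrate_postprocess(old))
```

The input is left untouched, so the original file can be kept alongside the migrated one.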
- DBSCAN may legitimately return zero clusters for some parameters
- Hierarchical clustering can generate many singletons
- Spectral clustering is sensitive to matrix scaling
- Random clustering is not meaningful scientifically
- Not all "Improductive clustering search" messages indicate a bug
- Start with GROMOS
- Add DBSCAN with increasing `eps`
- Use Silhouette as main selection criterion
- Inspect cluster representatives visually
- Use hierarchical only for exploratory analysis
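
The first two steps amount to writing parameter sweeps, which can be generated instead of typed by hand. The sketch below rebuilds the `gromos` and `dbscan` fragments shown earlier in this README. The helper name `sweep` is hypothetical, and where the fragment belongs inside `config.json` is not covered here.

```python
import json


def sweep(name, values):
    """Build a 'parameters' list with one entry per value of `name`."""
    return {"parameters": [{name: value} for value in values]}


algorithms = {
    # Step 1: GROMOS with a few cutoffs (Å).
    "gromos": sweep("cutoff", [4.0, 6.0, 8.0]),
    # Step 2: DBSCAN with increasing eps and a fixed minpts.
    "dbscan": {
        "parameters": [{"eps": eps, "minpts": 2} for eps in (10.0, 15.0, 20.0)]
    },
}

print(json.dumps(algorithms, indent=2))
```

Generating the fragment this way keeps parameter names consistent, which matters because the JSON schemas are strict about them.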
Original pyProCT paper:
If you plan to use pyProCT or any of its parts, including its documentation, to write a scientific article,
please consider adding the following citation:
J. Chem. Theory Comput., 2014, 10 (8), pp 3236–3243
This fork provides Python 3 compatibility and maintenance fixes, but does not change the scientific methodology.