Fast visualisation of the population structure of pathogens using Stochastic Cluster Embedding.
Paper:
Lees JA, Tonkin-Hill G, Yang Z, Corander J. Mandrake: visualizing microbial population structure by embedding millions of genomes into a low-dimensional representation. Philosophical Transactions of The Royal Society B. 2022;377: 20210237.
https://doi.org/10.1098/rstb.2021.0237
Documentation available at: https://mandrake.readthedocs.io/en/latest/
For installation details, see https://mandrake.readthedocs.io/en/latest/installation.html.
- Install miniconda.
- Run
conda create -n mandrake_env mandrake
to install into a clean environment.
- Run
conda activate mandrake_env
to use the environment.
Refer to the conda-forge documentation if you want to install a CUDA (GPU) enabled version.
To install from source instead, you will need some dependencies, which you can install through conda:
conda create -n mandrake_env python
conda env update -n mandrake_env --file environment.yml
conda activate mandrake_env
You can then clone this repository, and run:
python setup.py install
To compile the GPU version, you will need the CUDA toolkit installed.
If a CUDA compiler (e.g. nvcc) is available, you should see the message:
CUDA found, compiling both GPU and CPU code
otherwise only the CPU version will be compiled:
CUDA not found, compiling CPU code only
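Before building, it can help to check whether nvcc is visible on your PATH. This is a minimal sketch, assuming a POSIX-style shell. The helper name check_nvcc is hypothetical and not part of mandrake, and the build may locate CUDA by other means, so treat it only as a quick sanity check.

```shell
# Hypothetical helper (not part of mandrake): report whether nvcc is on PATH.
check_nvcc() {
    if command -v nvcc >/dev/null 2>&1; then
        echo "nvcc found: $(command -v nvcc)"
    else
        echo "nvcc not found on PATH"
        return 1
    fi
}

# Assumption: without nvcc on PATH, the CPU-only message is the likely outcome.
check_nvcc || echo "Expect a CPU-only build unless CUDA is found another way"
```

If nvcc is reported as missing but the CUDA toolkit is installed, its bin directory may simply not be on your PATH.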
After installing, an example command would look like this:
mandrake --sketches sketchlib.h5 --kNN 500 --cpus 4 --maxIter 1000000
This would use a file sketchlib.h5 created by pp-sketchlib
to calculate accessory distances using 500 nearest neighbours.
Output can be found in numerous files prefixed mandrake.embedding*.
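To see what a run produced, you can list everything matching that prefix. This is a minimal sketch, assuming a POSIX-style shell and that the run wrote its output to the current directory. The helper name list_outputs is hypothetical and not part of mandrake.

```shell
# Hypothetical helper: print files in the current directory whose names
# start with the given prefix (defaults to the prefix mentioned above).
list_outputs() {
    prefix="${1:-mandrake.embedding}"
    for f in "$prefix"*; do
        # An unmatched glob is left as a literal string, so check the file exists
        if [ -e "$f" ]; then
            echo "$f"
        fi
    done
}

list_outputs
```

Passing a different prefix as the first argument lists the files for a run that used another output name.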
Other useful arguments include:
- --alignment: use a fasta alignment to calculate distances
- --accessory: use a presence/absence file (Rtab or similar) to calculate distances
- --distances: use a .npz file from a previous run and skip straight to the embedding step
- --labels: give labels to colour the output by
- --perplexity: change the perplexity of the preprocessing (similar to t-SNE)
- --animate: produce a video of the optimisation
- --use-gpu: use a GPU for the run. Make sure to increase --n-workers.
See the documentation for more details.
