Implement reduce_dimensions #932
Merged
26 commits
2a02952 adding reduce_dimensions outline (Graciaaa3)
1bd1eda clean up structure for umap, currently working (Graciaaa3)
53dfc76 removing redundant reduction_results parameter (Graciaaa3)
7ec74a2 update umap test to use reduce_dimensions (Graciaaa3)
d8d4840 Revert "update umap test to use reduce_dimensions" (Graciaaa3)
0debc50 update model path to be algorithm specific (Graciaaa3)
7cda6c4 adding reduce_dimensions tests (Graciaaa3)
51b4345 Merge branch 'main' into reduce_dimensions (Graciaaa3)
0c2704c adding pca implementation (Graciaaa3)
2a281ca updating reduce_dimensions tests (Graciaaa3)
7d83119 update tests and load_model for pca to use correct attributes (Graciaaa3)
c395411 adding tsne implementation (Graciaaa3)
5227c62 factor helper methods to base class and update docstring (Graciaaa3)
8a85644 factoring model validate helper function from load_model (Graciaaa3)
2bd7f25 adding tsne test (Graciaaa3)
0621c94 Merge branch 'main' into reduce_dimensions (Graciaaa3)
9c07056 docstring fix for pca and tsne (Graciaaa3)
522a042 deprecating old umap verb (Graciaaa3)
a3f9704 moving 'parallel' config keyword to upper level (Graciaaa3)
627ca1c adding config migration 004 and tests (Graciaaa3)
8ffcb8f Merge branch 'main' into reduce_dimensions (Graciaaa3)
3f36adc format fix (Graciaaa3)
84b8658 Merge branch 'main' into reduce_dimensions (Graciaaa3)
dcce2f0 docstring and warning message fix (Graciaaa3)
64342ab comment and test fix (Graciaaa3)
6dcacbb Merge branch 'main' into reduce_dimensions (Graciaaa3)
77 changes: 77 additions & 0 deletions
src/hyrax/config_migrations/migrations/004_move_umap_to_reduce.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,77 @@ | ||
| """Config migration: version 4 → version 5. | ||
|
|
||
| Move the legacy ``[umap]`` and ``[umap.UMAP]`` tables under the ``[reduce]`` table | ||
| as ``[reduce.umap]`` and ``[reduce.umap.kwargs]``. | ||
| """ | ||
|
|
||
| import tomlkit | ||
| from tomlkit.toml_document import TOMLDocument | ||
|
|
||
| from hyrax.config_migrations.migration_utils import migration_step, move_key | ||
|
|
||
|
|
||
| @migration_step( | ||
| from_version=4, | ||
| key_renames={ | ||
| "umap.fit_sample_size": "reduce.umap.fit_sample_size", | ||
| "umap.model_path": "reduce.umap.model_path", | ||
| "umap.save_fit_umap": "reduce.save_fit_model", | ||
| "umap.parallel": "reduce.parallel", | ||
| "umap.UMAP": "reduce.umap.kwargs", | ||
| }, | ||
| ) | ||
| def move_umap_to_reduce(cfg: TOMLDocument) -> TOMLDocument: | ||
| """Move the legacy ``[umap]`` and ``[umap.UMAP]`` to be under ``[reduce]``.""" | ||
| # Moving umap sections | ||
| umap_tbl = cfg.get("umap") | ||
| if not umap_tbl: | ||
| return cfg | ||
|
|
||
| # Ensure [reduce] exists | ||
| reduce_tbl = cfg.get("reduce") | ||
| if reduce_tbl is None: | ||
| reduce_tbl = tomlkit.table() | ||
| cfg["reduce"] = reduce_tbl | ||
|
|
||
| # Ensure [reduce.umap] exists | ||
| umap_reduce = reduce_tbl.get("umap") | ||
| if umap_reduce is None: | ||
| umap_reduce = tomlkit.table() | ||
| reduce_tbl["umap"] = umap_reduce | ||
|
|
||
| # under [reduce.umap] | ||
| move_key(cfg, "umap.fit_sample_size", "reduce.umap.fit_sample_size") | ||
| move_key(cfg, "umap.model_path", "reduce.umap.model_path") | ||
|
|
||
| # under [reduce] | ||
| reduce_tbl["batch_size"] = 1024 | ||
| move_key(cfg, "umap.save_fit_umap", "reduce.save_fit_model") | ||
| move_key(cfg, "umap.parallel", "reduce.parallel") | ||
| if "name" in umap_tbl and umap_tbl["name"] == "umap.UMAP": | ||
| reduce_tbl["algorithm"] = "umap" | ||
|
|
||
| # Move umap.UMAP kwargs to reduce.umap.kwargs | ||
| move_key(cfg, "umap.UMAP", "reduce.umap.kwargs") | ||
|
|
||
| # Delete the old umap section | ||
| del cfg["umap"] | ||
|
|
||
| # Adding tsne section | ||
| reduce_tbl["tsne"] = tomlkit.table() | ||
|
|
||
| reduce_tbl["tsne"]["kwargs"] = tomlkit.table() | ||
| reduce_tbl["tsne"]["kwargs"]["n_components"] = 2 | ||
| reduce_tbl["tsne"]["kwargs"]["perplexity"] = 30.0 | ||
|
|
||
| # Adding pca section | ||
| reduce_tbl["pca"] = tomlkit.table() | ||
| reduce_tbl["pca"]["fit_sample_size"] = 1024 | ||
| reduce_tbl["pca"]["model_path"] = False | ||
|
|
||
| reduce_tbl["pca"]["kwargs"] = tomlkit.table() | ||
| reduce_tbl["pca"]["kwargs"]["n_components"] = 2 | ||
|
|
||
| if len(reduce_tbl): | ||
| cfg["reduce"] = reduce_tbl | ||
|
|
||
| return cfg | ||
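
To make the effect of this migration concrete, here is a minimal sketch that replays the same key moves on a plain nested `dict` instead of a `TOMLDocument`. The `move_key` below is a hypothetical stand-in written for this illustration (the real helper lives in `hyrax.config_migrations.migration_utils` and is not shown in this diff), and the values in the `legacy` config are made up.

```python
from copy import deepcopy


def move_key(cfg: dict, old: str, new: str) -> None:
    """Hypothetical stand-in for hyrax's move_key: move a dotted key if it exists."""
    *old_parents, old_leaf = old.split(".")
    node = cfg
    for part in old_parents:
        node = node.get(part)
        if not isinstance(node, dict):
            return  # source path is missing, nothing to move
    if old_leaf not in node:
        return
    value = node.pop(old_leaf)

    *new_parents, new_leaf = new.split(".")
    target = cfg
    for part in new_parents:
        target = target.setdefault(part, {})
    target[new_leaf] = value


# Same mapping as the key_renames argument in the migration above.
KEY_RENAMES = {
    "umap.fit_sample_size": "reduce.umap.fit_sample_size",
    "umap.model_path": "reduce.umap.model_path",
    "umap.save_fit_umap": "reduce.save_fit_model",
    "umap.parallel": "reduce.parallel",
    "umap.UMAP": "reduce.umap.kwargs",
}


def migrate(cfg: dict) -> dict:
    """Dict-based sketch of move_umap_to_reduce; returns a migrated copy."""
    cfg = deepcopy(cfg)
    umap_tbl = cfg.get("umap")
    if not umap_tbl:
        return cfg  # no legacy section, config is left untouched

    legacy_name = umap_tbl.get("name")
    for old, new in KEY_RENAMES.items():
        move_key(cfg, old, new)

    reduce_tbl = cfg.setdefault("reduce", {})
    reduce_tbl.setdefault("umap", {})
    reduce_tbl["batch_size"] = 1024
    if legacy_name == "umap.UMAP":
        reduce_tbl["algorithm"] = "umap"

    del cfg["umap"]  # drop whatever is left of the legacy section

    # Defaults added for the new algorithms, as in the migration.
    reduce_tbl["tsne"] = {"kwargs": {"n_components": 2, "perplexity": 30.0}}
    reduce_tbl["pca"] = {
        "fit_sample_size": 1024,
        "model_path": False,
        "kwargs": {"n_components": 2},
    }
    return cfg


# Illustrative legacy config; the values are invented for this example.
legacy = {
    "umap": {
        "name": "umap.UMAP",
        "fit_sample_size": 512,
        "model_path": False,
        "save_fit_umap": True,
        "parallel": False,
        "UMAP": {"n_components": 2},
    }
}
migrated = migrate(legacy)
```

After the call, `migrated` has no top-level `umap` key, the old `[umap.UMAP]` keyword arguments sit under `reduce.umap.kwargs`, and `save_fit_umap` and `parallel` have moved up to the shared `reduce` level.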
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,167 @@ | ||
| import gc | ||
| import logging | ||
| import warnings | ||
| from argparse import ArgumentParser, Namespace | ||
| from pathlib import Path | ||
| from typing import Union | ||
|
|
||
| import numpy as np | ||
|
|
||
| from .verb_registry import Verb, hyrax_verb | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| @hyrax_verb | ||
| class ReduceDimensions(Verb): | ||
| """Verb to reduce the dimensionality of a dataset""" | ||
|
|
||
| # Use an attribute-friendly name so `hyrax.reduce_dimensions` resolves. | ||
| cli_name = "reduce_dimensions" | ||
| add_parser_kwargs = {} | ||
| description = "Reduce the dimensionality of a dataset using provided or default reduction algorithm." | ||
|
|
||
| @staticmethod | ||
| def setup_parser(parser: ArgumentParser): | ||
| """Set up the parser for the reduce_dimensions verb""" | ||
| parser.add_argument( | ||
| "-a", | ||
| "--algorithm", | ||
| type=str, | ||
| required=False, | ||
| help="Dimensionality reduction algorithm to use (default: umap).", | ||
| ) | ||
| parser.add_argument( | ||
| "-i", | ||
| "--input-dir", | ||
| type=str, | ||
| required=False, | ||
| help="Directory containing the dataset to reduce dimensions for.", | ||
| ) | ||
| parser.add_argument( | ||
| "-m", | ||
| "--model-path", | ||
| type=str, | ||
| required=False, | ||
| help="Path to a previously saved reducer model.", | ||
| ) | ||
|
|
||
| def run_cli(self, args: Namespace | None = None): | ||
| """CLI stub for ReduceDimensions verb""" | ||
| logger.info("`reduce_dimensions` run from CLI.") | ||
|
|
||
| if args is None: | ||
| raise RuntimeError("Run CLI called with no arguments.") | ||
|
|
||
| return self.run(algorithm=args.algorithm, input_dir=args.input_dir, model_path=args.model_path) | ||
|
|
||
| def run( | ||
| self, | ||
| algorithm: str | None = None, | ||
| input_dir: Union[Path, str] | None = None, | ||
| model_path: Union[Path, str] | None = None, | ||
| ): | ||
| """ | ||
| Run dimensionality reduction on a dataset | ||
|
|
||
| This method loads the latent space representations from an inference run and applies | ||
| the selected dimensionality reduction algorithm. | ||
|
|
||
| Algorithms that support reusable fitted models may either: | ||
|
|
||
| - fit a new model using a sampled subset of the data, or | ||
| - load an existing model if a model path is provided. | ||
|
|
||
| Algorithms without a separate fitting stage do not support model loading and | ||
| directly transform the input data. | ||
|
|
||
| The full dataset is then transformed into the target lower-dimensional space, | ||
| and the resulting embeddings are saved. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| algorithm : str, Optional | ||
| The dimensionality reduction algorithm to use. | ||
| If not specified, the method will look in the config for a default algorithm. | ||
|
|
||
| input_dir : str or Path, Optional | ||
| Directory containing the dataset to reduce dimensions for. | ||
|
|
||
| model_path : str or Path, Optional | ||
| Path to a previously saved reducer model. | ||
|
|
||
| Returns | ||
| ------- | ||
| Results dataset | ||
| The reduced representations are saved to disk, then loaded from the results | ||
| directory with ``load_results_dataset`` and returned. | ||
| """ | ||
| with warnings.catch_warnings(): | ||
| warnings.simplefilter(action="ignore", category=FutureWarning) | ||
| return self._run(algorithm, input_dir, model_path) | ||
|
|
||
| def _run( | ||
| self, algorithm: str | None, input_dir: Union[Path, str] | None, model_path: Union[Path, str] | None | ||
| ): | ||
| """See run()""" | ||
| from hyrax.config_utils import create_results_dir | ||
| from hyrax.datasets.result_factories import create_results_writer, load_results_dataset | ||
| from hyrax.verbs.reduction_algorithms.algorithm_registry import fetch_reducer_class | ||
|
|
||
| # Get reducer class | ||
| algorithm_name = algorithm or self.config["reduce"]["algorithm"] | ||
| reducer_cls = fetch_reducer_class(algorithm_name) | ||
|
|
||
| results_dir = create_results_dir(self.config, f"{algorithm_name}") | ||
| logger.info(f"Saving reduction results using {algorithm_name} to {results_dir}") | ||
| reduction_results = create_results_writer(results_dir) | ||
|
|
||
| algo_reducer = reducer_cls(self.config, reduction_results) | ||
|
|
||
| inference_results = load_results_dataset(self.config, results_dir=input_dir, verb="infer") | ||
| total_length = len(inference_results) | ||
|
|
||
| # Prepare a data sample, used either to fit a new model or to validate a loaded pre-trained model. | ||
| config_sample_size = self.config["reduce"][algorithm_name].get("fit_sample_size", None) | ||
| sample_size = int(np.min([config_sample_size if config_sample_size else np.inf, total_length])) | ||
| rng = np.random.default_rng() | ||
| sample_indexes = rng.choice(np.arange(total_length), size=sample_size, replace=False) | ||
| data_sample = np.asarray(inference_results[sample_indexes]).reshape((sample_size, -1)) | ||
|
|
||
| # Resolve the model path: an explicit argument wins, otherwise fall back to the | ||
| # path configured for the selected algorithm. If neither is set, fit a new model. | ||
| if model_path is None: | ||
| model_path = self.config["reduce"][algorithm_name].get("model_path", None) | ||
|
|
||
| if model_path: | ||
| logger.info(f"Loading pre-existing reducer model from {model_path}") | ||
| algo_reducer.load_model(data_sample.shape[1], model_path) | ||
| else: | ||
| logger.info("No model_path specified. A new model will be fitted.") | ||
| algo_reducer.fit(data_sample) | ||
|
|
||
| if self.config["reduce"].get("save_fit_model", False): | ||
| logger.info(f"Saving fitted {algorithm_name} reducer to result directory") | ||
| algo_reducer.save_model(results_dir) | ||
|
|
||
| del data_sample | ||
| gc.collect() | ||
|
|
||
| # Transform dataset | ||
| batch_size = self.config["reduce"]["batch_size"] | ||
| num_batches = int(np.ceil(total_length / batch_size)) | ||
|
|
||
| all_indexes = np.arange(0, total_length) | ||
| all_ids = np.array(inference_results.ids()) | ||
|
|
||
| args = ( | ||
| ( | ||
| all_ids[batch_indexes], | ||
| inference_results[batch_indexes].reshape(len(batch_indexes), -1), | ||
| ) | ||
| for batch_indexes in np.array_split(all_indexes, num_batches) | ||
| ) | ||
| algo_reducer.transform(args, num_batches) | ||
|
|
||
| logger.info(f"Finished transforming all data with {algorithm_name}") | ||
|
|
||
| return load_results_dataset(self.config, results_dir) |
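
The sample-size and batching arithmetic in `_run` can be checked in isolation. The sketch below is a pure-Python stand-in, not the code from the PR: `resolve_sample_size` and `split_indexes` are hypothetical helper names, and `split_indexes` mirrors how `np.array_split` divides an index range (the first `total % n` chunks get one extra element). The dataset length of 2500 is made up.

```python
import math


def resolve_sample_size(config_sample_size, total_length):
    """Cap the fit sample at the dataset length; a falsy setting means use everything."""
    if not config_sample_size:
        return total_length
    return min(int(config_sample_size), total_length)


def split_indexes(total_length, num_batches):
    """Pure-Python stand-in for np.array_split(np.arange(total_length), num_batches)."""
    base, extra = divmod(total_length, num_batches)
    batches, start = [], 0
    for i in range(num_batches):
        # The first `extra` batches take one additional element.
        size = base + (1 if i < extra else 0)
        batches.append(list(range(start, start + size)))
        start += size
    return batches


total_length = 2500  # illustrative dataset length
batch_size = 1024    # same value the migration writes to reduce.batch_size

num_batches = math.ceil(total_length / batch_size)
batches = split_indexes(total_length, num_batches)
batch_lengths = [len(b) for b in batches]

print(resolve_sample_size(1024, total_length))  # 1024
print(num_batches, batch_lengths)               # 3 [834, 833, 833]
```

Because the number of batches is computed first and the indexes are then split evenly, batches come out nearly equal in size and never exceed `batch_size`, rather than being two full batches of 1024 followed by a short remainder.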
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| # Remove import sorting, these are imported in the order written so that | ||
| # autoapi docs are generated with ordering controlled below. | ||
| # ruff: noqa: I001 | ||
| from .algorithm_registry import ReductionAlgorithm | ||
| from .umap import UMAP | ||
| from .pca import PCA | ||
| from .tsne import TSNE | ||
|
|
||
| __all__ = [ | ||
| "ReductionAlgorithm", | ||
| "UMAP", | ||
| "PCA", | ||
| "TSNE", | ||
| ] |
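
The verb looks up its reducer with `fetch_reducer_class(algorithm_name)` and then calls `fit`, `load_model`, `save_model` and `transform` on the instance. The `algorithm_registry` module itself is not part of the visible diff, so the following is only a sketch of one way such a name-to-class registry could work. The `register_reducer` decorator, the `_REGISTRY` dict, and the stub class bodies are assumptions made for illustration; only the method names and the `(config, results_writer)` constructor arguments are taken from how the verb uses them.

```python
from typing import Callable

# Hypothetical module-level registry mapping algorithm names to classes.
_REGISTRY: dict[str, type] = {}


def register_reducer(name: str) -> Callable[[type], type]:
    """Hypothetical decorator that registers a reducer class under a name."""

    def decorator(cls: type) -> type:
        _REGISTRY[name.lower()] = cls
        return cls

    return decorator


def fetch_reducer_class(name: str) -> type:
    """Return the class registered under `name`, or raise a descriptive error."""
    try:
        return _REGISTRY[name.lower()]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"Unknown reduction algorithm {name!r}; known: {known}") from None


class ReductionAlgorithm:
    """Base-class sketch exposing the methods the verb calls."""

    def __init__(self, config, results_writer):
        self.config = config
        self.results_writer = results_writer

    def fit(self, data_sample):
        raise NotImplementedError

    def load_model(self, n_features, model_path):
        raise NotImplementedError

    def save_model(self, results_dir):
        raise NotImplementedError

    def transform(self, batches, num_batches):
        raise NotImplementedError


@register_reducer("umap")
class UMAP(ReductionAlgorithm):
    """Stub for illustration only."""


@register_reducer("pca")
class PCA(ReductionAlgorithm):
    """Stub for illustration only."""


reducer_cls = fetch_reducer_class("pca")
reducer = reducer_cls({"reduce": {"algorithm": "pca"}}, results_writer=None)
```

With a registry of this shape, adding an algorithm means defining one subclass, and the verb's `_run` needs no per-algorithm branching.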