Recipe wish list #30

@niranjchandrasekaran

Description

Since the profiling-recipe was finalized for JUMP, the number of people who have interacted with the recipe has increased dramatically (JUMP members and the image analysts). Given that the recipe was repeatedly written and rewritten to satisfy the needs of the JUMP pilot experiments, I am surprised that it is robust and has, so far, not failed catastrophically. That doesn't mean the code is perfect. It needs a lot of work, particularly in

  • code documentation
  • coding consistency so that it doesn't look like a frankencode

I will work on tidying up the code so that it is easier for others to read and contribute to the codebase.

Apart from the above, other changes also need to be made to the recipe because there have been several feature requests, both from before and after the version for JUMP was frozen. These requests span a spectrum, from requiring minor changes to the recipe to requiring major changes to both the recipe and pycytominer.

I have listed all the feature requests below, with some comments and a score for how easy or difficult it will be to implement each one (1 is easy and requires the least amount of time; 5 is difficult and requires the most amount of time).

Add feature analysis
Difficulty: 3

  • Shantanu has written a script to visualize how different categories of features vary for perturbations and DMSO on a given plate. The script is written in R and will need to be rewritten in Python.

Sample images
Difficulty: 3

  • Shantanu has written a script to generate thumbnail montages of perturbations. Creating a montage for each well in a plate while running the workflow is valuable, as it will help us answer our most asked question - what do the cells look like?

Calculate Replicate correlation and Percent Replicating
Difficulty: 2

  • We currently calculate the correlation between every pair of wells during the quality control > heatmap step of the recipe. In order to calculate replicate correlation and Percent Replicating, the recipe would need to know which metadata column identifies the replicates. This could be added to the quality_control step.

Beth mentioned this in #29
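As a minimal sketch of what this computation could look like - assuming profiles are loaded into a pandas DataFrame and the replicate-identifying metadata column is supplied via the config (all names here are illustrative, not the recipe's actual API):

```python
import numpy as np
import pandas as pd

def replicate_correlation(profiles, replicate_col, feature_cols):
    """Median pairwise Pearson correlation among the wells in each
    replicate group, identified by a metadata column."""
    scores = {}
    for group, df in profiles.groupby(replicate_col):
        if len(df) < 2:
            continue  # a single well has no replicate pairs
        corr = np.corrcoef(df[feature_cols].to_numpy())
        upper = corr[np.triu_indices_from(corr, k=1)]
        scores[group] = float(np.median(upper))
    return scores

def percent_replicating(scores, null_scores, percentile=95):
    """Fraction of replicate groups whose score exceeds the given
    percentile of a null distribution (e.g. non-replicate pairs)."""
    threshold = np.percentile(null_scores, percentile)
    passed = sum(s > threshold for s in scores.values())
    return passed / len(scores)
```

The null distribution would come from correlations between non-replicate well pairs, which the heatmap step already computes.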

Rename quality_control
Difficulty: 1

  • This block was named so because we wanted to generate plots that would tell us if there is something wrong with a plate. Now that I want to include other plots and analyses, I think we should rename this block. I don't have a new name yet, but I will think of one once we decide on all the new plots and analyses that will go under this block.

Adding second order features
Difficulty: 5

  • We wanted to include this for JUMP, but due to the lack of time, we decided not to. IIUC, this would require changes to pycytominer which would mean it won't be easy to implement.

Adding dispersion order features
Difficulty: 2

  • We wanted to include this for JUMP, but we ran out of time.

Adding replicate correlation feature selection as an option
Difficulty: 4

Adding git, aws cli to the conda environment
Difficulty: 1
Given that all the packages are installed using conda, it makes sense to add git and the AWS CLI via conda as well. This is particularly helpful with EC2 instances that ship outdated versions of git.

Set summary -> perform false
Difficulty: 1
I realized that not all scopes generate load_data_csv files, which are required for the summary file to be generated. Hence, the default value of perform in the config file should be false.

Automatically create the plate information in the config files
Difficulty: 4
One of the most cumbersome tasks while running the recipe is to specify the names of all the batches and plates in the config file. If a user wants to run all the plates using a single config file, this information is already available in the barcode_platemap.csv file and could be added automatically to the config file. But the tricky part is making the script generic such that it can satisfy most users' needs.
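As a sketch, assuming the barcode_platemap.csv carries an Assay_Plate_Barcode column (the exact header is an assumption here and may differ between projects), the plate list could be pulled out like this before rendering the config:

```python
import csv

def plates_from_barcode_platemap(path):
    """Collect plate names from a barcode_platemap.csv.

    Assumes a column named 'Assay_Plate_Barcode' (hypothetical;
    adjust to the header actually used in your platemap files).
    """
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        return [row["Assay_Plate_Barcode"] for row in reader]
```

The hard part the text mentions - making this generic - would live in how the resulting list is templated into each user's config, not in the parsing itself.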

Replace find and rsync steps
Difficulty: 2
Currently, these two steps are necessary when aggregation is performed outside the recipe. These two steps compress the well-level aggregated profiles and then copy them to the profiles folder. This could be implemented in the recipe, saving the user the hassle of running these two steps.
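A rough sketch of folding these two steps into the recipe, using Python's gzip and shutil in place of find and rsync (the directory layout and file pattern are assumptions for illustration):

```python
import gzip
import shutil
from pathlib import Path

def compress_and_copy(source_dir, profiles_dir, pattern="*.csv"):
    """Gzip well-level aggregated profiles and place them in the
    profiles folder, replacing the external find + rsync steps."""
    profiles_dir = Path(profiles_dir)
    profiles_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in Path(source_dir).rglob(pattern):
        dest = profiles_dir / (src.name + ".gz")
        with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
            shutil.copyfileobj(fin, fout)  # stream, no full read into memory
        written.append(dest)
    return written
```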

Remove features = infer from normalize and feature select
Difficulty: 1
This option exists so that the user can input their own list of features instead of letting pycytominer infer the features from the profiles. I don't see any user entering thousands of feature names in the config file. I will remove this option from the config file; if users want to use their own set of features, they can call pycytominer from their own script.
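For context, inference boils down to partitioning columns by prefix. This simplified sketch treats every non-Metadata_ column as a feature; pycytominer's actual inference keys on compartment prefixes such as Cells_, Cytoplasm_ and Nuclei_:

```python
def infer_features(columns, metadata_prefix="Metadata_"):
    """Simplified sketch of feature inference: every column that does
    not carry the metadata prefix is treated as a feature."""
    return [c for c in columns if not c.startswith(metadata_prefix)]
```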

Profile annotation at the plate level
Difficulty: 3
When multiple types of plates (treatment, control, etc.) are run in a single batch, each type of plate needs a different config file because the external_metadata file is specified once for all the plates in a config file. Allowing the user to set the name of the external_metadata file at the plate level will let them run multiple types of plates in multiple batches using the same config file.

Setting site name at the plate/batch level
Difficulty: 3
Currently, the fields of view to aggregate have to be the same for all plates in a config file. If set at the plate level, multiple plates with different FoVs to aggregate could be run together.
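Both this and the plate-level annotation request above come down to the same mechanism: resolving a setting with overrides. One way to support that, sketched with a hypothetical config layout, is a lookup that falls back from plate level to batch level to the top-level default:

```python
def resolve_option(config, key, batch=None, plate=None):
    """Resolve a config option, preferring a plate-level override,
    then a batch-level override, then the top-level default.
    The 'plates'/'batches' sub-dict layout is hypothetical."""
    plates = config.get("plates", {})
    batches = config.get("batches", {})
    if plate is not None and key in plates.get(plate, {}):
        return plates[plate][key]
    if batch is not None and key in batches.get(batch, {}):
        return batches[batch][key]
    return config[key]
```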

Setting input and output file names for each block
Difficulty: 4
The operations (aggregation, annotation, normalization and feature selection) run in a predetermined order because the output of one operation is the input of the next. By specifying the names of the input and output files, it will be possible to run the operations in any order. Until we move over to a more powerful WDL-like setup for running the workflow, this would provide the functionality of running operations in any order. This would also allow adding new annotations to profiles without rerunning normalization and feature selection, which was requested by Anne.

Greg mentions this in #13

Here is some more context for the linear execution strategy - #11

Make the normalize block more general
Difficulty: 3
Currently, each type of normalization (whole plate and negcon) requires a different block (normalize and normalize_negcon). If the input and output names can be specified, only a single type of block will be needed. The block will have a parameter to specify which type of normalization to perform (whole plate or negcon).
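A sketch of what the single, general block could look like, with a samples parameter choosing the reference population (names are illustrative; pycytominer's normalize function exposes similar options, so the recipe block would mostly pass this through):

```python
import pandas as pd

def normalize(profiles, feature_cols, samples="all", control_query=None):
    """Standardize features against either the whole plate
    (samples='all') or the negative controls (samples='negcon',
    selected by a pandas query string). Sketch only."""
    if samples == "negcon":
        reference = profiles.query(control_query)
    else:
        reference = profiles
    mean = reference[feature_cols].mean()
    std = reference[feature_cols].std(ddof=1)
    out = profiles.copy()
    out[feature_cols] = (profiles[feature_cols] - mean) / std
    return out
```

With this shape, whole-plate and negcon normalization differ only in the config parameters handed to one block, not in which block runs.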

Combining collate.py and the recipe
Difficulty: 5
The recipe would greatly benefit from merging with collate.py because it could use collate.py's ability to run in parallel. collate.py might also benefit from the recipe because it will have a home :) and the user will be able to interact with it through the config file instead of the command line. Also, the recipe and collate.py call the same pycytominer function, so it makes sense for the two to be merged.

Create directories as part of a recipe step
Difficulty: 1
#8

Include consensus building as a recipe step
Difficulty: 2
#14

Now that you have made it through the list, there are a few questions that need to be answered:

  • Who will implement these features? I can implement some of them, but I won't have the time to implement all of them.
  • Is anyone interested in contributing to the recipe?
  • Are there other feature requests? I have captured all of Nasim's suggestions, but the other image analysts may have other feature requests.

Labels: enhancement (New feature or request)