Phenome Outlier Analysis is a Python package that provides advanced tools for detecting outliers in phenome datasets. It supports both context-specific and global outlier detection strategies, allowing you to identify anomalies within different segments of your data or across the entire dataset.
Key features include:
- Multiple Normalization Methods: Supports Double Median Absolute Deviation (Double MAD) and Z-score.
- Context-Specific Detection: Allows outlier detection within defined segments or contexts (e.g., based on demographic variables like sex, age group).
- Global Detection: Performs outlier detection across the entire dataset without segmentation.
- Flexible Context Generation: Offers options to analyze individual context categories, combinations of context categories, or both.
- Unified Output: Returns a consolidated DataFrame containing original values, normalized values, and outlier flags for ease of analysis.
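To make the Double MAD option more concrete, here is a minimal NumPy sketch of the general technique, not the package's own implementation. The function name `double_mad_normalize` and the `scale` parameter are hypothetical. Double MAD computes one median absolute deviation for values at or below the median and another for values at or above it, so skewed data is scaled separately on each side.

```python
import numpy as np

def double_mad_normalize(values, scale=1.0):
    """Scale deviations from the median by a side-specific MAD.

    Hypothetical helper for illustration only. Pass scale=1.4826 to make
    the MAD comparable to a standard deviation under normality.
    """
    x = np.asarray(values, dtype=float)
    med = np.nanmedian(x)
    dev = x - med
    # One MAD for the lower half, one for the upper half of the data
    left_mad = np.nanmedian(np.abs(dev[x <= med]))
    right_mad = np.nanmedian(np.abs(dev[x >= med]))
    if left_mad == 0 or right_mad == 0:
        raise ValueError("MAD is zero on one side; too many tied values")
    # Pick the MAD that matches the side of the median each value falls on
    mad = np.where(x <= med, left_mad, right_mad)
    return dev / (scale * mad)

values = [1, 2, 3, 4, 10, 20, 30]  # right-skewed toy data
normalized = double_mad_normalize(values)
print(np.round(normalized, 3))
```

Because the right-hand values are more spread out than the left-hand ones, they are divided by a larger MAD, which keeps a long right tail from being flagged wholesale.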
You can install the Phenome Outlier Analysis package directly from the Git repository using pip:

    pip install git+https://git.phenome.health/trent.leslie/phenome-outlier-analysis.git

This command clones the repository and installs the package along with its dependencies, as specified in the `setup.py` file.

Note: Ensure you have git installed on your system to use the pip install git+ command.
To use the module, start by importing the `OutlierDetector` class from the installed package and initializing it with your DataFrame and the relevant columns.

    import pandas as pd
    from phenome_outlier_analysis import OutlierDetector

    # Load your dataset
    df = pd.read_csv('your_dataset.csv')

    # Define the columns to analyze for outliers
    analyte_columns = ['analyte1', 'analyte2', 'analyte3']

    # Optionally, define columns to segment the data (e.g., 'Sex', 'Age_Group')
    segment_columns = ['Sex', 'Age_Group']

    # Specify the subject ID column
    subject_id_column = 'Participant_ID'

    # Initialize the OutlierDetector
    detector = OutlierDetector(
        df,
        analyte_columns,
        segment_columns=segment_columns,
        subject_id_column=subject_id_column,
        context_combination='combination'  # Options: 'individual', 'combination', 'both'
    )

You can perform outlier detection using the `perform_outlier_detection` method. This method lets you specify the normalization method, whether to apply a logarithmic transformation, and the percentile cutoffs for defining outliers.

    # Perform outlier detection
    results = detector.perform_outlier_detection(
        lower_percentile=0.01,
        upper_percentile=0.99,
        method='double_mad',  # Options: 'double_mad', 'zscore', 'iqr'
        take_log=False
    )

To access all detected outliers in a consolidated DataFrame, use the `get_all_outliers` method.

    # Get all outliers
    outliers_df = results.get_all_outliers()

You can also access context-specific results:

    # Access context-specific results
    context_results = results.context_results  # List of OutlierResult objects

    # For example, get outliers for a specific context
    for context_result in context_results:
        if context_result.context == 'Sex=Male, Age_Group=30-40':
            male_30_40_outliers = context_result.get_outliers()
            break

You can control how contexts are generated using the `context_combination` parameter:

    # Analyze individual context categories only
    detector = OutlierDetector(
        df,
        analyte_columns,
        segment_columns=['Sex', 'Age_Group'],
        subject_id_column='Participant_ID',
        context_combination='individual'
    )

    # Perform outlier detection
    results = detector.perform_outlier_detection(method='double_mad')

    # Get all outliers
    outliers_df = results.get_all_outliers()

The output is a consolidated DataFrame containing all detected outliers with the following columns:

- subject_id_column: Identifier for the subject or observation (e.g., 'Participant_ID').
- Segment Columns: Each of the segment columns used (e.g., 'Sex', 'Age_Group').
- analyte: The analyte or variable for which the outlier was detected.
- is_outlier: A boolean flag indicating whether the observation is an outlier.
- analyte_normalized: The normalized value of the analyte (after applying the normalization method).
- context: The context or segment in which the outlier was detected (e.g., 'Sex=Male, Age_Group=30-40').
- detection_type: Indicates whether the outlier was detected in a context-specific analysis or globally ('Context-specific' or 'Global').

Example output:

       Participant_ID     Sex  Age_Group   analyte  is_outlier  analyte_normalized  \
    0               1    Male      20-30  Analyte1        True               5.237
    1               2  Female      30-40  Analyte2        True              -4.986
    2               3    Male      40-50  Analyte3        True               4.893

                           context    detection_type
    0    Sex=Male, Age_Group=20-30  Context-specific
    1  Sex=Female, Age_Group=30-40  Context-specific
    2                       global            Global

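Because the consolidated output is an ordinary pandas DataFrame, the usual filtering and grouping tools apply. The sketch below rebuilds the three example rows shown above by hand (rather than calling the package) and uses only the column names documented in this README; the variable names `context_only`, `counts`, and `most_extreme` are illustrative.

```python
import pandas as pd

# Hand-built copy of the example output; in practice this would come
# from results.get_all_outliers()
outliers_df = pd.DataFrame({
    "Participant_ID": [1, 2, 3],
    "Sex": ["Male", "Female", "Male"],
    "Age_Group": ["20-30", "30-40", "40-50"],
    "analyte": ["Analyte1", "Analyte2", "Analyte3"],
    "is_outlier": [True, True, True],
    "analyte_normalized": [5.237, -4.986, 4.893],
    "context": [
        "Sex=Male, Age_Group=20-30",
        "Sex=Female, Age_Group=30-40",
        "global",
    ],
    "detection_type": ["Context-specific", "Context-specific", "Global"],
})

# Keep only outliers found in a context-specific analysis
context_only = outliers_df[outliers_df["detection_type"] == "Context-specific"]

# Count flagged observations per detection type
counts = outliers_df.groupby("detection_type")["is_outlier"].sum()

# Find the row with the largest absolute normalized value
most_extreme = outliers_df.loc[outliers_df["analyte_normalized"].abs().idxmax()]

print(context_only[["Participant_ID", "analyte", "context"]])
print(counts)
print(most_extreme["Participant_ID"], most_extreme["analyte"])
```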
The `OutlierDetector` class is initialized as follows:

    detector = OutlierDetector(
        df,
        analyte_columns,
        segment_columns=['Sex'],
        subject_id_column='Participant_ID',
        context_combination='combination'
    )

Parameters:

- df (`pd.DataFrame`): The input DataFrame.
- analyte_columns (`list`): List of column names to analyze for outliers.
- segment_columns (`list`, optional): List of columns to segment the data.
- subject_id_column (`str`, optional): Name of the column representing subject IDs. Default is `'subject_id'`.
- context_combination (`str`, optional): How to generate contexts. Options are `'individual'`, `'combination'`, or `'both'`. Default is `'combination'`.

`OutlierDetector` methods:

- `perform_outlier_detection(lower_percentile=0.01, upper_percentile=0.99, method='double_mad', take_log=False, n_jobs=1)`: Performs outlier detection and returns an `OutlierDetectionResults` object.
  - lower_percentile (`float`): Lower percentile cutoff. Default is `0.01` (1st percentile).
  - upper_percentile (`float`): Upper percentile cutoff. Default is `0.99` (99th percentile).
  - method (`str`): Normalization method. Options are `'double_mad'`, `'zscore'`, or `'iqr'`.
  - take_log (`bool`): Whether to apply a logarithmic transformation to the data before normalization.
  - n_jobs (`int`): Number of CPU cores to use for parallel processing. Use `-1` to use all available cores.
- `get_all_outliers(lower_percentile=0.01, upper_percentile=0.99, method='double_mad', take_log=False, n_jobs=1)`: Convenience method that performs outlier detection and returns a consolidated DataFrame of outliers.

`OutlierDetectionResults` attributes and methods:

- context_results (`list`): List of `OutlierResult` objects, one for each context.
- global_result (`OutlierResult`): `OutlierResult` object for the global analysis.
- `get_all_outliers()`: Returns a consolidated DataFrame of all outliers across contexts and global detection.

`OutlierResult` attributes and methods:

- context (`str`): The context name.
- normalized_df (`pd.DataFrame`): DataFrame containing original data and normalized values.
- binary_matrix (`pd.DataFrame`): DataFrame indicating outlier flags (`True`/`False`).
- lower_cutoff (`float`): Lower cutoff value used for outlier detection.
- upper_cutoff (`float`): Upper cutoff value used for outlier detection.
- analyte_columns (`list`): List of analyte columns analyzed.
- segment_columns (`list`): List of segment columns used for context segmentation.
- subject_id_column (`str`): Name of the column representing subject IDs.
- `get_outliers()`: Returns a DataFrame of outliers within the context.

## Notes
### Binary Matrix Values
The binary matrix uses boolean values (`True`/`False`) to indicate whether an observation is an outlier. If you need to perform numerical computations, you can convert the boolean values to integers (0/1). Here's how you can do this:
```python
# Convert binary matrix columns to integers
binary_matrix_int = results.global_result.binary_matrix.copy()
for col in analyte_columns:
    binary_matrix_int[col] = binary_matrix_int[col].astype(int)

# Now, binary_matrix_int has 0/1 instead of True/False
print(binary_matrix_int.head())
```

The context_combination parameter controls how contexts are generated for context-specific outlier detection:
- 'individual': Analyzes each context category separately.
- 'combination': Analyzes all possible combinations of context values (default).
- 'both': Includes both individual categories and their combinations.
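Consider performance when choosing the context combination option: the number of combined contexts grows multiplicatively with the number of categories in each segment column, so `'combination'` and `'both'` can be slow on datasets with many context categories.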
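
The three options can be sketched with plain pandas `groupby`. This is an illustration under stated assumptions, not the package's internal code: the helper `build_contexts` is hypothetical, it only enumerates combinations that actually occur in the data, and it reuses the `'Sex=Male, Age_Group=30-40'` label format shown earlier in this README.

```python
import pandas as pd

def build_contexts(df, segment_columns, context_combination="combination"):
    """Map each context label to the row index belonging to it.

    Hypothetical helper for illustration only.
    """
    if context_combination not in ("individual", "combination", "both"):
        raise ValueError(f"Unknown option: {context_combination!r}")

    contexts = {}
    if context_combination in ("individual", "both"):
        # One context per category of each segment column
        for col in segment_columns:
            for value, group in df.groupby(col):
                contexts[f"{col}={value}"] = group.index
    if context_combination in ("combination", "both"):
        # One context per observed combination across all segment columns
        for values, group in df.groupby(list(segment_columns)):
            if not isinstance(values, tuple):
                values = (values,)  # older pandas returns a scalar for one column
            label = ", ".join(f"{c}={v}" for c, v in zip(segment_columns, values))
            contexts[label] = group.index
    return contexts

df = pd.DataFrame({
    "Sex": ["Male", "Male", "Female", "Female"],
    "Age_Group": ["20-30", "30-40", "20-30", "20-30"],
})
individual = build_contexts(df, ["Sex", "Age_Group"], "individual")
combination = build_contexts(df, ["Sex", "Age_Group"], "combination")
both = build_contexts(df, ["Sex", "Age_Group"], "both")
print(sorted(both))
```

Even in this toy dataset, `'both'` produces more contexts than either of the other options, which is where the performance cost comes from.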
Contributions to the Phenome Outlier Analysis package are welcome! Here's how you can contribute:
- Fork the repository on GitHub.
- Create a new branch for your feature or bug fix.
- Write tests for your changes.
- Implement your feature or bug fix.
- Run the test suite to ensure all tests pass.
- Submit a pull request with a clear description of your changes.
Please ensure your code adheres to the existing style conventions and includes appropriate tests.
This project is licensed under the MIT License. See the LICENSE file for details.
For questions, issues, or feature requests, please use the GitHub issue tracker.
Happy analyzing!