arpanauts/phenome-outlier-analysis

Phenome Outlier Analysis

Overview

Phenome Outlier Analysis is a Python package that provides advanced tools for detecting outliers in phenome datasets. It supports both context-specific and global outlier detection strategies, allowing you to identify anomalies within different segments of your data or across the entire dataset.

Key features include:

  • Multiple Normalization Methods: Supports Double Median Absolute Deviation (Double MAD), Z-score, and IQR-based normalization.
  • Context-Specific Detection: Allows outlier detection within defined segments or contexts (e.g., based on demographic variables like sex, age group).
  • Global Detection: Performs outlier detection across the entire dataset without segmentation.
  • Flexible Context Generation: Offers options to analyze individual context categories, combinations of context categories, or both.
  • Unified Output: Returns a consolidated DataFrame containing original values, normalized values, and outlier flags for ease of analysis.
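For readers unfamiliar with Double MAD: it scales deviations below and above the median by separate MAD estimates, which copes with skewed distributions better than a single MAD. A minimal sketch of the idea (illustrative only, not necessarily the package's exact implementation):

```python
import numpy as np

def double_mad_scores(x):
    """Score each value by its deviation from the median, scaled by
    a separate MAD for the lower and upper halves of the data."""
    x = np.asarray(x, dtype=float)
    m = np.median(x)
    left_mad = np.median(np.abs(x[x <= m] - m))    # spread below the median
    right_mad = np.median(np.abs(x[x >= m] - m))   # spread above the median
    # Guard against a zero MAD (e.g. a constant half of the data)
    left_mad = left_mad or 1.0
    right_mad = right_mad or 1.0
    return (x - m) / np.where(x <= m, left_mad, right_mad)

print(double_mad_scores([1.0, 2.0, 3.0, 4.0, 100.0]))
# The extreme value 100.0 receives a very large score
```

Values with large absolute scores are then flagged against the percentile cutoffs described below.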

Table of Contents

  • Installation
  • Usage
  • Output Structure
  • API Reference
  • Notes
  • Contributing
  • License

Installation

You can install the Phenome Outlier Analysis package directly from the Git repository using pip:

pip install git+https://git.phenome.health/trent.leslie/phenome-outlier-analysis.git

This command will clone the repository and install the package along with its dependencies as specified in the setup.py file.

Note: Ensure you have git installed on your system to use the pip install git+ command.

Usage

Initialization

To use the module, start by importing the OutlierDetector class from the installed package and initializing it with your DataFrame and the relevant columns.

import pandas as pd
from phenome_outlier_analysis import OutlierDetector

# Load your dataset
df = pd.read_csv('your_dataset.csv')

# Define the columns to analyze for outliers
analyte_columns = ['analyte1', 'analyte2', 'analyte3']

# Optionally, define columns to segment the data (e.g., 'Sex', 'Age_Group')
segment_columns = ['Sex', 'Age_Group']

# Specify the subject ID column
subject_id_column = 'Participant_ID'

# Initialize the OutlierDetector
detector = OutlierDetector(
    df,
    analyte_columns,
    segment_columns=segment_columns,
    subject_id_column=subject_id_column,
    context_combination='combination'  # Options: 'individual', 'combination', 'both'
)

Performing Outlier Detection

You can perform outlier detection using the perform_outlier_detection method. This method allows you to specify the normalization method, whether to apply a logarithmic transformation, and the percentile cutoffs for defining outliers.

# Perform outlier detection
results = detector.perform_outlier_detection(
    lower_percentile=0.01,
    upper_percentile=0.99,
    method='double_mad',  # Options: 'double_mad', 'zscore', 'iqr'
    take_log=False
)
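The percentile cutoffs define, on the normalized scale, the band outside of which values are flagged. A rough sketch of that idea on simulated scores (illustrative only; the package's internals may differ):

```python
import numpy as np

# Simulated normalized scores (stand-in for a real analyte column)
rng = np.random.default_rng(0)
normalized = rng.normal(size=1000)

# Convert percentile cutoffs into concrete thresholds
lower_cutoff = np.quantile(normalized, 0.01)
upper_cutoff = np.quantile(normalized, 0.99)

# Flag values falling outside the [1st, 99th] percentile band
is_outlier = (normalized < lower_cutoff) | (normalized > upper_cutoff)
print(int(is_outlier.sum()))  # roughly 2% of 1000 values
```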

Accessing Results

To access all detected outliers in a consolidated DataFrame, use the get_all_outliers method.

# Get all outliers
outliers_df = results.get_all_outliers()

You can also access context-specific results:

# Access context-specific results
context_results = results.context_results  # List of OutlierResult objects

# For example, get outliers for a specific context
for context_result in context_results:
    if context_result.context == 'Sex=Male, Age_Group=30-40':
        male_30_40_outliers = context_result.get_outliers()
        break

Controlling Context Combinations

You can control how contexts are generated using the context_combination parameter:

# Analyze individual context categories only
detector = OutlierDetector(
    df,
    analyte_columns,
    segment_columns=['Sex', 'Age_Group'],
    subject_id_column='Participant_ID',
    context_combination='individual'
)

# Perform outlier detection
results = detector.perform_outlier_detection(method='double_mad')

# Get all outliers
outliers_df = results.get_all_outliers()

Output Structure

The output is a consolidated DataFrame containing all detected outliers with the following columns:

  • subject_id_column: Identifier for the subject or observation (e.g., 'Participant_ID').
  • Segment Columns: Each of the segment columns used (e.g., 'Sex', 'Age_Group').
  • analyte: The analyte or variable for which the outlier was detected.
  • is_outlier: A boolean flag indicating whether the observation is an outlier.
  • analyte_normalized: The normalized value of the analyte (after applying the normalization method).
  • context: The context or segment in which the outlier was detected (e.g., 'Sex=Male, Age_Group=30-40').
  • detection_type: Indicates whether the outlier was detected in a context-specific analysis or globally ('Context-specific' or 'Global').
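Given that structure, ordinary pandas operations apply directly. For instance, counting outliers per analyte and detection type (shown here on a small hand-built DataFrame that mimics the documented columns, not real package output):

```python
import pandas as pd

# Hand-built stand-in for the consolidated output described above
outliers_df = pd.DataFrame({
    'Participant_ID': [1, 2, 3],
    'Sex': ['Male', 'Female', 'Male'],
    'Age_Group': ['20-30', '30-40', '40-50'],
    'analyte': ['Analyte1', 'Analyte2', 'Analyte1'],
    'is_outlier': [True, True, True],
    'analyte_normalized': [5.237, -4.986, 4.893],
    'context': ['Sex=Male, Age_Group=20-30',
                'Sex=Female, Age_Group=30-40', 'global'],
    'detection_type': ['Context-specific', 'Context-specific', 'Global'],
})

# Outlier counts per analyte and detection type
counts = outliers_df.groupby(['analyte', 'detection_type']).size()
print(counts)
```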

Example Output

   Participant_ID     Sex Age_Group   analyte  is_outlier  analyte_normalized                      context    detection_type
0               1    Male     20-30  Analyte1        True               5.237    Sex=Male, Age_Group=20-30  Context-specific
1               2  Female     30-40  Analyte2        True              -4.986  Sex=Female, Age_Group=30-40  Context-specific
2               3    Male     40-50  Analyte3        True               4.893                       global            Global

API Reference

OutlierDetector

Initialization

detector = OutlierDetector(
    df,
    analyte_columns,
    segment_columns=['Sex'],
    subject_id_column='Participant_ID',
    context_combination='combination'
)
  • df (pd.DataFrame): The input DataFrame.
  • analyte_columns (list): List of column names to analyze for outliers.
  • segment_columns (list, optional): List of columns to segment the data.
  • subject_id_column (str, optional): Name of the column representing subject IDs. Default is 'subject_id'.
  • context_combination (str, optional): How to generate contexts. Options are 'individual', 'combination', or 'both'. Default is 'combination'.

Methods

  • perform_outlier_detection(lower_percentile=0.01, upper_percentile=0.99, method='double_mad', take_log=False, n_jobs=1)

    Performs outlier detection and returns an OutlierDetectionResults object.

    • lower_percentile (float): Lower percentile cutoff. Default is 0.01 (1st percentile).
    • upper_percentile (float): Upper percentile cutoff. Default is 0.99 (99th percentile).
    • method (str): Normalization method. Options are 'double_mad', 'zscore', or 'iqr'.
    • take_log (bool): Whether to apply a logarithmic transformation to the data before normalization.
    • n_jobs (int): Number of CPU cores to use for parallel processing. Use -1 to use all available cores.
  • get_all_outliers(lower_percentile=0.01, upper_percentile=0.99, method='double_mad', take_log=False, n_jobs=1)

    Convenience method that performs outlier detection and returns a consolidated DataFrame of outliers.

OutlierDetectionResults

Attributes

  • context_results (list): List of OutlierResult objects for each context.
  • global_result (OutlierResult): OutlierResult object for the global analysis.

Methods

  • get_all_outliers()

    Returns a consolidated DataFrame of all outliers across contexts and global detection.

OutlierResult

Attributes

  • context (str): The context name.
  • normalized_df (pd.DataFrame): DataFrame containing original data and normalized values.
  • binary_matrix (pd.DataFrame): DataFrame indicating outlier flags (True/False).
  • lower_cutoff (float): Lower cutoff value used for outlier detection.
  • upper_cutoff (float): Upper cutoff value used for outlier detection.
  • analyte_columns (list): List of analyte columns analyzed.
  • segment_columns (list): List of segment columns used for context segmentation.
  • subject_id_column (str): Name of the column representing subject IDs.

Methods

  • get_outliers()

    Returns a DataFrame of outliers within the context.


Notes

Binary Matrix Values

The binary matrix uses boolean values (True/False) to indicate whether an observation is an outlier. If you need to perform numerical computations, you can convert the boolean values to integers (0/1):

# Convert binary matrix columns to integers
binary_matrix_int = results.global_result.binary_matrix.copy()
for col in analyte_columns:
    binary_matrix_int[col] = binary_matrix_int[col].astype(int)

# Now, binary_matrix_int has 0/1 instead of True/False
print(binary_matrix_int.head())

Context Combination Options

The context_combination parameter controls how contexts are generated for context-specific outlier detection:

  • 'individual': Analyzes each context category separately.
  • 'combination': Analyzes all possible combinations of context values (default).
  • 'both': Includes both individual categories and their combinations.
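How contexts might be enumerated under each option can be sketched as follows (an illustrative reconstruction; the function name and signature are hypothetical, not the package's API):

```python
import itertools

def generate_contexts(unique_values, segment_columns, mode):
    """Sketch of context generation. unique_values maps column -> unique values.
    'individual' yields one context per (column, value) pair;
    'combination' yields the cross product of all columns' values;
    'both' yields the union of the two."""
    individual = [((col, v),) for col in segment_columns
                  for v in unique_values[col]]
    combination = [
        tuple(zip(segment_columns, combo))
        for combo in itertools.product(
            *(unique_values[c] for c in segment_columns))
    ]
    if mode == 'individual':
        return individual
    if mode == 'combination':
        return combination
    return individual + combination  # 'both'

vals = {'Sex': ['Male', 'Female'], 'Age_Group': ['20-30', '30-40']}
print(generate_contexts(vals, ['Sex', 'Age_Group'], 'combination'))
# 2 x 2 = 4 combined contexts such as (('Sex', 'Male'), ('Age_Group', '20-30'))
```

Note that 'combination' grows multiplicatively with the number of segment columns and their cardinalities, which is the performance concern mentioned below.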

Consider performance implications when choosing the context combination option, especially for datasets with many context categories.

Contributing

Contributions to the Phenome Outlier Analysis package are welcome! Here's how you can contribute:

  1. Fork the repository on GitHub.
  2. Create a new branch for your feature or bug fix.
  3. Write tests for your changes.
  4. Implement your feature or bug fix.
  5. Run the test suite to ensure all tests pass.
  6. Submit a pull request with a clear description of your changes.

Please ensure your code adheres to the existing style conventions and includes appropriate tests.

License

This project is licensed under the MIT License. See the LICENSE file for details.


For questions, issues, or feature requests, please use the GitHub issue tracker.

Happy analyzing!
