Merge pull request #47 from vcon-dev/bee-feelings

SunainaKhan · web-flow · commit 41f14fd48f6e · 2025-04-30T18:54:30.000+05:30
Bee feelings
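
Two steps in this link carry most of the logic: resolving the dotted `source.text_location` path against a source analysis, and reading the `labels` array out of the model's JSON reply. The sketch below illustrates both under stated assumptions. `resolve_path`, `extract_labels`, and the sample data are hypothetical names made up for illustration, not the link's own functions, and the label filter here is deliberately stricter than what the diff does (it keeps only strings).

```python
import json
from typing import Any, List, Optional


def resolve_path(data: Any, path: str) -> Optional[Any]:
    """Follow a dot-separated key path through nested dicts.

    Returns None as soon as a step is missing or a non-dict value is reached.
    """
    current = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def extract_labels(reply: str) -> List[str]:
    """Return the 'labels' array from a JSON reply, or [] if the reply is unusable."""
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        return []
    # Assumption: only a top-level JSON object with a list under 'labels' is accepted.
    labels = parsed.get("labels", []) if isinstance(parsed, dict) else []
    if not isinstance(labels, list):
        return []
    return [label for label in labels if isinstance(label, str)]


# Hypothetical source analysis shaped the way the default text_location expects.
source_analysis = {
    "dialog": 0,
    "type": "transcript",
    "body": {"paragraphs": {"transcript": "Customer reports a billing error."}},
}

text = resolve_path(source_analysis, "body.paragraphs.transcript")
labels = extract_labels('{"labels": ["billing", "complaint"]}')
```

Returning `None` or an empty list instead of raising keeps one malformed analysis or reply from aborting the remaining dialogs; whether that trade-off suits this link is a design choice for the authors.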
diff --git a/server/links/analyze_and_label/README.md b/server/links/analyze_and_label/README.md
@@ -0,0 +1,111 @@
+# Analyze and Label Link
+
+## Overview
+
+The `analyze_and_label` link is a vCon server component that analyzes dialog content and generates labels for categorization. It sends text derived from each dialog (transcripts, messages, chats, emails) to an OpenAI chat model and applies the labels the model returns as tags on the vCon.
+
+## How It Works
+
+1. The link retrieves a vCon from Redis storage
+2. For each dialog in the vCon, it looks for a source analysis (by default of type "transcript") and skips the dialog if none is found or if the dialog already has an analysis of the output type
+3. It extracts the text content from the source analysis (from the specified location in the configuration)
+4. It sends the text to OpenAI's API with a customizable prompt
+5. It processes the API response to extract labels
+6. It adds the analysis as a new analysis object to the vCon
+7. It applies each extracted label as a tag to the vCon
+
+## Supported Dialog Formats
+
+The link is designed to handle various text formats that might appear in dialogs, including:
+
+- **Standard Transcripts**: Plain text transcripts of conversations
+- **Email Format**: Text with headers, subject, body, etc.
+- **Chat Format**: Text with timestamps and speaker identification
+- **Message Format**: Text with headers and body
+
+The link does not parse these formats itself: the source text is sent to the model as-is, so label quality across formats depends on the model and the prompt.
+
+## Configuration Options
+
+The link accepts the following configuration options:
+
+| Option | Description | Default |
+|--------|-------------|--------|
+| `prompt` | The prompt sent to OpenAI for analysis | "Analyze this transcript and provide a list of relevant labels for categorization..." |
+| `analysis_type` | The type assigned to the analysis output | "labeled_analysis" |
+| `model` | The OpenAI model to use | "gpt-4-turbo" |
+| `sampling_rate` | Rate at which to run the analysis (1 = 100%, 0.5 = 50%, etc.) | 1 |
+| `temperature` | The temperature parameter for the OpenAI API | 0.2 |
+| `source.analysis_type` | The type of analysis to use as source | "transcript" |
+| `source.text_location` | Dot-separated key path to the text within the source analysis | "body.paragraphs.transcript" |
+| `response_format` | Format specification for the OpenAI API response | `{"type": "json_object"}` |
+| `OPENAI_API_KEY` | The OpenAI API key (required but not defined in defaults) | None |
+
+## Usage Example
+
+```python
+from server.links.analyze_and_label import run
+
+# OPENAI_API_KEY is required in the options; every other option falls back to its default
+run(
+    vcon_uuid="your-vcon-uuid",
+    link_name="analyze_and_label",
+    opts={
+        "OPENAI_API_KEY": "your-openai-api-key",
+        # Optionally override other defaults
+        "prompt": "Identify key topics, sentiments, and issues in this conversation. Return your response as a JSON object with a single key 'labels' containing an array of strings.",
+        "model": "gpt-3.5-turbo"
+    }
+)
+```
+
+## Customizing Label Generation
+
+You can customize label generation by changing the `prompt` option. The prompt should instruct the model to return a JSON object with a "labels" key containing an array of strings, because the link reads labels only from that key.
+
+Example specialized prompts:
+
+- **Support Issues**: "Analyze this transcript and identify the specific support issues mentioned. Return your response as a JSON object with a single key 'labels' containing an array of issue categories."
+- **Sentiment Analysis**: "Analyze this conversation and identify the customer's sentiments and emotional states. Return your response as a JSON object with a single key 'labels' containing an array of sentiment descriptors."
+- **Product Mentions**: "Identify all products or services mentioned in this transcript. Return your response as a JSON object with a single key 'labels' containing an array of product names."
+
+## Error Handling
+
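+The link handles failures as follows:
+
+- API calls are retried with exponential backoff (up to 6 attempts, waiting between 1 and 65 seconds); if every attempt fails, the error is logged, counted, and re-raised
+- If the response is not valid JSON, the raw text is still stored as the analysis with the parse error recorded, and no tags are applied
+- Errors are logged, and failures and timings are reported as metrics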
+
+## Testing
+
+The link includes a test suite. To run the tests with actual OpenAI API calls (optional):
+
+```bash
+# Set environment variables
+export OPENAI_API_KEY="your-api-key"
+export RUN_OPENAI_ANALYZE_LABEL_TESTS=1
+
+# Run the tests
+pytest server/links/analyze_and_label/tests/test_analyze_and_label.py
+```
+
+Without setting `RUN_OPENAI_ANALYZE_LABEL_TESTS=1`, tests will run with mocked API responses.
+
+## Metrics and Monitoring
+
+The link emits several metrics for monitoring:
+
+- `conserver.link.openai.labels_added`: Number of labels added for each analyzed dialog
+- `conserver.link.openai.analysis_time`: Time taken to analyze each dialog, in seconds
+- `conserver.link.openai.json_parse_failures`: Count of JSON parsing failures
+- `conserver.link.openai.analysis_failures`: Count of overall analysis failures
+
+## Integration with vCon Structure
+
+The link integrates with the vCon structure in two ways:
+
+1. It adds a new analysis object with the `labeled_analysis` type (or the configured type)
+2. It adds tags to the vCon based on the extracted labels
+
+This allows for both structured access to the full analysis and quick filtering/categorization using the applied tags.
diff --git a/server/links/analyze_and_label/__init__.py b/server/links/analyze_and_label/__init__.py
@@ -0,0 +1,211 @@
+from lib.vcon_redis import VconRedis
+from lib.logging_utils import init_logger
+import logging
+import json
+from openai import OpenAI
+from tenacity import (
+    retry,
+    stop_after_attempt,
+    wait_exponential,
+    before_sleep_log,
+)  # for exponential backoff
+from lib.metrics import init_metrics, stats_gauge, stats_count
+import time
+from lib.links.filters import is_included, randomly_execute_with_sampling
+
+init_metrics()
+
+logger = init_logger(__name__)
+
+default_options = {
+    "prompt": "Analyze this transcript and provide a list of relevant labels for categorization. Return your response as a JSON object with a single key 'labels' containing an array of strings.",
+    "analysis_type": "labeled_analysis",
+    "model": "gpt-4-turbo",
+    "sampling_rate": 1,
+    "temperature": 0.2,
+    "source": {
+        "analysis_type": "transcript",
+        "text_location": "body.paragraphs.transcript",
+    },
+    "response_format": {"type": "json_object"}
+}
+
+
+def get_analysis_for_type(vcon, index, analysis_type):
+    for a in vcon.analysis:
+        if a["dialog"] == index and a["type"] == analysis_type:
+            return a
+    return None
+
+
+@retry(
+    wait=wait_exponential(multiplier=2, min=1, max=65),
+    stop=stop_after_attempt(6),
+    before_sleep=before_sleep_log(logger, logging.INFO),
+)
+def generate_analysis_with_labels(transcript, prompt, model, temperature, client, response_format) -> str:
+    messages = [
+        {"role": "system", "content": "You are a helpful assistant that analyzes text and provides relevant labels."},
+        {"role": "user", "content": prompt + "\n\n" + transcript},
+    ]
+
+    response = client.chat.completions.create(
+        model=model,
+        messages=messages,
+        temperature=temperature,
+        response_format=response_format,
+    )
+
+    return response.choices[0].message.content
+
+
+def run(
+    vcon_uuid,
+    link_name,
+    opts=default_options,
+):
+    module_name = __name__.split(".")[-1]
+    logger.info(f"Starting {module_name}: {link_name} plugin for: {vcon_uuid}")
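+    # Merge caller options over the defaults; "source" is merged key by key so a partial override keeps the other defaults
+    source_opts = {**default_options["source"], **(opts.get("source") or {})}
+    opts = {**default_options, **opts, "source": source_opts}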
+
+    vcon_redis = VconRedis()
+    vCon = vcon_redis.get_vcon(vcon_uuid)
+
+    if not is_included(opts, vCon):
+        logger.info(f"Skipping {link_name} vCon {vcon_uuid} due to filters")
+        return vcon_uuid
+
+    if not randomly_execute_with_sampling(opts):
+        logger.info(f"Skipping {link_name} vCon {vcon_uuid} due to sampling")
+        return vcon_uuid
+
+    client = OpenAI(api_key=opts["OPENAI_API_KEY"], timeout=120.0, max_retries=0)
+    source_type = navigate_dict(opts, "source.analysis_type")
+    text_location = navigate_dict(opts, "source.text_location")
+
+    for index, dialog in enumerate(vCon.dialog):
+        source = get_analysis_for_type(vCon, index, source_type)
+        if not source:
+            logger.warning("No %s found for vCon: %s", source_type, vCon.uuid)
+            continue
+        source_text = navigate_dict(source, text_location)
+        if not source_text:
+            logger.warning("No source_text found at %s for vCon: %s", text_location, vCon.uuid)
+            continue
+        analysis = get_analysis_for_type(vCon, index, opts["analysis_type"])
+
+        # See if it already has the analysis
+        if analysis:
+            logger.info(
+                "Dialog %s already has a %s in vCon: %s",
+                index,
+                opts["analysis_type"],
+                vCon.uuid,
+            )
+            continue
+
+        logger.info(
+            "Analysing dialog %s with options: %s",
+            index,
+            {k: v for k, v in opts.items() if k != "OPENAI_API_KEY"},
+        )
+        start = time.time()
+        try:
+            # Get the structured analysis with labels
+            analysis_json_str = generate_analysis_with_labels(
+                transcript=source_text,
+                prompt=opts["prompt"],
+                model=opts["model"],
+                temperature=opts["temperature"],
+                client=client,
+                response_format=opts.get("response_format", {"type": "json_object"})
+            )
+            
+            # Parse the response to get labels
+            try:
+                analysis_data = json.loads(analysis_json_str)
+                labels = analysis_data.get("labels", [])
+                
+                # Add the structured analysis to the vCon
+                vendor_schema = {}
+                vendor_schema["model"] = opts["model"]
+                vendor_schema["prompt"] = opts["prompt"]
+                vCon.add_analysis(
+                    type=opts["analysis_type"],
+                    dialog=index,
+                    vendor="openai",
+                    body=analysis_json_str,
+                    encoding="json",
+                    extra={
+                        "vendor_schema": vendor_schema,
+                    },
+                )
+                
+                # Apply each label as a tag
+                for label in labels:
+                    vCon.add_tag(tag_name=label, tag_value=label)
+                    logger.info(f"Applied label as tag: {label}")
+                
+                stats_gauge(
+                    "conserver.link.openai.labels_added",
+                    len(labels),
+                    tags=[f"analysis_type:{opts['analysis_type']}"],
+                )
+                
+            except json.JSONDecodeError as e:
+                logger.error(f"Failed to parse JSON response for vCon {vcon_uuid}: {e}")
+                stats_count(
+                    "conserver.link.openai.json_parse_failures",
+                    tags=[f"analysis_type:{opts['analysis_type']}"],
+                )
+                # Add the raw text anyway as the analysis
+                vCon.add_analysis(
+                    type=opts["analysis_type"],
+                    dialog=index,
+                    vendor="openai",
+                    body=analysis_json_str,
+                    encoding="none",
+                    extra={
+                        "vendor_schema": {
+                            "model": opts["model"],
+                            "prompt": opts["prompt"],
+                            "parse_error": str(e)
+                        },
+                    },
+                )
+                
+        except Exception as e:
+            logger.error(
+                "Failed to generate analysis for vCon %s after multiple retries: %s",
+                vcon_uuid,
+                e,
+            )
+            stats_count(
+                "conserver.link.openai.analysis_failures",
+                tags=[f"analysis_type:{opts['analysis_type']}"],
+            )
+            raise
+
+        stats_gauge(
+            "conserver.link.openai.analysis_time",
+            time.time() - start,
+            tags=[f"analysis_type:{opts['analysis_type']}"],
+        )
+
+    vcon_redis.store_vcon(vCon)
+    logger.info(f"Finished analyze_and_label - {module_name}:{link_name} plugin for: {vcon_uuid}")
+
+    return vcon_uuid
+
+
+def navigate_dict(dictionary, path):
+    keys = path.split(".")
+    current = dictionary
+    for key in keys:
+        if isinstance(current, dict) and key in current:
+            current = current[key]
+        else:
+            return None
+    return current
diff --git a/server/links/analyze_and_label/tests/__init__.py b/server/links/analyze_and_label/tests/__init__.py
@@ -0,0 +1 @@
+
diff --git a/server/links/analyze_and_label/tests/test_analyze_and_label.py b/server/links/analyze_and_label/tests/test_analyze_and_label.py