Skip to content

Add support for semantic search#954

Closed
jazairi wants to merge 1 commit intomainfrom
use-411-use-412
Closed

Add support for semantic search#954
jazairi wants to merge 1 commit intomainfrom
use-411-use-412

Conversation

@jazairi
Copy link
Copy Markdown
Contributor

@jazairi jazairi commented Apr 1, 2026

Why these changes are being introduced:

Now that we have the Semantic Query Builder lambda, we need to integrate it into TIMDEX.

Relevant ticket(s):

How this addresses that need:

  • Adds SemanticQueryBuilder service to invoke the lambda
  • Adds queryMode parameter (keyword/semantic) to GraphQL interface
  • Implements builder selection pattern in Opensearch model
  • Renames AWS credentials to generic variables (AWS_ACCESS_KEY_ID, etc.)
  • Add comprehensive unit tests (9 semantic builder + 3 queryMode tests)
  • Add aws-sdk-lambda and mocha dependencies
  • Update documentation

Side effects of this change:

Adds aws-sdk-lambda and mocha as dependencies.

Developer

  • All new ENV is documented in README
  • All new ENV has been added to Heroku Pipeline, Staging and Prod
  • ANDI or Wave has been run in accordance to
    our guide and
    all issues introduced by these changes have been resolved or opened as new
    issues (link to those issues in the Pull Request details above)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer

  • The commit message is clear and follows our guidelines
    (not just this pull request message)
  • There are appropriate tests covering any new functionality
  • The documentation has been updated or is unnecessary
  • The changes have been verified
  • New dependencies are appropriate or there were no changes

Requires database migrations?

NO

Includes new or updated dependencies?

YES

@qltysh
Copy link
Copy Markdown

qltysh bot commented Apr 1, 2026

❌ 12 blocking issues (15 total)

Tool Category Rule Count
rubocop Lint Avoid parameter lists longer than 5 parameters. [18/5] 4
rubocop Style Prefer keyword arguments for arguments with a boolean default value; use fulltext: false instead of fulltext = false. 3
rubocop Lint Assignment Branch Condition size for construct\_query is too high. [<24, 33, 0> 40.8/17] 2
rubocop Lint Block has too many lines. [62/25] 1
rubocop Style Line is too long. [143/120] 1
rubocop Lint Method has too many lines. [15/10] 1
qlty Structure Function with many parameters (count = 18): search 3

@@ -103,11 +105,11 @@ def record_id(id:, index:)
end

def search(searchterm:, citation:, contributors:, funding_information:, geodistance:, geobox:, identifiers:,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 2 issues:

1. Function with many parameters (count = 18): search [qlty:function-parameters]


2. Avoid parameter lists longer than 5 parameters. [18/5] [rubocop:Metrics/ParameterLists]


def search(searchterm:, citation:, contributors:, funding_information:, geodistance:, geobox:, identifiers:,
locations:, subjects:, title:, index:, source:, from:, boolean_type:, fulltext:, per_page: 20, **filters)
locations:, subjects:, title:, index:, source:, from:, boolean_type:, fulltext:, per_page: 20, query_mode: 'keyword', **filters)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line is too long. [143/120] [rubocop:Layout/LineLength]

@@ -135,9 +137,10 @@ def inject_hits_fields_into_source(hits)
end

def construct_query(searchterm, citation, contributors, funding_information, geodistance, geobox, identifiers,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 2 issues:

1. Function with many parameters (count = 15): construct_query [qlty:function-parameters]


2. Avoid parameter lists longer than 5 parameters. [15/5] [rubocop:Metrics/ParameterLists]

MAX_SIZE = 200

def search(from, params, client, highlight = false, index = nil, fulltext = false)
def search(from, params, client, highlight = false, index = nil, fulltext = false, query_mode = 'keyword')
Copy link
Copy Markdown

@qltysh qltysh bot Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 4 issues:

1. Function with many parameters (count = 7): search [qlty:function-parameters]


2. Avoid parameter lists longer than 5 parameters. [7/5] [rubocop:Metrics/ParameterLists]


3. Prefer keyword arguments for arguments with a boolean default value; use fulltext: false instead of fulltext = false. [rubocop:Style/OptionalBooleanParameter]


4. Prefer keyword arguments for arguments with a boolean default value; use highlight: false instead of highlight = false. [rubocop:Style/OptionalBooleanParameter]

@query_mode = query_mode
index = default_index unless index.present?
client.search(index:,
body: build_query(from))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method has too many optional parameters. [4/3] [rubocop:Metrics/ParameterLists]

@mitlib mitlib temporarily deployed to timdex-api-p-use-411-us-cqtlti April 1, 2026 19:56 Inactive
@jazairi jazairi requested a review from Copilot April 1, 2026 19:59
@jazairi jazairi temporarily deployed to timdex-api-p-use-411-us-cqtlti April 1, 2026 20:00 Inactive
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Integrates a new Semantic Query Builder AWS Lambda into TIMDEX search, exposing a queryMode GraphQL argument to switch between keyword (lexical) and semantic query construction.

Changes:

  • Added SemanticQueryBuilder to invoke an AWS Lambda and parse its returned OpenSearch query.
  • Added query_mode / queryMode parameter through the GraphQL layer into the OpenSearch query-building path.
  • Updated AWS credential ENV naming (generic AWS_ACCESS_KEY_ID, etc.), added dependencies (aws-sdk-lambda, mocha), and introduced unit tests for semantic querying and mode selection.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
app/models/semantic_query_builder.rb New Lambda-invoking semantic query builder.
app/models/opensearch.rb Selects lexical vs semantic builder based on query_mode.
app/graphql/types/query_type.rb Adds queryMode argument and passes through to OpenSearch.
config/initializers/opensearch.rb Renames AWS credential ENV vars used for SigV4 signing.
test/models/semantic_query_builder_test.rb New unit tests for semantic builder behavior and errors.
test/models/opensearch_test.rb New tests for query-mode builder selection.
test/test_helper.rb Loads Mocha for Minitest.
Gemfile / Gemfile.lock Adds aws-sdk-lambda and mocha dependencies.
README.md Updates ENV documentation and adds semantic builder function name.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

end

def parse_lambda_payload(payload)
JSON.parse(payload)
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aws::Lambda::Client#invoke returns a response whose payload is typically an IO-like object (e.g., StringIO), not a plain JSON string. Calling JSON.parse(payload) will raise a TypeError in that case. Consider normalizing the payload (e.g., read the IO and parse the resulting string) before parsing, and add a test case that uses an IO-like payload to prevent regressions.

Suggested change
JSON.parse(payload)
payload_str =
if payload.respond_to?(:read)
payload.read
else
payload.to_s
end
JSON.parse(payload_str)

Copilot uses AI. Check for mistakes.
Comment on lines +17 to +22
client = Aws::Lambda::Client.new(
region: ENV.fetch('AWS_REGION', 'us-east-1'),
access_key_id: ENV.fetch('AWS_ACCESS_KEY_ID', nil),
secret_access_key: ENV.fetch('AWS_SECRET_ACCESS_KEY', nil)
)

Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aws::Lambda::Client.new is configured with only access key and secret key, but the app now supports session tokens for AWS OpenSearch (AWS_SESSION_TOKEN). If callers use temporary credentials (common for personal/SSO profiles), the Lambda call will fail without passing the session token (or letting the AWS SDK use its default credential provider chain). Consider including AWS_SESSION_TOKEN when present or removing explicit credential arguments so the SDK can resolve credentials (including tokens) correctly.

Suggested change
client = Aws::Lambda::Client.new(
region: ENV.fetch('AWS_REGION', 'us-east-1'),
access_key_id: ENV.fetch('AWS_ACCESS_KEY_ID', nil),
secret_access_key: ENV.fetch('AWS_SECRET_ACCESS_KEY', nil)
)
client_options = {
region: ENV.fetch('AWS_REGION', 'us-east-1')
}
access_key_id = ENV['AWS_ACCESS_KEY_ID']
secret_access_key = ENV['AWS_SECRET_ACCESS_KEY']
session_token = ENV['AWS_SESSION_TOKEN']
if access_key_id.present? && secret_access_key.present?
client_options[:access_key_id] = access_key_id
client_options[:secret_access_key] = secret_access_key
client_options[:session_token] = session_token if session_token.present?
end
client = Aws::Lambda::Client.new(client_options)

Copilot uses AI. Check for mistakes.
payload = { query: query_text }

response = client.invoke(
function_name: ENV.fetch('TIMDEX_SEMANTIC_BUILDER_FUNCTION_NAME', 'timdex-semantic-builder-prod:live'),
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defaulting TIMDEX_SEMANTIC_BUILDER_FUNCTION_NAME to a production Lambda alias (timdex-semantic-builder-prod:live) risks unintentionally invoking production from dev/staging when the ENV is missing (and the PR checklist notes this ENV hasn’t been added to Heroku yet). Consider failing fast when the ENV is unset, or defaulting based on environment (e.g., dev/stage), to avoid unexpected prod traffic/costs.

Suggested change
function_name: ENV.fetch('TIMDEX_SEMANTIC_BUILDER_FUNCTION_NAME', 'timdex-semantic-builder-prod:live'),
function_name: ENV.fetch('TIMDEX_SEMANTIC_BUILDER_FUNCTION_NAME'),

Copilot uses AI. Check for mistakes.
Comment on lines +8 to +11
def mock_lambda_invoke(query_text, response_data)
mock_response = Struct.new(:payload).new(response_data.to_json)
Aws::Lambda::Client.any_instance.expects(:invoke).returns(mock_response)
end
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test double for invoke sets payload to a JSON string, but the real AWS SDK typically exposes payload as an IO-like object (e.g., StringIO). This test setup can mask parsing issues in SemanticQueryBuilder. Consider returning an IO-like payload in the mock (and asserting it’s handled) so the tests reflect actual SDK behavior.

Copilot uses AI. Check for mistakes.
@jazairi jazairi temporarily deployed to timdex-api-p-use-411-us-cqtlti April 1, 2026 20:16 Inactive
@jazairi jazairi temporarily deployed to timdex-api-p-use-411-us-cqtlti April 1, 2026 20:45 Inactive
require 'aws-sdk-lambda'

class SemanticQueryBuilder
def build(params, _fulltext = false)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer keyword arguments for arguments with a boolean default value; use _fulltext: false instead of _fulltext = false. [rubocop:Style/OptionalBooleanParameter]


parse_lambda_payload(response.payload)
rescue StandardError => e
raise "Semantic query builder Lambda error: #{e.message}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method has too many lines. [15/10] [rubocop:Metrics/MethodLength]

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +52 to +54
query = lambda_response['query']

raise "Invalid semantic query builder response: missing 'query' key" unless query.present?
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query.present? is used to decide whether the 'query' key is missing, but present? is false for an empty Hash. If the Lambda returns { "query": {} }, this will raise “missing 'query' key” even though the key exists. Prefer checking lambda_response.key?('query') (and then separately validating non-empty / expected structure) so the error message is accurate.

Suggested change
query = lambda_response['query']
raise "Invalid semantic query builder response: missing 'query' key" unless query.present?
raise "Invalid semantic query builder response: missing 'query' key" unless lambda_response.key?('query')
query = lambda_response['query']

Copilot uses AI. Check for mistakes.
Comment on lines +4 to +10
setup do
# Set required environment variables for Lambda
ENV['TIMDEX_SEMANTIC_BUILDER_FUNCTION_NAME'] = 'timdex-semantic-builder-test'
ENV['AWS_ACCESS_KEY_ID'] = 'test-key'
ENV['AWS_SECRET_ACCESS_KEY'] = 'test-secret'
ENV['AWS_REGION'] = 'us-east-1'
@builder = SemanticQueryBuilder.new
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test suite sets AWS-related ENV vars in setup but never restores the previous values, which can leak state into other tests and cause order-dependent failures. Use ClimateControl.modify (preferred elsewhere in the suite) or save/restore ENV in teardown.

Copilot uses AI. Check for mistakes.
def mock_lambda_invoke(query_text, response_data)
# Use StringIO to simulate real AWS SDK behavior (payload is IO-like, not a plain string)
mock_response = Struct.new(:payload).new(StringIO.new(response_data.to_json))
Aws::Lambda::Client.any_instance.expects(:invoke).returns(mock_response)
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mock_lambda_invoke stubs Aws::Lambda::Client#invoke but doesn’t assert that the call is made with the expected function_name and request payload derived from query_text (the query_text argument is currently unused). Adding an argument expectation here would better cover the new request-shaping logic in invoke_semantic_builder.

Suggested change
Aws::Lambda::Client.any_instance.expects(:invoke).returns(mock_response)
Aws::Lambda::Client.any_instance.expects(:invoke).with do |params|
# Ensure we call the expected Lambda function with a payload derived from query_text
assert_equal ENV['TIMDEX_SEMANTIC_BUILDER_FUNCTION_NAME'], params[:function_name]
parsed_payload = JSON.parse(params[:payload])
assert_equal query_text, parsed_payload['query_text']
true
end.returns(mock_response)

Copilot uses AI. Check for mistakes.
Comment on lines +161 to 165
- `AWS_ACCESS_KEY_ID`: AWS credentials for OpenSearch and Lambda
- `AWS_SECRET_ACCESS_KEY`: AWS credentials for OpenSearch and Lambda
- `AWS_REGION`: AWS region for OpenSearch and Lambda services
- `AWS_OPENSEARCH`: boolean. Set to true to enable AWSv4 Signing for OpenSearch
- `OPENSEARCH_INDEX`: Opensearch index or alias to query, default will be to search all indexes which is generally not
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code now supports session-based AWS credentials via AWS_SESSION_TOKEN (see OpenSearch initializer), but this ENV isn’t documented here. Add AWS_SESSION_TOKEN (optional) to the environment variable list and clarify it’s required when using temporary credentials.

Copilot uses AI. Check for mistakes.
@jazairi jazairi requested a review from Copilot April 1, 2026 22:33
@jazairi jazairi temporarily deployed to timdex-api-p-use-411-us-cqtlti April 1, 2026 22:33 Inactive
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 11 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

OpenSearch::Client.new log: ENV.fetch('OPENSEARCH_LOG', false), url: ENV['OPENSEARCH_URL'] do |config|
# personal keys use expiring credentials with tokens
if ENV.fetch('AWS_OPENSEARCH_SESSION_TOKEN', false)
if ENV.fetch('AWS_SESSION_TOKEN', false)
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ENV.fetch('AWS_SESSION_TOKEN', false) will evaluate to truthy even when the var is set to an empty string, which will force SigV4 signing to include a blank session token and can break auth. Prefer checking presence (e.g., ENV['AWS_SESSION_TOKEN'].present?) before selecting the session-token branch.

Suggested change
if ENV.fetch('AWS_SESSION_TOKEN', false)
if ENV['AWS_SESSION_TOKEN'].present?

Copilot uses AI. Check for mistakes.
locations, subjects, title, source, boolean_type, filters, per_page, query_mode = 'keyword')
query = {}
query[:q] = searchterm
query[:query_mode] = query_mode
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

construct_query adds query[:query_mode], but Opensearch#search already receives query_mode as a separate argument and none of the query builders read params[:query_mode]. This makes query_mode in the params hash unused and potentially confusing; consider removing it or switching Opensearch#query to derive the mode from @params[:query_mode] consistently.

Suggested change
query[:query_mode] = query_mode

Copilot uses AI. Check for mistakes.
Comment on lines +161 to +162
- `AWS_ACCESS_KEY_ID`: AWS credentials for OpenSearch and Lambda
- `AWS_SECRET_ACCESS_KEY`: AWS credentials for OpenSearch and Lambda
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation lists the renamed AWS credential variables, but the code also supports AWS_SESSION_TOKEN for temporary credentials (see config/initializers/opensearch.rb) and semantic search will likely need it too. Please document AWS_SESSION_TOKEN (and whether keys are required vs optional when using an IAM role/metadata credentials) to avoid deployment/config surprises.

Suggested change
- `AWS_ACCESS_KEY_ID`: AWS credentials for OpenSearch and Lambda
- `AWS_SECRET_ACCESS_KEY`: AWS credentials for OpenSearch and Lambda
- `AWS_ACCESS_KEY_ID`: AWS access key for OpenSearch and Lambda. Required when using explicit static or temporary
credentials. When running with an IAM role or metadata-based credentials (e.g., EC2/ECS/Lambda role), this is normally
resolved automatically and should not be set.
- `AWS_SECRET_ACCESS_KEY`: AWS secret access key for OpenSearch and Lambda. Required when using explicit static or
temporary credentials. When running with an IAM role or metadata-based credentials, this is normally resolved
automatically and should not be set.
- `AWS_SESSION_TOKEN`: AWS session token for temporary (STS) credentials. Required when using temporary credentials
issued by STS. When running with long-lived access keys or with an IAM role / metadata-based credentials, this is
typically not set.

Copilot uses AI. Check for mistakes.
@jazairi jazairi requested a review from Copilot April 2, 2026 15:02
@jazairi jazairi temporarily deployed to timdex-api-p-use-411-us-cqtlti April 2, 2026 15:05 Inactive
payload = JSON.parse(params[:payload])
assert_equal query_text, payload['query']
true
end.returns(mock_response)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assignment Branch Condition size for mock_lambda_invoke is too high. [<3, 17, 0> 17.26/17] [rubocop:Metrics/AbcSize]

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 11 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 17 to 21
# personal keys use expiring credentials with tokens
if ENV.fetch('AWS_OPENSEARCH_SESSION_TOKEN', false)
if ENV.fetch('AWS_SESSION_TOKEN', false)
config.request :aws_sigv4,
service: 'es',
region: ENV['AWS_REGION'],
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ENV.fetch('AWS_SESSION_TOKEN', false) will treat an empty string as truthy, causing requests to be signed with an empty session token and likely fail. Safer is to branch on presence (e.g., ENV['AWS_SESSION_TOKEN'].present?) so only non-blank tokens enable the session-token path.

Copilot uses AI. Check for mistakes.
Why these changes are being introduced:

Now that we have the Semantic Query Builder lambda, we need to
integrate it into TIMDEX.

Relevant ticket(s):

- [USE-411](https://mitlibraries.atlassian.net/browse/USE-411)
- [USE-412](https://mitlibraries.atlassian.net/browse/USE-412)

How this addresses that need:

- Adds SemanticQueryBuilder service to invoke the lambda
- Adds queryMode parameter (keyword/semantic) to GraphQL interface
- Implements builder selection pattern in Opensearch model
- Renames AWS credentials to generic variables (AWS_ACCESS_KEY_ID, etc.)
- Add comprehensive unit tests (9 semantic builder + 3 queryMode tests)
- Add aws-sdk-lambda and mocha dependencies
- Update documentation

Side effects of this change:

Adds aws-sdk-lambda and mocha as dependencies.
@jazairi jazairi closed this Apr 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants