Conversation
❌ 12 blocking issues (15 total)
|
| @@ -103,11 +105,11 @@ def record_id(id:, index:) | |||
| end | |||
|
|
|||
| def search(searchterm:, citation:, contributors:, funding_information:, geodistance:, geobox:, identifiers:, | |||
|
|
||
| def search(searchterm:, citation:, contributors:, funding_information:, geodistance:, geobox:, identifiers:, | ||
| locations:, subjects:, title:, index:, source:, from:, boolean_type:, fulltext:, per_page: 20, **filters) | ||
| locations:, subjects:, title:, index:, source:, from:, boolean_type:, fulltext:, per_page: 20, query_mode: 'keyword', **filters) |
| @@ -135,9 +137,10 @@ def inject_hits_fields_into_source(hits) | |||
| end | |||
|
|
|||
| def construct_query(searchterm, citation, contributors, funding_information, geodistance, geobox, identifiers, | |||
| MAX_SIZE = 200 | ||
|
|
||
| def search(from, params, client, highlight = false, index = nil, fulltext = false) | ||
| def search(from, params, client, highlight = false, index = nil, fulltext = false, query_mode = 'keyword') |
There was a problem hiding this comment.
Found 4 issues:
1. Function with many parameters (count = 7): search [qlty:function-parameters]
2. Avoid parameter lists longer than 5 parameters. [7/5] [rubocop:Metrics/ParameterLists]
3. Prefer keyword arguments for arguments with a boolean default value; use fulltext: false instead of fulltext = false. [rubocop:Style/OptionalBooleanParameter]
4. Prefer keyword arguments for arguments with a boolean default value; use highlight: false instead of highlight = false. [rubocop:Style/OptionalBooleanParameter]
| @query_mode = query_mode | ||
| index = default_index unless index.present? | ||
| client.search(index:, | ||
| body: build_query(from)) |
There was a problem hiding this comment.
Pull request overview
Integrates a new Semantic Query Builder AWS Lambda into TIMDEX search, exposing a queryMode GraphQL argument to switch between keyword (lexical) and semantic query construction.
Changes:
- Added
SemanticQueryBuilderto invoke an AWS Lambda and parse its returned OpenSearch query. - Added
query_mode/queryModeparameter through the GraphQL layer into the OpenSearch query-building path. - Updated AWS credential ENV naming (generic
AWS_ACCESS_KEY_ID, etc.), added dependencies (aws-sdk-lambda,mocha), and introduced unit tests for semantic querying and mode selection.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
app/models/semantic_query_builder.rb |
New Lambda-invoking semantic query builder. |
app/models/opensearch.rb |
Selects lexical vs semantic builder based on query_mode. |
app/graphql/types/query_type.rb |
Adds queryMode argument and passes through to OpenSearch. |
config/initializers/opensearch.rb |
Renames AWS credential ENV vars used for SigV4 signing. |
test/models/semantic_query_builder_test.rb |
New unit tests for semantic builder behavior and errors. |
test/models/opensearch_test.rb |
New tests for query-mode builder selection. |
test/test_helper.rb |
Loads Mocha for Minitest. |
Gemfile / Gemfile.lock |
Adds aws-sdk-lambda and mocha dependencies. |
README.md |
Updates ENV documentation and adds semantic builder function name. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
app/models/semantic_query_builder.rb
Outdated
| end | ||
|
|
||
| def parse_lambda_payload(payload) | ||
| JSON.parse(payload) |
There was a problem hiding this comment.
Aws::Lambda::Client#invoke returns a response whose payload is typically an IO-like object (e.g., StringIO), not a plain JSON string. Calling JSON.parse(payload) will raise a TypeError in that case. Consider normalizing the payload (e.g., read the IO and parse the resulting string) before parsing, and add a test case that uses an IO-like payload to prevent regressions.
| JSON.parse(payload) | |
| payload_str = | |
| if payload.respond_to?(:read) | |
| payload.read | |
| else | |
| payload.to_s | |
| end | |
| JSON.parse(payload_str) |
app/models/semantic_query_builder.rb
Outdated
| client = Aws::Lambda::Client.new( | ||
| region: ENV.fetch('AWS_REGION', 'us-east-1'), | ||
| access_key_id: ENV.fetch('AWS_ACCESS_KEY_ID', nil), | ||
| secret_access_key: ENV.fetch('AWS_SECRET_ACCESS_KEY', nil) | ||
| ) | ||
|
|
There was a problem hiding this comment.
Aws::Lambda::Client.new is configured with only access key and secret key, but the app now supports session tokens for AWS OpenSearch (AWS_SESSION_TOKEN). If callers use temporary credentials (common for personal/SSO profiles), the Lambda call will fail without passing the session token (or letting the AWS SDK use its default credential provider chain). Consider including AWS_SESSION_TOKEN when present or removing explicit credential arguments so the SDK can resolve credentials (including tokens) correctly.
| client = Aws::Lambda::Client.new( | |
| region: ENV.fetch('AWS_REGION', 'us-east-1'), | |
| access_key_id: ENV.fetch('AWS_ACCESS_KEY_ID', nil), | |
| secret_access_key: ENV.fetch('AWS_SECRET_ACCESS_KEY', nil) | |
| ) | |
| client_options = { | |
| region: ENV.fetch('AWS_REGION', 'us-east-1') | |
| } | |
| access_key_id = ENV['AWS_ACCESS_KEY_ID'] | |
| secret_access_key = ENV['AWS_SECRET_ACCESS_KEY'] | |
| session_token = ENV['AWS_SESSION_TOKEN'] | |
| if access_key_id.present? && secret_access_key.present? | |
| client_options[:access_key_id] = access_key_id | |
| client_options[:secret_access_key] = secret_access_key | |
| client_options[:session_token] = session_token if session_token.present? | |
| end | |
| client = Aws::Lambda::Client.new(client_options) |
app/models/semantic_query_builder.rb
Outdated
| payload = { query: query_text } | ||
|
|
||
| response = client.invoke( | ||
| function_name: ENV.fetch('TIMDEX_SEMANTIC_BUILDER_FUNCTION_NAME', 'timdex-semantic-builder-prod:live'), |
There was a problem hiding this comment.
Defaulting TIMDEX_SEMANTIC_BUILDER_FUNCTION_NAME to a production Lambda alias (timdex-semantic-builder-prod:live) risks unintentionally invoking production from dev/staging when the ENV is missing (and the PR checklist notes this ENV hasn’t been added to Heroku yet). Consider failing fast when the ENV is unset, or defaulting based on environment (e.g., dev/stage), to avoid unexpected prod traffic/costs.
| function_name: ENV.fetch('TIMDEX_SEMANTIC_BUILDER_FUNCTION_NAME', 'timdex-semantic-builder-prod:live'), | |
| function_name: ENV.fetch('TIMDEX_SEMANTIC_BUILDER_FUNCTION_NAME'), |
| def mock_lambda_invoke(query_text, response_data) | ||
| mock_response = Struct.new(:payload).new(response_data.to_json) | ||
| Aws::Lambda::Client.any_instance.expects(:invoke).returns(mock_response) | ||
| end |
There was a problem hiding this comment.
The test double for invoke sets payload to a JSON string, but the real AWS SDK typically exposes payload as an IO-like object (e.g., StringIO). This test setup can mask parsing issues in SemanticQueryBuilder. Consider returning an IO-like payload in the mock (and asserting it’s handled) so the tests reflect actual SDK behavior.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 10 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
app/models/semantic_query_builder.rb
Outdated
| query = lambda_response['query'] | ||
|
|
||
| raise "Invalid semantic query builder response: missing 'query' key" unless query.present? |
There was a problem hiding this comment.
query.present? is used to decide whether the 'query' key is missing, but present? is false for an empty Hash. If the Lambda returns { "query": {} }, this will raise “missing 'query' key” even though the key exists. Prefer checking lambda_response.key?('query') (and then separately validating non-empty / expected structure) so the error message is accurate.
| query = lambda_response['query'] | |
| raise "Invalid semantic query builder response: missing 'query' key" unless query.present? | |
| raise "Invalid semantic query builder response: missing 'query' key" unless lambda_response.key?('query') | |
| query = lambda_response['query'] |
| setup do | ||
| # Set required environment variables for Lambda | ||
| ENV['TIMDEX_SEMANTIC_BUILDER_FUNCTION_NAME'] = 'timdex-semantic-builder-test' | ||
| ENV['AWS_ACCESS_KEY_ID'] = 'test-key' | ||
| ENV['AWS_SECRET_ACCESS_KEY'] = 'test-secret' | ||
| ENV['AWS_REGION'] = 'us-east-1' | ||
| @builder = SemanticQueryBuilder.new |
There was a problem hiding this comment.
This test suite sets AWS-related ENV vars in setup but never restores the previous values, which can leak state into other tests and cause order-dependent failures. Use ClimateControl.modify (preferred elsewhere in the suite) or save/restore ENV in teardown.
| def mock_lambda_invoke(query_text, response_data) | ||
| # Use StringIO to simulate real AWS SDK behavior (payload is IO-like, not a plain string) | ||
| mock_response = Struct.new(:payload).new(StringIO.new(response_data.to_json)) | ||
| Aws::Lambda::Client.any_instance.expects(:invoke).returns(mock_response) |
There was a problem hiding this comment.
mock_lambda_invoke stubs Aws::Lambda::Client#invoke but doesn’t assert that the call is made with the expected function_name and request payload derived from query_text (the query_text argument is currently unused). Adding an argument expectation here would better cover the new request-shaping logic in invoke_semantic_builder.
| Aws::Lambda::Client.any_instance.expects(:invoke).returns(mock_response) | |
| Aws::Lambda::Client.any_instance.expects(:invoke).with do |params| | |
| # Ensure we call the expected Lambda function with a payload derived from query_text | |
| assert_equal ENV['TIMDEX_SEMANTIC_BUILDER_FUNCTION_NAME'], params[:function_name] | |
| parsed_payload = JSON.parse(params[:payload]) | |
| assert_equal query_text, parsed_payload['query_text'] | |
| true | |
| end.returns(mock_response) |
| - `AWS_ACCESS_KEY_ID`: AWS credentials for OpenSearch and Lambda | ||
| - `AWS_SECRET_ACCESS_KEY`: AWS credentials for OpenSearch and Lambda | ||
| - `AWS_REGION`: AWS region for OpenSearch and Lambda services | ||
| - `AWS_OPENSEARCH`: boolean. Set to true to enable AWSv4 Signing for OpenSearch | ||
| - `OPENSEARCH_INDEX`: Opensearch index or alias to query, default will be to search all indexes which is generally not |
There was a problem hiding this comment.
The code now supports session-based AWS credentials via AWS_SESSION_TOKEN (see OpenSearch initializer), but this ENV isn’t documented here. Add AWS_SESSION_TOKEN (optional) to the environment variable list and clarify it’s required when using temporary credentials.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 11 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
config/initializers/opensearch.rb
Outdated
| OpenSearch::Client.new log: ENV.fetch('OPENSEARCH_LOG', false), url: ENV['OPENSEARCH_URL'] do |config| | ||
| # personal keys use expiring credentials with tokens | ||
| if ENV.fetch('AWS_OPENSEARCH_SESSION_TOKEN', false) | ||
| if ENV.fetch('AWS_SESSION_TOKEN', false) |
There was a problem hiding this comment.
ENV.fetch('AWS_SESSION_TOKEN', false) will evaluate to truthy even when the var is set to an empty string, which will force SigV4 signing to include a blank session token and can break auth. Prefer checking presence (e.g., ENV['AWS_SESSION_TOKEN'].present?) before selecting the session-token branch.
| if ENV.fetch('AWS_SESSION_TOKEN', false) | |
| if ENV['AWS_SESSION_TOKEN'].present? |
| locations, subjects, title, source, boolean_type, filters, per_page, query_mode = 'keyword') | ||
| query = {} | ||
| query[:q] = searchterm | ||
| query[:query_mode] = query_mode |
There was a problem hiding this comment.
construct_query adds query[:query_mode], but Opensearch#search already receives query_mode as a separate argument and none of the query builders read params[:query_mode]. This makes query_mode in the params hash unused and potentially confusing; consider removing it or switching Opensearch#query to derive the mode from @params[:query_mode] consistently.
| query[:query_mode] = query_mode |
| - `AWS_ACCESS_KEY_ID`: AWS credentials for OpenSearch and Lambda | ||
| - `AWS_SECRET_ACCESS_KEY`: AWS credentials for OpenSearch and Lambda |
There was a problem hiding this comment.
Documentation lists the renamed AWS credential variables, but the code also supports AWS_SESSION_TOKEN for temporary credentials (see config/initializers/opensearch.rb) and semantic search will likely need it too. Please document AWS_SESSION_TOKEN (and whether keys are required vs optional when using an IAM role/metadata credentials) to avoid deployment/config surprises.
| - `AWS_ACCESS_KEY_ID`: AWS credentials for OpenSearch and Lambda | |
| - `AWS_SECRET_ACCESS_KEY`: AWS credentials for OpenSearch and Lambda | |
| - `AWS_ACCESS_KEY_ID`: AWS access key for OpenSearch and Lambda. Required when using explicit static or temporary | |
| credentials. When running with an IAM role or metadata-based credentials (e.g., EC2/ECS/Lambda role), this is normally | |
| resolved automatically and should not be set. | |
| - `AWS_SECRET_ACCESS_KEY`: AWS secret access key for OpenSearch and Lambda. Required when using explicit static or | |
| temporary credentials. When running with an IAM role or metadata-based credentials, this is normally resolved | |
| automatically and should not be set. | |
| - `AWS_SESSION_TOKEN`: AWS session token for temporary (STS) credentials. Required when using temporary credentials | |
| issued by STS. When running with long-lived access keys or with an IAM role / metadata-based credentials, this is | |
| typically not set. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 11 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # personal keys use expiring credentials with tokens | ||
| if ENV.fetch('AWS_OPENSEARCH_SESSION_TOKEN', false) | ||
| if ENV.fetch('AWS_SESSION_TOKEN', false) | ||
| config.request :aws_sigv4, | ||
| service: 'es', | ||
| region: ENV['AWS_REGION'], |
There was a problem hiding this comment.
ENV.fetch('AWS_SESSION_TOKEN', false) will treat an empty string as truthy, causing requests to be signed with an empty session token and likely fail. Safer is to branch on presence (e.g., ENV['AWS_SESSION_TOKEN'].present?) so only non-blank tokens enable the session-token path.
Why these changes are being introduced: Now that we have the Semantic Query Builder lambda, we need to integrate it into TIMDEX. Relevant ticket(s): - [USE-411](https://mitlibraries.atlassian.net/browse/USE-411) - [USE-412](https://mitlibraries.atlassian.net/browse/USE-412) How this addresses that need: - Adds SemanticQueryBuilder service to invoke the lambda - Adds queryMode parameter (keyword/semantic) to GraphQL interface - Implements builder selection pattern in Opensearch model - Renames AWS credentials to generic variables (AWS_ACCESS_KEY_ID, etc.) - Add comprehensive unit tests (9 semantic builder + 3 queryMode tests) - Add aws-sdk-lambda and mocha dependencies - Update documentation Side effects of this change: Adds aws-sdk-lambda and mocha as dependencies.
Why these changes are being introduced:
Now that we have the Semantic Query Builder lambda, we need to integrate it into TIMDEX.
Relevant ticket(s):
How this addresses that need:
Side effects of this change:
Adds aws-sdk-lambda and mocha as dependencies.
Developer
our guide and
all issues introduced by these changes have been resolved or opened as new
issues (link to those issues in the Pull Request details above)
Code Reviewer
(not just this pull request message)
Requires database migrations?
NO
Includes new or updated dependencies?
YES