Commit 53f13e4

Add support for semantic search
Why these changes are being introduced:

Now that we have the Semantic Query Builder lambda, we need to integrate it into TIMDEX.

Relevant ticket(s):

- [USE-411](https://mitlibraries.atlassian.net/browse/USE-411)
- [USE-412](https://mitlibraries.atlassian.net/browse/USE-412)

How this addresses that need:

- Adds SemanticQueryBuilder service to invoke the lambda
- Adds queryMode parameter (keyword/semantic) to GraphQL interface
- Implements builder selection pattern in Opensearch model
- Renames AWS credentials to generic variables (AWS_ACCESS_KEY_ID, etc.)

Side effects of this change:

- Adds aws-sdk-lambda and mocha as dependencies.
- Makes small change to OpenSearch initializer: `ENV.fetch('AWS_SESSION_TOKEN', false)` returns an empty string when the variable is set but empty, and an empty string is truthy in Ruby. This has been updated to check for the presence of the env var instead.
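The truthiness issue behind the initializer change can be reproduced in plain Ruby. The sketch below is illustrative only: `DEMO_SESSION_TOKEN` is a hypothetical variable name used to avoid touching real credentials, and the presence check is a plain-Ruby approximation of ActiveSupport's `present?` for strings.

```ruby
# Hypothetical variable name, set to an empty string as can happen when a
# deployment defines the variable but leaves its value blank.
ENV['DEMO_SESSION_TOKEN'] = ''

# ENV.fetch only uses the default when the variable is missing entirely.
# Here it returns '', and every String (including '') is truthy in Ruby.
fetch_check = ENV.fetch('DEMO_SESSION_TOKEN', false) ? true : false

# Plain-Ruby approximation of `ENV['DEMO_SESSION_TOKEN'].present?`:
# false for nil, empty, or whitespace-only values.
presence_check = !ENV['DEMO_SESSION_TOKEN'].to_s.strip.empty?

puts "fetch-based check: #{fetch_check}"
puts "presence check:    #{presence_check}"
```

With the fetch-based check, an empty token would select the session-token branch of the signer configuration; the presence check avoids that.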
1 parent 0e4cfa8 commit 53f13e4

14 files changed

Lines changed: 274 additions & 29 deletions

.env.test

Lines changed: 4 additions & 0 deletions
@@ -2,3 +2,7 @@ EMAIL_FROM=fake@example.com
 EMAIL_URL_HOST=localhost:3000
 JWT_SECRET_KEY=3862fc949629030de4259b88f6e8f7c3702b2fabfc68d00d46fb7f9f70110690b526997ef4d77765ffa010d8aba440286af39947d0c85287174d99be2db14987
 OPENSEARCH_INDEX=all-current
+AWS_ACCESS_KEY_ID=test-key
+AWS_SECRET_ACCESS_KEY=test-secret
+AWS_REGION=us-east-1
+TIMDEX_SEMANTIC_BUILDER_FUNCTION_NAME=timdex-semantic-builder-test

Gemfile

Lines changed: 2 additions & 0 deletions
@@ -3,6 +3,7 @@ git_source(:github) { |repo| "https://github.com/#{repo}.git" }
 
 ruby '3.4.8'
 
+gem 'aws-sdk-lambda'
 gem 'bootsnap', require: false
 gem 'devise'
 gem 'faraday_middleware-aws-sigv4'
@@ -48,6 +49,7 @@ group :test do
   gem 'capybara'
   gem 'climate_control'
   gem 'minitest', '< 6' # required for Rails 7.2.3
+  gem 'mocha'
   gem 'selenium-webdriver'
   gem 'simplecov', require: false
   gem 'simplecov-lcov', require: false

Gemfile.lock

Lines changed: 18 additions & 0 deletions
@@ -90,6 +90,18 @@ GEM
       rake (>= 10.4, < 14.0)
     ast (2.4.3)
     aws-eventstream (1.4.0)
+    aws-partitions (1.1232.0)
+    aws-sdk-core (3.244.0)
+      aws-eventstream (~> 1, >= 1.3.0)
+      aws-partitions (~> 1, >= 1.992.0)
+      aws-sigv4 (~> 1.9)
+      base64
+      bigdecimal
+      jmespath (~> 1, >= 1.6.1)
+      logger
+    aws-sdk-lambda (1.176.0)
+      aws-sdk-core (~> 3, >= 3.244.0)
+      aws-sigv4 (~> 1.5)
     aws-sigv4 (1.12.1)
       aws-eventstream (~> 1, >= 1.0.2)
     base64 (0.3.0)
@@ -205,6 +217,7 @@ GEM
       jekyll (>= 3.8, < 5.0)
     jekyll-watch (2.2.1)
       listen (~> 3.0)
+    jmespath (1.6.2)
     json (2.18.1)
     json-schema (6.1.0)
       addressable (~> 2.8)
@@ -245,6 +258,8 @@ GEM
     mini_mime (1.1.5)
     mini_portile2 (2.8.9)
     minitest (5.27.0)
+    mocha (3.1.0)
+      ruby2_keywords (>= 0.0.5)
     msgpack (1.8.0)
     multi_json (1.19.1)
     net-http (0.9.1)
@@ -373,6 +388,7 @@ GEM
       rubocop (>= 1.75.0, < 2.0)
       rubocop-ast (>= 1.44.0, < 2.0)
     ruby-progressbar (1.13.0)
+    ruby2_keywords (0.0.5)
     rubyzip (2.4.1)
     safe_yaml (1.0.5)
     sass-embedded (1.97.2)
@@ -461,6 +477,7 @@ PLATFORMS
 
 DEPENDENCIES
   annotate
+  aws-sdk-lambda
   bootsnap
   byebug
   capybara
@@ -479,6 +496,7 @@ DEPENDENCIES
   lograge
   minitest (< 6)
   mitlibraries-theme!
+  mocha
   opensearch-ruby
   pg
   puma

README.md

Lines changed: 9 additions & 7 deletions
@@ -45,8 +45,8 @@ The test recording process is as follows:
 - Set the following values to whatever cluster you want to connect to in `.env` (Note: `.env` is preferred over `.env.test` because it is already in our `.gitignore` and will work together with `.env.test`.)
   - OPENSEARCH_URL
   - AWS_OPENSEARCH=true
-  - AWS_OPENSEARCH_ACCESS_KEY_ID
-  - AWS_OPENSEARCH_SECRET_ACCESS_KEY
+  - AWS_ACCESS_KEY_ID
+  - AWS_SECRET_ACCESS_KEY
   - AWS_REGION
 - Delete any cassette you want to regenerate (for new tests, you can skip this). If you are making a graphql test, nest your cassette inside the `opensearch_init` cassette.
 
@@ -158,29 +158,31 @@ locally.
 
 ## Production required Environment Variables
 
-- `AWS_OPENSEARCH`: boolean. Set to true to enable AWSv4 Signing
-- `AWS_OPENSEARCH_ACCESS_KEY_ID`
-- `AWS_OPENSEARCH_SECRET_ACCESS_KEY`
-- `AWS_REGION`
+- `AWS_ACCESS_KEY_ID`: AWS credentials for OpenSearch and Lambda
+- `AWS_SECRET_ACCESS_KEY`: AWS credentials for OpenSearch and Lambda
+- `AWS_REGION`: AWS region for OpenSearch and Lambda services
+- `AWS_OPENSEARCH`: boolean. Set to true to enable AWSv4 Signing for OpenSearch
 - `OPENSEARCH_INDEX`: Opensearch index or alias to query, default will be to search all indexes which is generally not
   expected. `timdex` or `all-current` are aliases used consistently in our data pipelines, with
   `timdex` being most likely what most use cases will want.
 - `OPENSEARCH_URL`: Opensearch URL, defaults to `http://localhost:9200`
+- `TIMDEX_SEMANTIC_BUILDER_FUNCTION_NAME`: AWS Lambda function name with alias for semantic query building.
+  Configurable to use alternative deployment tiers (e.g., dev1, stage, prod).
 - `SMTP_ADDRESS`
 - `SMTP_PASSWORD`
 - `SMTP_PORT`
 - `SMTP_USER`
 
 ## Optional Environment Variables (all ENVs)
 
+- `AWS_SESSION_TOKEN`: AWS session token for temporary credentials when using expiring AWS credentials
 - `OPENSEARCH_LOG` if `true`, verbosely logs OpenSearch queries.
 
   ```text
   NOTE: do not set this ENV at all if you want ES logging fully disabled.
   Setting it to `false` is still setting it and you will be annoyed and
   confused.
   ```
-
 - `OPENSEARCH_SOURCE_EXCLUDES` comma separated list of fields to exclude from the OpenSearch `_source` field. Leave unset to return all fields.
   - recommended value: `embedding_full_record,fulltext`
 - `PLATFORM_NAME`: The value set is added to the header after the MIT Libraries logo. The logic and CSS for this comes from our theme gem.

app/graphql/types/query_type.rb

Lines changed: 7 additions & 4 deletions
@@ -66,6 +66,8 @@ def record_id(id:, index:)
     argument :boolean_type, String, required: false, default_value: 'OR',
                                     description: 'How to join multiword queries. Defaults to "OR" which means any ' \
                                                  'of the words much match. Options include: "OR", "AND"'
+    argument :query_mode, String, required: false, default_value: 'keyword',
+                                  description: 'Search mode, either "keyword" or "semantic"'
 
     # applied filters
     argument :access_to_files_filter, [String],
@@ -103,11 +105,11 @@ def record_id(id:, index:)
     end
 
     def search(searchterm:, citation:, contributors:, funding_information:, geodistance:, geobox:, identifiers:,
-               locations:, subjects:, title:, index:, source:, from:, boolean_type:, fulltext:, per_page: 20, **filters)
+               locations:, subjects:, title:, index:, source:, from:, boolean_type:, fulltext:, per_page: 20, query_mode: 'keyword', **filters)
       query = construct_query(searchterm, citation, contributors, funding_information, geodistance, geobox, identifiers,
-                              locations, subjects, title, source, boolean_type, filters, per_page)
+                              locations, subjects, title, source, boolean_type, filters, per_page, query_mode)
 
-      results = Opensearch.new.search(from, query, Timdex::OSClient, highlight_requested?, index, fulltext)
+      results = Opensearch.new.search(from, query, Timdex::OSClient, highlight: highlight_requested?, index: index, fulltext: fulltext, query_mode: query_mode)
 
       response = {}
       response[:hits] = results['hits']['total']['value']
@@ -135,9 +137,10 @@ def inject_hits_fields_into_source(hits)
     end
 
     def construct_query(searchterm, citation, contributors, funding_information, geodistance, geobox, identifiers,
-                        locations, subjects, title, source, boolean_type, filters, per_page)
+                        locations, subjects, title, source, boolean_type, filters, per_page, query_mode = 'keyword')
       query = {}
       query[:q] = searchterm
+      query[:query_mode] = query_mode
       query[:boolean_type] = boolean_type
       query[:citation] = citation
       query[:contributors] = contributors
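As a rough illustration of how a client might exercise the new argument, the sketch below builds a GraphQL request body in Ruby. It assumes the `query_mode` argument is exposed in camel case as `queryMode` (as the commit message states) and that `hits` is a selectable field on the search response; the variable names `$q` and `$mode` and the search text are made up for the example, and no endpoint is contacted.

```ruby
require 'json'

# GraphQL document selecting semantic mode. The `hits` selection is an
# assumption based on `response[:hits]` in the resolver.
graphql_document = <<~GRAPHQL
  query Search($q: String, $mode: String) {
    search(searchterm: $q, queryMode: $mode) {
      hits
    }
  }
GRAPHQL

# Omitting `mode` would fall back to the argument's default of "keyword".
request_body = JSON.generate(
  query: graphql_document,
  variables: { q: 'climate adaptation', mode: 'semantic' }
)

puts request_body
```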

app/models/lexical_query_builder.rb

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 class LexicalQueryBuilder
-  def build(params, fulltext = false)
+  def build(params, fulltext: false)
     {
       bool: {
         should: multisearch(params),

app/models/opensearch.rb

Lines changed: 10 additions & 4 deletions
@@ -1,13 +1,13 @@
-# rubocop:disable Metrics/ClassLength
 # rubocop:disable Metrics/MethodLength
 class Opensearch
   SIZE = 20
   MAX_SIZE = 200
 
-  def search(from, params, client, highlight = false, index = nil, fulltext = false)
+  def search(from, params, client, highlight: false, index: nil, fulltext: false, query_mode: 'keyword')
     @params = params
     @highlight = highlight
     @fulltext = fulltext?(fulltext)
+    @query_mode = query_mode
     index = default_index unless index.present?
     client.search(index:,
                   body: build_query(from))
@@ -54,8 +54,14 @@ def build_query(from)
 
   # Build the query portion of the elasticsearch json
   def query
-    @query_strategy ||= LexicalQueryBuilder.new
-    @query_strategy.build(@params, @fulltext)
+    builder = case @query_mode
+              when 'semantic'
+                SemanticQueryBuilder.new
+              else
+                LexicalQueryBuilder.new
+              end
+
+    builder.build(@params, fulltext: @fulltext)
   end
 
   def sort_builder
def sort_builder
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+class SemanticQueryBuilder
+  def build(params, fulltext: false)
+    query_text = params[:q].to_s.strip
+
+    # If no query text provided, return a match_all query (consistent with keyword search behavior)
+    return { match_all: {} } if query_text.blank?
+
+    lambda_response = invoke_semantic_builder(query_text)
+    parse_lambda_response(lambda_response)
+  end
+
+  private
+
+  def invoke_semantic_builder(query_text)
+    payload = { query: query_text }
+
+    response = Timdex::LambdaClient.invoke(
+      function_name: ENV.fetch('TIMDEX_SEMANTIC_BUILDER_FUNCTION_NAME'),
+      invocation_type: 'RequestResponse',
+      payload: payload.to_json
+    )
+
+    parse_lambda_payload(response.payload)
+  rescue StandardError => e
+    raise "Semantic query builder Lambda error: #{e.message}"
+  end
+
+  def parse_lambda_payload(payload)
+    # AWS Lambda response payload can be an IO-like object (e.g., StringIO) or a string
+    payload_str = if payload.respond_to?(:read)
+                    payload.read
+                  else
+                    payload.to_s
+                  end
+    JSON.parse(payload_str)
+  rescue JSON::ParserError => e
+    raise "Invalid JSON response from semantic query builder: #{e.message}"
+  end
+
+  def parse_lambda_response(lambda_response)
+    # Lambda returns: { "query": { "bool": { "should": [...] } } }
+    # We extract and return just the inner query object
+    raise "Invalid semantic query builder response: missing 'query' key" unless lambda_response.key?('query')
+
+    query = lambda_response['query']
+    raise 'Invalid semantic query builder response: query must be a Hash' unless query.is_a?(Hash)
+
+    query
+  end
+end
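The payload handling in this class can be tried without AWS. The sketch below condenses the two parsing steps into one hypothetical helper, `extract_query`, and feeds it hand-written payloads; the JSON bodies are invented for illustration and are not real Lambda output.

```ruby
require 'json'
require 'stringio'

# Hypothetical helper combining the read, parse, and validate steps from
# the diff into one method for demonstration purposes.
def extract_query(payload)
  # Accept either an IO-like object or a plain string.
  raw = payload.respond_to?(:read) ? payload.read : payload.to_s
  parsed = JSON.parse(raw)

  raise "missing 'query' key" unless parsed.key?('query')
  raise 'query must be a Hash' unless parsed['query'].is_a?(Hash)

  parsed['query']
end

# An IO-like payload and a plain string payload both yield the inner query.
from_io     = extract_query(StringIO.new('{"query":{"bool":{"should":[]}}}'))
from_string = extract_query('{"query":{"match_all":{}}}')

# A payload without a "query" key is rejected with a RuntimeError.
error_message = begin
  extract_query('{"hits":[]}')
rescue RuntimeError => e
  e.message
end

puts from_io, from_string, error_message
```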

config/initializers/lambda.rb

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+require 'aws-sdk-lambda'
+
+def configure_lambda_client
+  Aws::Lambda::Client.new(
+    region: ENV.fetch('AWS_REGION', 'us-east-1'),
+    access_key_id: ENV.fetch('AWS_ACCESS_KEY_ID'),
+    secret_access_key: ENV.fetch('AWS_SECRET_ACCESS_KEY')
+  )
+end
+
+Timdex::LambdaClient = configure_lambda_client

config/initializers/opensearch.rb

Lines changed: 6 additions & 6 deletions
@@ -15,20 +15,20 @@ def os_client
 def aws_os_client
   OpenSearch::Client.new log: ENV.fetch('OPENSEARCH_LOG', false), url: ENV['OPENSEARCH_URL'] do |config|
     # personal keys use expiring credentials with tokens
-    if ENV.fetch('AWS_OPENSEARCH_SESSION_TOKEN', false)
+    if ENV['AWS_SESSION_TOKEN'].present?
       config.request :aws_sigv4,
                      service: 'es',
                      region: ENV['AWS_REGION'],
-                     access_key_id: ENV['AWS_OPENSEARCH_ACCESS_KEY_ID'],
-                     secret_access_key: ENV['AWS_OPENSEARCH_SECRET_ACCESS_KEY'],
-                     session_token: ENV['AWS_OPENSEARCH_SESSION_TOKEN']
+                     access_key_id: ENV['AWS_ACCESS_KEY_ID'],
+                     secret_access_key: ENV['AWS_SECRET_ACCESS_KEY'],
+                     session_token: ENV['AWS_SESSION_TOKEN']
     # application keys don't use tokens
     else
       config.request :aws_sigv4,
                      service: 'es',
                      region: ENV['AWS_REGION'],
-                     access_key_id: ENV['AWS_OPENSEARCH_ACCESS_KEY_ID'],
-                     secret_access_key: ENV['AWS_OPENSEARCH_SECRET_ACCESS_KEY']
+                     access_key_id: ENV['AWS_ACCESS_KEY_ID'],
+                     secret_access_key: ENV['AWS_SECRET_ACCESS_KEY']
     end
   end
 end
