Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions .env.test
Original file line number Diff line number Diff line change
Expand Up @@ -2,3 +2,7 @@ EMAIL_FROM=fake@example.com
EMAIL_URL_HOST=localhost:3000
JWT_SECRET_KEY=3862fc949629030de4259b88f6e8f7c3702b2fabfc68d00d46fb7f9f70110690b526997ef4d77765ffa010d8aba440286af39947d0c85287174d99be2db14987
OPENSEARCH_INDEX=all-current
AWS_ACCESS_KEY_ID=test-key
AWS_SECRET_ACCESS_KEY=test-secret
AWS_REGION=us-east-1
TIMDEX_SEMANTIC_BUILDER_FUNCTION_NAME=timdex-semantic-builder-test
2 changes: 2 additions & 0 deletions Gemfile
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@ git_source(:github) { |repo| "https://github.com/#{repo}.git" }

ruby '3.4.8'

gem 'aws-sdk-lambda'
gem 'bootsnap', require: false
gem 'devise'
gem 'faraday_middleware-aws-sigv4'
Expand Down Expand Up @@ -48,6 +49,7 @@ group :test do
gem 'capybara'
gem 'climate_control'
gem 'minitest', '< 6' # required for Rails 7.2.3
gem 'mocha'
gem 'selenium-webdriver'
gem 'simplecov', require: false
gem 'simplecov-lcov', require: false
Expand Down
18 changes: 18 additions & 0 deletions Gemfile.lock
Original file line number Diff line number Diff line change
Expand Up @@ -90,6 +90,18 @@ GEM
rake (>= 10.4, < 14.0)
ast (2.4.3)
aws-eventstream (1.4.0)
aws-partitions (1.1232.0)
aws-sdk-core (3.244.0)
aws-eventstream (~> 1, >= 1.3.0)
aws-partitions (~> 1, >= 1.992.0)
aws-sigv4 (~> 1.9)
base64
bigdecimal
jmespath (~> 1, >= 1.6.1)
logger
aws-sdk-lambda (1.176.0)
aws-sdk-core (~> 3, >= 3.244.0)
aws-sigv4 (~> 1.5)
aws-sigv4 (1.12.1)
aws-eventstream (~> 1, >= 1.0.2)
base64 (0.3.0)
Expand Down Expand Up @@ -205,6 +217,7 @@ GEM
jekyll (>= 3.8, < 5.0)
jekyll-watch (2.2.1)
listen (~> 3.0)
jmespath (1.6.2)
json (2.18.1)
json-schema (6.1.0)
addressable (~> 2.8)
Expand Down Expand Up @@ -245,6 +258,8 @@ GEM
mini_mime (1.1.5)
mini_portile2 (2.8.9)
minitest (5.27.0)
mocha (3.1.0)
ruby2_keywords (>= 0.0.5)
msgpack (1.8.0)
multi_json (1.19.1)
net-http (0.9.1)
Expand Down Expand Up @@ -373,6 +388,7 @@ GEM
rubocop (>= 1.75.0, < 2.0)
rubocop-ast (>= 1.44.0, < 2.0)
ruby-progressbar (1.13.0)
ruby2_keywords (0.0.5)
rubyzip (2.4.1)
safe_yaml (1.0.5)
sass-embedded (1.97.2)
Expand Down Expand Up @@ -461,6 +477,7 @@ PLATFORMS

DEPENDENCIES
annotate
aws-sdk-lambda
bootsnap
byebug
capybara
Expand All @@ -479,6 +496,7 @@ DEPENDENCIES
lograge
minitest (< 6)
mitlibraries-theme!
mocha
opensearch-ruby
pg
puma
Expand Down
16 changes: 9 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -45,8 +45,8 @@ The test recording process is as follows:
- Set the following values to whatever cluster you want to connect to in `.env` (Note: `.env` is preferred over `.env.test` because it is already in our `.gitignore` and will work together with `.env.test`.)
- OPENSEARCH_URL
- AWS_OPENSEARCH=true
- AWS_OPENSEARCH_ACCESS_KEY_ID
- AWS_OPENSEARCH_SECRET_ACCESS_KEY
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- AWS_REGION
- Delete any cassette you want to regenerate (for new tests, you can skip this). If you are making a graphql test, nest your cassette inside the `opensearch_init` cassette.

Expand Down Expand Up @@ -158,29 +158,31 @@ locally.

## Production required Environment Variables

- `AWS_OPENSEARCH`: boolean. Set to true to enable AWSv4 Signing
- `AWS_OPENSEARCH_ACCESS_KEY_ID`
- `AWS_OPENSEARCH_SECRET_ACCESS_KEY`
- `AWS_REGION`
- `AWS_ACCESS_KEY_ID`: AWS credentials for OpenSearch and Lambda
- `AWS_SECRET_ACCESS_KEY`: AWS credentials for OpenSearch and Lambda
- `AWS_REGION`: AWS region for OpenSearch and Lambda services
- `AWS_OPENSEARCH`: boolean. Set to true to enable AWSv4 Signing for OpenSearch
- `OPENSEARCH_INDEX`: Opensearch index or alias to query, default will be to search all indexes which is generally not
expected. `timdex` or `all-current` are aliases used consistently in our data pipelines, with
`timdex` being most likely what most use cases will want.
- `OPENSEARCH_URL`: Opensearch URL, defaults to `http://localhost:9200`
- `TIMDEX_SEMANTIC_BUILDER_FUNCTION_NAME`: AWS Lambda function name with alias for semantic query building.
Configurable to use alternative deployment tiers (e.g., dev1, stage, prod).
- `SMTP_ADDRESS`
- `SMTP_PASSWORD`
- `SMTP_PORT`
- `SMTP_USER`

## Optional Environment Variables (all ENVs)

- `AWS_SESSION_TOKEN`: AWS session token for temporary credentials when using expiring AWS credentials
- `OPENSEARCH_LOG` if `true`, verbosely logs OpenSearch queries.

```text
NOTE: do not set this ENV at all if you want ES logging fully disabled.
Setting it to `false` is still setting it and you will be annoyed and
confused.
```

- `OPENSEARCH_SOURCE_EXCLUDES` comma separated list of fields to exclude from the OpenSearch `_source` field. Leave unset to return all fields.
- recommended value: `embedding_full_record,fulltext`
- `PLATFORM_NAME`: The value set is added to the header after the MIT Libraries logo. The logic and CSS for this comes from our theme gem.
Expand Down
11 changes: 7 additions & 4 deletions app/graphql/types/query_type.rb
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,8 @@ def record_id(id:, index:)
argument :boolean_type, String, required: false, default_value: 'OR',
description: 'How to join multiword queries. Defaults to "OR" which means any ' \
'of the words much match. Options include: "OR", "AND"'
argument :query_mode, String, required: false, default_value: 'keyword',
description: 'Search mode, either "keyword" or "semantic"'

# applied filters
argument :access_to_files_filter, [String],
Expand Down Expand Up @@ -103,11 +105,11 @@ def record_id(id:, index:)
end

def search(searchterm:, citation:, contributors:, funding_information:, geodistance:, geobox:, identifiers:,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 2 issues:

1. Function with many parameters (count = 18): search [qlty:function-parameters]


2. Avoid parameter lists longer than 5 parameters. [18/5] [rubocop:Metrics/ParameterLists]

locations:, subjects:, title:, index:, source:, from:, boolean_type:, fulltext:, per_page: 20, **filters)
locations:, subjects:, title:, index:, source:, from:, boolean_type:, fulltext:, per_page: 20, query_mode: 'keyword', **filters)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line is too long. [143/120] [rubocop:Layout/LineLength]

query = construct_query(searchterm, citation, contributors, funding_information, geodistance, geobox, identifiers,
locations, subjects, title, source, boolean_type, filters, per_page)
locations, subjects, title, source, boolean_type, filters, per_page, query_mode)

results = Opensearch.new.search(from, query, Timdex::OSClient, highlight_requested?, index, fulltext)
results = Opensearch.new.search(from, query, Timdex::OSClient, highlight: highlight_requested?, index: index, fulltext: fulltext, query_mode: query_mode)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line is too long. [159/120] [rubocop:Layout/LineLength]


response = {}
response[:hits] = results['hits']['total']['value']
Expand Down Expand Up @@ -135,9 +137,10 @@ def inject_hits_fields_into_source(hits)
end

def construct_query(searchterm, citation, contributors, funding_information, geodistance, geobox, identifiers,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 2 issues:

1. Function with many parameters (count = 15): construct_query [qlty:function-parameters]


2. Avoid parameter lists longer than 5 parameters. [15/5] [rubocop:Metrics/ParameterLists]

locations, subjects, title, source, boolean_type, filters, per_page)
locations, subjects, title, source, boolean_type, filters, per_page, query_mode = 'keyword')
query = {}
query[:q] = searchterm
query[:query_mode] = query_mode
query[:boolean_type] = boolean_type
query[:citation] = citation
query[:contributors] = contributors
Expand Down
2 changes: 1 addition & 1 deletion app/models/lexical_query_builder.rb
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
class LexicalQueryBuilder
def build(params, fulltext = false)
def build(params, fulltext: false)
{
bool: {
should: multisearch(params),
Expand Down
14 changes: 10 additions & 4 deletions app/models/opensearch.rb
Original file line number Diff line number Diff line change
@@ -1,13 +1,13 @@
# rubocop:disable Metrics/ClassLength
# rubocop:disable Metrics/MethodLength
class Opensearch
SIZE = 20
MAX_SIZE = 200

def search(from, params, client, highlight = false, index = nil, fulltext = false)
def search(from, params, client, highlight: false, index: nil, fulltext: false, query_mode: 'keyword')
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 2 issues:

1. Function with many parameters (count = 7): search [qlty:function-parameters]


2. Avoid parameter lists longer than 5 parameters. [7/5] [rubocop:Metrics/ParameterLists]

@params = params
@highlight = highlight
@fulltext = fulltext?(fulltext)
@query_mode = query_mode
index = default_index unless index.present?
client.search(index:,
body: build_query(from))
Comment thread
qltysh[bot] marked this conversation as resolved.
Expand Down Expand Up @@ -54,8 +54,14 @@ def build_query(from)

# Build the query portion of the elasticsearch json
def query
@query_strategy ||= LexicalQueryBuilder.new
@query_strategy.build(@params, @fulltext)
builder = case @query_mode
when 'semantic'
SemanticQueryBuilder.new
else
LexicalQueryBuilder.new
end

builder.build(@params, fulltext: @fulltext)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, the pattern we put in place anticipating this is working as cleanly as we hoped.

end

def sort_builder
Expand Down
50 changes: 50 additions & 0 deletions app/models/semantic_query_builder.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
class SemanticQueryBuilder
def build(params, fulltext: false)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused method argument - fulltext. [rubocop:Lint/UnusedMethodArgument]

query_text = params[:q].to_s.strip

# If no query text provided, return a match_all query (consistent with keyword search behavior)
return { match_all: {} } if query_text.blank?

lambda_response = invoke_semantic_builder(query_text)
parse_lambda_response(lambda_response)
end

private

def invoke_semantic_builder(query_text)
payload = { query: query_text }

response = Timdex::LambdaClient.invoke(
function_name: ENV.fetch('TIMDEX_SEMANTIC_BUILDER_FUNCTION_NAME'),
invocation_type: 'RequestResponse',
payload: payload.to_json
)

parse_lambda_payload(response.payload)
rescue StandardError => e
raise "Semantic query builder Lambda error: #{e.message}"
Comment thread
qltysh[bot] marked this conversation as resolved.
end

def parse_lambda_payload(payload)
# AWS Lambda response payload can be an IO-like object (e.g., StringIO) or a string
payload_str = if payload.respond_to?(:read)
payload.read
else
payload.to_s
end
JSON.parse(payload_str)
rescue JSON::ParserError => e
raise "Invalid JSON response from semantic query builder: #{e.message}"
end

def parse_lambda_response(lambda_response)
# Lambda returns: { "query": { "bool": { "should": [...] } } }
# We extract and return just the inner query object
raise "Invalid semantic query builder response: missing 'query' key" unless lambda_response.key?('query')

query = lambda_response['query']
raise 'Invalid semantic query builder response: query must be a Hash' unless query.is_a?(Hash)

query
end
end
11 changes: 11 additions & 0 deletions config/initializers/lambda.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
require 'aws-sdk-lambda'

def configure_lambda_client
Aws::Lambda::Client.new(
region: ENV.fetch('AWS_REGION', 'us-east-1'),
access_key_id: ENV.fetch('AWS_ACCESS_KEY_ID'),
secret_access_key: ENV.fetch('AWS_SECRET_ACCESS_KEY')
)
end

Timdex::LambdaClient = configure_lambda_client
12 changes: 6 additions & 6 deletions config/initializers/opensearch.rb
Original file line number Diff line number Diff line change
Expand Up @@ -15,20 +15,20 @@ def os_client
def aws_os_client
OpenSearch::Client.new log: ENV.fetch('OPENSEARCH_LOG', false), url: ENV['OPENSEARCH_URL'] do |config|
# personal keys use expiring credentials with tokens
if ENV.fetch('AWS_OPENSEARCH_SESSION_TOKEN', false)
if ENV['AWS_SESSION_TOKEN'].present?
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make sure to have all of the affect ENV duplicated in all tiers before this merges please.

config.request :aws_sigv4,
service: 'es',
region: ENV['AWS_REGION'],
access_key_id: ENV['AWS_OPENSEARCH_ACCESS_KEY_ID'],
secret_access_key: ENV['AWS_OPENSEARCH_SECRET_ACCESS_KEY'],
session_token: ENV['AWS_OPENSEARCH_SESSION_TOKEN']
access_key_id: ENV['AWS_ACCESS_KEY_ID'],
secret_access_key: ENV['AWS_SECRET_ACCESS_KEY'],
session_token: ENV['AWS_SESSION_TOKEN']
# application keys don't use tokens
else
config.request :aws_sigv4,
service: 'es',
region: ENV['AWS_REGION'],
access_key_id: ENV['AWS_OPENSEARCH_ACCESS_KEY_ID'],
secret_access_key: ENV['AWS_OPENSEARCH_SECRET_ACCESS_KEY']
access_key_id: ENV['AWS_ACCESS_KEY_ID'],
secret_access_key: ENV['AWS_SECRET_ACCESS_KEY']
end
end
end
Expand Down
4 changes: 2 additions & 2 deletions test/models/lexical_query_builder_test.rb
Original file line number Diff line number Diff line change
Expand Up @@ -71,14 +71,14 @@ class LexicalQueryBuilderTest < ActiveSupport::TestCase
builder = LexicalQueryBuilder.new
params = { q: 'this' }

assert(builder.build(params, true).to_json.include?('"fields":["alternate_titles","call_numbers","citation","contents","contributors.value","dates.value","edition","funding_information.*","identifiers.value","languages","locations.value","notes.value","numbering","publication_information","subjects.value","summary","title","fulltext"]'))
assert(builder.build(params, fulltext: true).to_json.include?('"fields":["alternate_titles","call_numbers","citation","contents","contributors.value","dates.value","edition","funding_information.*","identifiers.value","languages","locations.value","notes.value","numbering","publication_information","subjects.value","summary","title","fulltext"]'))
Comment thread
JPrevost marked this conversation as resolved.
end

test 'fulltext is not included by default' do
builder = LexicalQueryBuilder.new
params = { q: 'this' }

assert(builder.build(params, false).to_json.include?('"fields":["alternate_titles","call_numbers","citation","contents","contributors.value","dates.value","edition","funding_information.*","identifiers.value","languages","locations.value","notes.value","numbering","publication_information","subjects.value","summary","title"]'))
assert(builder.build(params, fulltext: false).to_json.include?('"fields":["alternate_titles","call_numbers","citation","contents","contributors.value","dates.value","edition","funding_information.*","identifiers.value","languages","locations.value","notes.value","numbering","publication_information","subjects.value","summary","title"]'))
Comment thread
JPrevost marked this conversation as resolved.
end

test 'can search by geopoint' do
Expand Down
41 changes: 36 additions & 5 deletions test/models/opensearch_test.rb
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ class OpensearchTest < ActiveSupport::TestCase
# fragile test: assumes opensearch instance with at least one index in the `geo` alias
VCR.use_cassette('opensearch non-default index') do
params = { title: 'bermuda' }
results = Opensearch.new.search(0, params, Timdex::OSClient, false, 'geo')
results = Opensearch.new.search(0, params, Timdex::OSClient, highlight: false, index: 'geo')
assert results['hits']['hits'].map { |hit| hit['_index'] }.uniq.map { |index| index.start_with?('gis') }.any?
end
end
Expand All @@ -15,7 +15,7 @@ class OpensearchTest < ActiveSupport::TestCase
# that start with rdi*
VCR.use_cassette('opensearch default index') do
params = { title: 'data' }
results = Opensearch.new.search(0, params, Timdex::OSClient)
results = Opensearch.new.search(0, params, Timdex::OSClient, highlight: false, index: nil)
refute results['hits']['hits'].map { |hit| hit['_index'] }.uniq.map { |index| index.start_with?('rdi') }.any?
assert results['hits']['hits'].map { |hit| hit['_index'] }.uniq.any?
end
Expand All @@ -24,7 +24,7 @@ class OpensearchTest < ActiveSupport::TestCase
test 'searches a single field' do
VCR.use_cassette('opensearch single field') do
params = { title: 'spice it up' }
results = Opensearch.new.search(0, params, Timdex::OSClient)
results = Opensearch.new.search(0, params, Timdex::OSClient, highlight: false, index: nil)
assert_equal 'Spice it up!',
results['hits']['hits'].first['_source']['title']
end
Expand All @@ -33,7 +33,7 @@ class OpensearchTest < ActiveSupport::TestCase
test 'searches a single field with nested subfields' do
VCR.use_cassette('opensearch single field nested') do
params = { contributors: 'mcternan' }
results = Opensearch.new.search(0, params, Timdex::OSClient)
results = Opensearch.new.search(0, params, Timdex::OSClient, highlight: false, index: nil)
assert_equal 'A common table : 80 recipes and stories from my shared cultures',
results['hits']['hits'].first['_source']['title']
end
Expand All @@ -42,7 +42,7 @@ class OpensearchTest < ActiveSupport::TestCase
test 'searches multiple fields' do
VCR.use_cassette('opensearch multiple fields') do
params = { q: 'chinese', title: 'common', contributors: 'mcternan' }
results = Opensearch.new.search(0, params, Timdex::OSClient)
results = Opensearch.new.search(0, params, Timdex::OSClient, highlight: false, index: nil)
assert_equal 'A common table : 80 recipes and stories from my shared cultures',
results['hits']['hits'].first['_source']['title']
end
Expand Down Expand Up @@ -108,4 +108,35 @@ class OpensearchTest < ActiveSupport::TestCase
refute json.key?('_source')
end
end

test 'uses LexicalQueryBuilder by default when queryMode is keyword' do
os = Opensearch.new
os.instance_variable_set(:@params, { q: 'test' })
os.instance_variable_set(:@fulltext, false)
os.instance_variable_set(:@query_mode, 'keyword')

expected_query = { bool: { must: [{ match: { text: { query: 'test' } } }] } }
mock_builder = mock
mock_builder.stubs(:build).returns(expected_query)
LexicalQueryBuilder.expects(:new).once.returns(mock_builder)
result = os.query

assert result.is_a?(Hash)
assert_includes(result.keys, :bool)
end

test 'uses SemanticQueryBuilder when queryMode is semantic' do
os = Opensearch.new
os.instance_variable_set(:@params, { q: 'test' })
os.instance_variable_set(:@fulltext, false)
os.instance_variable_set(:@query_mode, 'semantic')

mock_response = { 'bool' => { 'should' => [{ 'rank_feature' => { 'field' => 'test', 'boost' => 1.0 } }] } }
mock_builder = mock
mock_builder.stubs(:build).returns(mock_response)
SemanticQueryBuilder.expects(:new).once.returns(mock_builder)

result = os.query
assert_equal(mock_response, result)
end
end
Loading
Loading