OrcaGlue

OrcaGlue – Frontline ETL for Pipeline Automation Data Warehouse

Development

A Python shop!

This project is all about writing a Python script that runs on AWS Glue as an ETL job.

A typical ETL script should be a small, focused task.
Hence, the repo is structured as a monorepo.
Multiple ETL modules are organised into subdirectories.
Each module has its own README file to follow.
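
As an illustration of what a small, focused ETL script might look like, here is a minimal sketch. It is not taken from any module in this repo. The job arguments (`bucket`, `input_key`, `output_key`), the S3 in/out layout and the cleaning rule are all assumptions made for the example. `getResolvedOptions` is the standard Glue helper for reading job arguments; it is only importable inside the Glue runtime, so the sketch falls back gracefully when run locally.

```python
import csv
import io
import sys

try:
    # awsglue is provided by the Glue runtime and is normally absent locally.
    from awsglue.utils import getResolvedOptions
except ImportError:
    getResolvedOptions = None


def transform(rows):
    """Strip whitespace from headers and values; drop rows with no values."""
    cleaned = []
    for row in rows:
        stripped = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key is not None  # DictReader files surplus cells under None
        }
        if any(stripped.values()):
            cleaned.append(stripped)
    return cleaned


def run(csv_text):
    """Extract rows from CSV text, transform them, return CSV text to load."""
    rows = transform(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def main():
    # Hypothetical job arguments: --bucket, --input_key, --output_key.
    args = getResolvedOptions(sys.argv, ["bucket", "input_key", "output_key"])
    import boto3  # bundled with the Glue runtime

    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=args["bucket"], Key=args["input_key"])["Body"]
    result = run(body.read().decode("utf-8"))
    s3.put_object(
        Bucket=args["bucket"],
        Key=args["output_key"],
        Body=result.encode("utf-8"),
    )


if __name__ == "__main__" and getResolvedOptions is not None:
    main()
```

Keeping `transform` and `run` free of any AWS calls means they can be unit tested locally, while `main` stays a thin wrapper around the Glue and S3 plumbing.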

Create a Python virtual environment (any method) and install the dev toolchain requirements.

See README_DEV.md for the Python version requirement and more comprehensive setup details.

conda activate oncoglue
make install
make check

New ETL Module

Create a new project using the template and go to the project directory.

cp -R _template sample-job
cd sample-job
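
If you would rather script the scaffolding step, the copy above can be sketched in Python as follows. The helper name `scaffold_module` is hypothetical and not part of this repo. Unlike a bare `cp -R`, it refuses to copy into a module directory that already exists.

```python
import shutil
from pathlib import Path


def scaffold_module(repo_root, name, template="_template"):
    """Copy the template directory to a new module directory and return its path."""
    root = Path(repo_root)
    src = root / template
    dst = root / name
    if not src.is_dir():
        raise FileNotFoundError(f"template directory not found: {src}")
    if dst.exists():
        # Avoid silently nesting or overwriting an existing module.
        raise FileExistsError(f"module already exists: {dst}")
    shutil.copytree(src, dst)
    return dst
```

For example, `scaffold_module(".", "sample-job")` run from the repo root mirrors the `cp -R _template sample-job` command.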

Log in to the Pulumi backend. This requires an authenticated AWS session.

pulumi whoami --verbose --non-interactive

export AWS_PROFILE=unimelb-warehouse-prod-admin
aws sso login

pulumi login s3://pulumi-state-115253169271-ap-southeast-2-an/orcaglue

Initialize the Pulumi dev stack.

pulumi stack init dev --secrets-provider="awskms://alias/pulumi-state-key"

Deploy the ETL.

pulumi stack select dev
pulumi stack ls
pulumi preview
pulumi up
pulumi stack output
pulumi stack
pulumi stack --show-urns
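
To use the stack outputs from a script, for example to look up the deployed job name, one option is to wrap `pulumi stack output --json`. This is a minimal sketch: the function name `stack_outputs` is hypothetical, and the output keys depend on what each module's Pulumi program exports. The `run` parameter exists only so the command runner can be swapped out in tests.

```python
import json
import subprocess


def stack_outputs(stack="dev", cwd=".", run=subprocess.run):
    """Return the outputs of a Pulumi stack as a dict."""
    result = run(
        ["pulumi", "stack", "output", "--json", "--stack", stack],
        cwd=cwd,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the pulumi command fails
    )
    return json.loads(result.stdout)
```

Call it from the module directory, or pass `cwd`, after logging in to the Pulumi backend as described above.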

Try running the job via the AWS CLI. You may also do so via the Glue Console UI.

aws glue list-jobs
aws glue start-job-run --job-name orcaglue-dev-sample-job-job
aws glue get-job-run --job-name orcaglue-dev-sample-job-job --run-id jr_1cd13010b965e071fee72fa776211224feb6b3e0f2d42f7fe87178485cdeab65
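
The start-then-check sequence above can also be scripted. The following is a minimal sketch using the boto3 Glue client's `start_job_run` and `get_job_run` calls. The helper name `run_glue_job` and the polling interval are assumptions for the example. The client is passed in so the function can be exercised without AWS access.

```python
import time

# Job run states after which Glue will not change the run any further.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_glue_job(glue, job_name, poll_seconds=30, sleep=time.sleep):
    """Start a Glue job run, poll until it finishes, return (run_id, state)."""
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        job_run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = job_run["JobRunState"]
        if state in TERMINAL_STATES:
            return run_id, state
        sleep(poll_seconds)


# Usage, with an authenticated AWS session:
#   import boto3
#   run_id, state = run_glue_job(boto3.client("glue"), "orcaglue-dev-sample-job-job")
```

The returned run ID is the same `jr_...` value that `aws glue get-job-run --run-id` expects, so you can still inspect the run from the CLI or the Glue Console afterwards.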

Tear down the stack.

pulumi destroy
pulumi stack rm dev

Clean up the project directory.

cd ..
rm -rf sample-job

Simple Data Loading

Sometimes you have a use case that needs a simple data loading job without going through the Glue ETL pipeline yet.
This may be a use case that you are still exploring before fully committing to the Glue ETL pipeline setup.
For these use cases, you can use the simplified Redshift data loading via Query Editor.

See README_REDSHIFT.md for more details.

