-
Notifications
You must be signed in to change notification settings - Fork 15
Expand file tree
/
Copy path065-connect-to-adventureworks-with-dplyr.Rmd
More file actions
152 lines (118 loc) · 6.14 KB
/
065-connect-to-adventureworks-with-dplyr.Rmd
File metadata and controls
152 lines (118 loc) · 6.14 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
# Connecting to the database with R code{#chapter_connect-to-db-with-r-code}
> This chapter demonstrates how to:
>
> * Connect to and disconnect R from the `adventureworks` database
> * Use dplyr to get an overview of the database, replicating the facilities provided by RStudio
These packages are called in this Chapter:
```{r setup, echo=TRUE, message=FALSE, warning=FALSE}
library(tidyverse)
library(DBI)
library(RPostgres)
library(glue)
require(knitr)
library(dbplyr)
library(sqlpetr)
library(bookdown)
library(here)
library(connections)
sleep_default <- 3
```
## Verify that Docker is up and running, and start the database
> The `sp_check_that_docker_is_up` function from the `sqlpetr` package checks whether Docker is up and running. If it's not, then you need to install, launch or re-install Docker.
```{r docker verify}
sp_check_that_docker_is_up()
```
```{r}
sp_docker_start("adventureworks")
```
## Connect to PostgreSQL
*CHECK for `sqlpetr` update!` The `sp_make_simple_pg` function we called above created a container from the
`postgres:11` library image downloaded from Docker Hub. As part of the process, it set the password for the PostgreSQL database superuser `postgres` to the value
"postgres".
For simplicity, we are using a weak password at this point and it's shown here
and in the code in plain text. That is bad practice because user credentials
should not be shared in open code like that. A [subsequent chapter](#dbms-login)
demonstrates how to store and use credentials to access the DBMS so that they
are kept private.
> The `sp_get_postgres_connection` function from the `sqlpetr` package gets a DBI connection string to a PostgreSQL database, waiting if it is not ready. This function connects to an instance of PostgreSQL and we assign it to a symbol, `con`, for subsequent use. The `connctions_tab = TRUE` parameter opens a connections tab that's useful for navigating a database.
> Note that we are using port *5439* for PostgreSQL inside the container and published to `localhost`. Why? If you have PostgreSQL already running on the host or another container, it probably claimed port 5432, since that's the default. So we need to use a different port for *our* PostgreSQL container.
Use the DBI package to connect to the `adventureworks` database in PostgreSQL. Remember the settings discussion about [keeping passwords hidden][Pause for some security considerations]
```{r }
# con <- connection_open( # use in an interactive session
Sys.sleep(sleep_default)
con <- dbConnect(
RPostgres::Postgres(),
# without the previous and next lines, some functions fail with bigint data
# so change int64 to integer
bigint = "integer",
host = "localhost",
port = 5432,
user = "postgres",
password = "postgres",
dbname = "adventureworks")
```
## Set schema search path and list its contents
Schemas will be discussed later on because multiple schemas are the norm in an enterprise database environment, but they are a side issue at this point. So we switch the order in which PostgreSQL searches for objects with the following SQL code:
```{r}
dbExecute(con, "set search_path to sales;")
```
With the custom `search_path`, the following command shows the tables in the `sales` schema. In the `adventureworks` database, there are no tables in the `public` schema.
```{r}
dbListTables(con)
```
Notice there are several tables that start with the letter *v*: they are actually *views* which will turn out to be important. They are clearly distinguished in the connections tab, but the naming is a matter of convention.
Same for `dbListFields`:
```{r }
dbListFields(con, "salesorderheader")
```
Thus with this search order, the following two produce identical results:
```{r }
tbl(con, in_schema("sales", "salesorderheader")) %>%
head()
tbl(con, "salesorderheader") %>%
head()
```
## Anatomy of a `dplyr` connection object
As introduced in the previous chapter, the `dplyr::tbl` function creates an object that might **look** like a data frame in that when you enter it on the command line, it prints a bunch of rows from the dbms table. But it is actually a **list** object that `dplyr` uses for constructing queries and retrieving data from the DBMS.
The following code illustrates these issues. The `dplyr::tbl` function creates the connection object that we store in an object named `salesorderheader_table`:
```{r}
salesorderheader_table <- dplyr::tbl(con, in_schema("sales", "salesorderheader")) %>%
select(-rowguid) %>%
rename(salesorderheader_details_updated = modifieddate)
```
At first glance, it _acts_ like a data frame when you print it, although it only prints 10 of the table's 31,465 rows:
```{r}
salesorderheader_table
```
However, notice that the first output line shows `??`, rather than providing the number of rows in the table. Similarly, the next to last line shows:
```
… with more rows, and 20 more variables:
```
whereas the output for a normal `tbl` of this salesorderheader data would say:
```
… with 31,455 more rows, and 20 more variables:
```
So even though `salesorderheader_table` is a `tbl`, it's **also** a `tbl_PqConnection`:
```{r}
class(salesorderheader_table)
```
It is not just a normal `tbl` of data. We can see that from the structure of `salesorderheader_table`:
```{r}
str(salesorderheader_table, max.level = 3)
```
It has only _two_ rows! The first row contains all the information in the `con` object, which contains information about all the tables and objects in the database. Here is a sample:
```{r}
salesorderheader_table$src$con@typnames$typname[387:418]
```
The second row contains a list of the columns in the `salesorderheader` table, among other things:
```{r}
salesorderheader_table$ops$x$vars
```
`salesorderheader_table` holds information needed to get the data from the 'salesorderheader' table, but `salesorderheader_table` does not hold the data itself. In the following sections, we will examine more closely this relationship between the `salesorderheader_table` object and the data in the database's 'salesorderheader' table.
## Disconnect from the database and stop Docker
```{r}
dbDisconnect(con)
# or if using the connections package, use:
# connection_close(con)
sp_docker_stop("adventureworks")
```