Database as an API data store
Soon after we acquire more users or more data, there are problems no data engineer wants to see. How we can make our lives easier by preparing ahead?
Why do we need databases?
Our small CSV based Flask application proves that we can build an API with all the operations needed for exposing the data to users and applications. However, this app would most likely break down if multiple users tried to change the data at the same time. Before you ask, yes, we could have upgraded the application to have a mechanism to allow multiple users - but we would spend a lot of time building something from scratch, something that databases are made for. Out of the sea of databases, we will pick a tried out, lightweight solution that enables all of our business requirements and even introduces some cool extra features.
To simplify the compatibility with different infrastructure setups, we will implement the db in Docker which is a platform for running containers as processes on a host. In our example, our host will be our PC but you can use a different infrastructure and the recipes should work - one of the benefits of using a container architecture. It will also help us with deployments. Install Docker here. If you installed everything correctly, Docker CLI should work and you can check it by running:
Initializing the database container
We will use the PostgreSQL DB. It already has a prepared, easily downloadable, image and a simple deployment. All docker commands will start with the keyword docker. You start the database with:
docker run -p 5432:5432 --name postgres-db -v data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=mysecretpassword -d postgres
The docker run command specifies that we want to instantiate a new isolated container. The name of the container will be postgres-db. If you do not name it, every time you rerun, the container docker will generate a new name. The explicit naming will come in handy later. By default, all container ports are non-accessible from your PC. The -p flag exposes a container’s internal port in the format hostPort:containerPort. It means we will access the database on host port 5432. The -v flag will mount a volume from the container to the host disk. The volume is a mechanism for persisting data which will be stored in the database to a file or files on our PC. Without it, any time we would restart the container - we would lose all the data. The format we are using is [volume_name]:[path_where_the_directory_is_mounted_in_the_container]. The idea is that you predefine the volume first and then start the container, but we skipped this process. If you start a container with a volume that does not exist yet, docker will create the volume. The flag -e sets the environment variables. In our case, we will download a pre-prepared postgres image which expects us to set up an initial password for the database. The -d flag runs the container in detached mode, which means after we execute the command, we will be back in our command prompt and the container will be running in a background process. The alternative is to run it in interactive (-it) mode which will open a shell directly in the container. The last parameter is the name of the image (postgres) for which docker will search its image store. The image is a docker official image and more info can be found here. By running:
we can check if our container process is running along with information like container_id (in my case 58ac91842020, but it will be different for you), the image we have used (postgres), status, the ports we have exposed and the name of the container. In case of any issues with the status, we can check the logs with:
docker logs 58ac91842020
docker logs postgres-db
since we explicitly named the container during execution of the run command.
Accessing the database over a client
The next step is to check the database and prepare a table. You don’t need to install a special client with a GUI for this step, but usually data engineers have a tool that allows them to connect to multiple different databases with a pretty GUI to execute queries, edit stored procedures etc. One such client is DBeaver. On the Mac you can install it with:
brew cask install dbeaver-community
For another OS check this link here.
When you open up DBeaver, there should be a Databases window on the left side. Click on the New Database button, A new window should open and it should look like this:
Choose PostgreSQL and click next. If there are any missing drivers, DBeaver will download them for you. Remember that we deployed the database locally on port 5432. On the next window, add localhost as host, 5432 as port, and postgres as database. You still need to add a username and password, which we defined as postgres and mysecretpassword. Click on Test Connection and if everything is ok, you should get something like this:
Click Finish and double click on the connection on the left side of the main window. The database we created is named postgres and by default it has 3 schemas:information_schema, pg_catalog and public. We will use public for now. The database tree should look like this:
If you remember, in our first blog post the data we used were daily statistics of COVID cases generated by Our World in Data. The data was maintained on this github. The actual CSV was this file, but somewhat simplified for our API. Now we will use the full file with all columns. In order to store the data in the database, we need to create a table with the same structure as the data in CSV. We can do this using the DBeaver client, but let’s first try to do it by connecting to the db container. If you are in a hurry, skip this paragraph, go to “Table creation” and run command through DBeaver.
Accessing the container
With the command:
docker exec -it postgres-api psql -U postgres
we can connect into the db container and run the Postgres command line. The docker exec command runs commands in the container. We specified the flag -it, which means we will be running in interactive mode in order to try out some commands with the Postgres command tool. We will be using the db container, which we named postgres-db. Now comes the part of the exec command which actually specifies what we want to execute, and we want to run the Postgres command shell with the parameter -U which specifies that the user we are using is postgres. We don’t have to specify the password because the super user postgres is part of a trust authentication and is used to set up the db through the docker interface. When the command line initializes, write
This will print all available databases. You should see the default postgres database and possibly some template databases depending on the distribution we picked. Let’s switch to the postgres database with the command:
Let’s print the current schemas. Run query:
SELECT schema_name FROM information_schema.schemata;
You should see something like: schema_name -------------------- pg_toast pg_catalog public information_schema (4 rows)
We will use the public schema. Let’s review: we can connect to the database instance (or server in the true postgres nomenclature) over our local client and through the psql client located on the container. The database postgres exists and it contains the public schema.
Let’s create the table. Run:
CREATE TABLE public.owid_covid_data ( iso_code varchar(20) NOT NULL, continent varchar(20) NOT NULL, "location" varchar(50) NOT NULL, "date" date NOT NULL, total_cases numeric(16,2) NULL, new_cases numeric(16,2) NULL, new_cases_smoothed numeric(16,2) NULL, total_deaths numeric(16,2) NULL, new_deaths numeric(16,2) NULL, new_deaths_smoothed numeric(16,2) NULL, total_cases_per_million numeric(16,2) NULL, new_cases_per_million numeric(16,2) NULL, new_cases_smoothed_per_million numeric(16,2) NULL, total_deaths_per_million numeric(16,2) NULL, new_deaths_per_million numeric(16,2) NULL, new_deaths_smoothed_per_million numeric(16,2) NULL, icu_patients numeric(16,2) NULL, icu_patients_per_million numeric(16,2) NULL, hosp_patients numeric(16,2) NULL, hosp_patients_per_million numeric(16,2) NULL, weekly_icu_admissions numeric(16,2) NULL, weekly_icu_admissions_per_million numeric(16,2) NULL, weekly_hosp_admissions numeric(16,2) NULL, weekly_hosp_admissions_per_million numeric(16,2) NULL, total_tests numeric(16,2) NULL, new_tests numeric(16,2) NULL, total_tests_per_thousand numeric(16,2) NULL, new_tests_per_thousand numeric(16,2) NULL, new_tests_smoothed numeric(16,2) NULL, new_tests_smoothed_per_thousand numeric(16,2) NULL, tests_per_case numeric(16,2) NULL, positive_rate numeric(16,2) NULL, tests_units varchar(50) NULL, stringency_index numeric(16,2) NULL, population numeric(16,2) NULL, population_density numeric(16,2) NULL, median_age numeric(16,2) NULL, aged_65_older numeric(16,2) NULL, aged_70_older numeric(16,2) NULL, gdp_per_capita numeric(16,2) NULL, extreme_poverty numeric(16,2) NULL, cardiovasc_death_rate numeric(16,2) NULL, diabetes_prevalence numeric(16,2) NULL, female_smokers numeric(16,2) NULL, male_smokers numeric(16,2) NULL, handwashing_facilities numeric(16,2) NULL, hospital_beds_per_thousand numeric(16,2) NULL, life_expectancy numeric(16,2) NULL, human_development_index numeric(16,2) null, CONSTRAINT owid_covid_data_pk PRIMARY KEY (iso_code, continent, location, date) );
Loading the data
The next step is to populate the table with the data. We will also do this by connecting to the container. First, we need to copy the COVID stats csv file into the container with the command:
docker cp owid-covid-data.csv postgres-api:/
We are using the copy command with the first parameter being the file name we want to copy. Make sure that you are located in the directory where the file is located. The second parameter is the name of the container to which we want to transfer the file. The format for the parameter is CONTAINER: DEST_PATH. We specified / for the destination path which means it will store it in the root folder on the container. Let’s check the file. Run the command:
docker exec -it postgres-api /bin/bash
Again, we are running the exec command in interactive mode. The first parameter is the container name, the second is the command we want to run when connected to the container which in our case will be the bash shell. After running the command, we should see the prompt specifying we are connected as root. A simple ls command will show us if we copied the file properly. Let’s connect to psql command line again with the command:
psql -U postgres
Unfortunately, our file is not properly formatted for a simple data transfer, so we will need to clean it. You do not want to do it manually by editing the file, as it gets really complicated when the number of rows and columns is large. We will create a staging table for our data and then clean it through SQL. Run the command:
CREATE TABLE public.owid_covid_data_stg ( iso_code varchar(50) NULL, continent varchar(20) NULL, "location" varchar(50) NULL, "date" varchar(50) NOT NULL, total_cases varchar(50) NULL, new_cases varchar(50) NULL, new_cases_smoothed varchar(50) NULL, total_deaths varchar(50) NULL, new_deaths varchar(50) NULL, new_deaths_smoothed varchar(50) NULL, total_cases_per_million varchar(50) NULL, new_cases_per_million varchar(50) NULL, new_cases_smoothed_per_million varchar(50) NULL, total_deaths_per_million varchar(50) NULL, new_deaths_per_million varchar(50) NULL, new_deaths_smoothed_per_million varchar(50) NULL, icu_patients varchar(50) NULL, icu_patients_per_million varchar(50) NULL, hosp_patients varchar(50) NULL, hosp_patients_per_million varchar(50) NULL, weekly_icu_admissions varchar(50) NULL, weekly_icu_admissions_per_million varchar(50) NULL, weekly_hosp_admissions varchar(50) NULL, weekly_hosp_admissions_per_million varchar(50) NULL, total_tests varchar(50) NULL, new_tests varchar(50) NULL, total_tests_per_thousand varchar(50) NULL, new_tests_per_thousand varchar(50) NULL, new_tests_smoothed varchar(50) NULL, new_tests_smoothed_per_thousand varchar(50) NULL, tests_per_case varchar(50) NULL, positive_rate varchar(50) NULL, tests_units varchar(50) NULL, stringency_index varchar(50) NULL, population varchar(50) NULL, population_density varchar(50) NULL, median_age varchar(50) NULL, aged_65_older varchar(50) NULL, aged_70_older varchar(50) NULL, gdp_per_capita varchar(50) NULL, extreme_poverty varchar(50) NULL, cardiovasc_death_rate varchar(50) NULL, diabetes_prevalence varchar(50) NULL, female_smokers varchar(50) NULL, male_smokers varchar(50) NULL, handwashing_facilities varchar(50) NULL, hospital_beds_per_thousand varchar(50) NULL, life_expectancy varchar(50) NULL, human_development_index varchar(50) NULL );
You can see that we have specified all data types as varchars. This will help us load the data and parse it properly to be loaded into the main table. Now, run the command for loading the data from CSV into the table.
COPY owid_covid_data_stg FROM '/owid-covid-data.csv' WITH (HEADER, FORMAT csv);
This is a native postgres command which has the table name after the COPY keyword. The filename comes after the FROM keyword. After WITH we specify that the CSV has the header and we do not want to load it and that the format of the file is CSV.
As we can see from the data, there are records inside the CSV which designate the statistics for the whole world. They do not have the iso_code and continent values specified which makes problems with our primary key. We can change the primary key or opt to leave out the data. Let’s ignore those records and only send statistics by country to our main table as this will simplify our future API. Run:
INSERT INTO public.owid_covid_data select iso_code, continent, "location", date("date"), cast(total_cases as numeric), cast(new_cases as numeric), cast(new_cases_smoothed as numeric), cast(total_deaths as numeric), cast(new_deaths as numeric), cast(new_deaths_smoothed as numeric), cast(total_cases_per_million as numeric), cast(new_cases_per_million as numeric), cast(new_cases_smoothed_per_million as numeric), cast(total_deaths_per_million as numeric), cast(new_deaths_per_million as numeric), cast(new_deaths_smoothed_per_million as numeric), cast(icu_patients as numeric), cast(icu_patients_per_million as numeric), cast(hosp_patients as numeric), cast(hosp_patients_per_million as numeric), cast(weekly_icu_admissions as numeric), cast(weekly_icu_admissions_per_million as numeric), cast(weekly_hosp_admissions as numeric), cast(weekly_hosp_admissions_per_million as numeric), cast(total_tests as numeric), cast(new_tests as numeric), cast(total_tests_per_thousand as numeric), cast(new_tests_per_thousand as numeric), cast(new_tests_smoothed as numeric), cast(new_tests_smoothed_per_thousand as numeric), cast(tests_per_case as numeric), cast(positive_rate as numeric), tests_units, cast(stringency_index as numeric), cast(population as numeric), cast(population_density as numeric), cast(median_age as numeric), cast(aged_65_older as numeric), cast(aged_70_older as numeric), cast(gdp_per_capita as numeric), cast(extreme_poverty as numeric), cast(cardiovasc_death_rate as numeric), cast(diabetes_prevalence as numeric), cast(female_smokers as numeric), cast(male_smokers as numeric), cast(handwashing_facilities as numeric), cast(hospital_beds_per_thousand as numeric), cast(life_expectancy as numeric), cast(human_development_index as numeric) from public.owid_covid_data_stg where iso_code is not null and continent is not null;
You should now see the data in the table owid_covid_data if you check through psql or DBeaver. Let’s check by running a query to get data for Croatia:
Let’s review what we did:
we created a PostgreSQL database instance in a container through Docker
we used a persistent volume to store data which is located on our PC
we defined the data model through source data (CSV)
we deployed the data model in psql by connecting through the docker interface to the container
we loaded the data through the docker interface to the container
A setup like this is perfect for prototyping because we can workout a deployment process for any purpose, be it an API or a full application. But can it work in a production capacity? If we consider the DB as a building block of our full stack, by using docker, we are striving to be infrastructure-agnostic. For production we need to consider other things as well, like the simplicity of deployment, working in multiple environments, security, backups and performance. With that in mind, if we want to convert this to a production-ready store for an API, we need to consider upgrading a couple of things in the future:
On one side we have a fast deployment with one command, on the other side - for any production infrastructure, we would need to define the persistent volume first. Type and size of storage is important and we want to have full control over it.
The database is packed as an image which converts to a container when deployed and we could easily pack the rest of the stack together and fully simplify the requirements.
With such architecture, the deployment of the stack is just a set of recipes. However, containers usually have a strict access policy.
Since we would probably have multiple environments, we need to develop a migration strategy for changes in the data model.
We obviously cannot keep loading the data, either for initial load or ad-hoc changes, through the docker container interface. If our use-case is API, then the API should be the only entry point for the data.
Part 3 of our blog series will explore the scalable API which uses our containerized database.
Subscribe to our newsletter
We send bi-weekly blogs on design, technology and business topics.