Tutorial

Seeding Data Into MongoDB Using Docker

Published by Luis Osta


Modern applications are, at least to some extent, data-rich. This means that applications often have features like the Twitter feed, aggregated statistics, friends/followers, and many others that rely on complex, inter-related data.

It's this data that provides the vast majority of an application's value. Twitter would be quite useless if all you could do was post and see a handful of others' tweets.

The biggest pitfall developers can fall into is re-using the production database for development.

Due to the complexity of this data, it's tempting during the early development of an application to use the same database for production as for development.

The thinking is: "if we need the most realistic data, what's more realistic than actual user-generated data?"

There are a few serious reasons why you should strongly consider not going that route:

  1. Even in a 1-person team, re-using the same database means any glitch or bug you create during development spreads to production. Nobody thinks they'd ever accidentally nuke production until it happens to them.
  2. In larger teams, the state sharing (due to database re-use) becomes an even bigger issue. Since every developer can affect the underlying data the application uses, it becomes harder to debug and performance-test specific queries and features, because the application state can change at any moment without you knowing.
  3. It ties your development to the availability and uptime of an external system. Even if your database is running on a powerful server with plenty of resources, there's no reason to add to your monthly bill or risk a developer DDoSing the database.

How can generated data solve those problems?

The combination of generated data and a locally running database prevents any of those problems from causing significant issues.

Even if you do nuke the database or DDoS yourself, it's a trivial task to refresh your development environment or re-generate the data you need.
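For example, with the Docker Compose setup we build later in this article, a full reset is just two commands:

# tear down the containers (and any volumes) holding the old state
docker-compose down -v

# rebuild and restart; the seeding container will regenerate the data
docker-compose up --build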

By reducing the number of external dependencies during development, we increase system consistency, solving the debugging and isolation issues.

But the additional value gained from data generation depends on two major factors:

  1. The quantity of data generated, which determines how realistic the dataset is and which issues may not be visible during development.
  2. How the data is generated, which affects the initial start time of the database and the long-term maintainability and modifiability of the data generation process.
    1. (i.e., if the scripts are a mess and difficult to use, they'll cause more problems than they're worth)

What are common approaches to getting data during development?

So what should we as developers do instead?

In order to have a setup that maximizes your chances of catching bugs and testing the real quality of the software being developed, the data powering the application must follow these rules:

  1. The data should be generated on the developer's computer and stored in a locally running database. This prevents any individual developer's bugs and problems from rippling out to other developers' machines.
  2. The amount of generated data should be large enough to make performance and UI issues clear, but not so large as to significantly slow down development.
  3. The generated data should only be created at runtime, and should not be committed into the actual codebase (see the example below).
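A simple way to enforce the third rule is a .gitignore entry. Assuming the generated file names used later in this article, something like:

# mongo/.gitignore - keep generated seed data out of version control
employeedata.json
computerdata.json
node_modules/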

Our Requirements

  1. Node & NPM
  2. Docker - You can install it directly from the documentation if you're on Mac or Linux. If you're on Windows, I went in-depth into your options for getting Docker in a previous article.
  3. Basic knowledge of React, Node and MongoDB. The files are provided if you're not familiar with React or Node (or both); the focus will be on the data generation for MongoDB.

Foundations

For this article, we will create a simple React application that renders a list of employees. Specifically, it will display:

  1. The employee name
  2. The employee title
  3. The department the employee works in
  4. The date they joined the company

Then, since the front-end needs to display the list of employees, the API will return an array of objects, one per employee stored in the DB, each with the aforementioned properties.
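As an illustration, the response body will look something like this (the values are made up, and Mongo will add fields like _id to each document):

{
  "employees": [
    {
      "name": "Jane Doe",
      "title": "Senior Software Engineer",
      "department": "IT",
      "joined": "2019-05-14T00:00:00.000Z"
    }
  ]
}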

Since we're focusing on the database side, we'll breeze through the rest of the application, but in a future article I'll dive deeper into how to build complex orchestrations with Docker Compose.

Client

For our front-end client, we'll be utilizing React to display the data stored on the MongoDB database. The client will make requests using axios to the Node API.

To get started, we will utilize Create React App to set up our baseline application, which we'll make a few changes to.

You can create a CRA application with the following command from the project root:

npx create-react-app client

Client Dependencies

Then, we will have to download the dependencies that we'll need for the React application. For our purposes, we only need axios and material-ui.

You can download them both with the following command (make sure you're in the client directory, not the project root):

npm i axios @material-ui/core --save

Getting Started On The Client

For our purposes, we will only be making changes to the App.js file, which in the starter project is the main component that displays content.

This is what that file should look like at the start:

import React from "react"
import logo from "./logo.svg"
import "./App.css"

function App() {
  return (
    <div className="App">
      <header className="App-header">
        <img src={logo} className="App-logo" alt="logo" />
        <p>
          Edit <code>src/App.js</code> and save to reload.
        </p>
        <a
          className="App-link"
          href="https://reactjs.org"
          target="_blank"
          rel="noopener noreferrer"
        >
          Learn React
        </a>
      </header>
    </div>
  )
}

export default App

The changes we will make to this file are in order:

  1. Remove all of the children of the <header> HTML tag
  2. Create a custom React Hook to call the API endpoint /api/employees and return the appropriate array of data
  3. Add a component within the <header> tag that will render each element of the array as a card

After the three steps, your App.js file will look something like this:

import React, { useState, useEffect } from "react"
import { Card, Grid, Typography, makeStyles } from "@material-ui/core"
import axios from "axios"
import "./App.css"

const useEmployees = () => {
  const [employees, setEmployees] = useState([])

  useEffect(() => {
    const handleAPI = async () => {
      const { data } = await axios.get("/api/employees")
      const newEmployees = data.employees || []
      setEmployees(newEmployees)
    }

    handleAPI()
  }, [])
  return employees
}

const useStyles = makeStyles((theme) => ({
  card: {
    padding: theme.spacing(5),
  },
}))

function App() {
  const employees = useEmployees()
  const classes = useStyles()
  return (
    <div className="App">
      <header className="App-header">
        <Grid container direction="column" spacing={2} alignItems="center">
          {employees.map((value, index) => {
            const { name, title, department, joined } = value
            const key = `${name}-${index}`
            return (
              <Grid item key={key}>
                <Card raised className={classes.card}>
                  <Typography variant="h4">{name}</Typography>
                  <Typography variant="subtitle1" align="center">
                    {title} - {department}
                  </Typography>
                  <Typography variant="body1">
                    {name} has been at the company since {joined}
                  </Typography>
                </Card>
              </Grid>
            )
          })}
        </Grid>
      </header>
    </div>
  )
}

export default App

The styling and components we used will result in the cards looking like this (note that the black background is the CRA default background, not the actual card):

[Screenshot: employee cards rendered on the CRA default background]

You'll be able to see it for yourself once we have wired up the API and implemented the data generation.

Client Dockerfile

The last step we need to finish the client-side portion of our small application is to create the Dockerfile.dev file that will be utilized by Docker Compose to run the React application.

Here it is. We just have to install the necessary dependencies into the image and then run the development server as normal:

FROM node:10-alpine
WORKDIR /app

COPY package.json .
RUN npm update
RUN NODE_ENV=development npm install

COPY . .

CMD ["npm", "run", "start"]

API

On the API, we'll have a single unauthenticated route named /employees which will return an array of objects containing the properties we defined above.

The folder structure for the api will ultimately end up looking like this:

api/
  node_modules/
  src/
    models/
      Employee.js
    index.js
  Dockerfile.dev
  package-lock.json
  package.json

The Employee.js model will contain a simple Mongoose model which we'll use to interface with the database when querying for the list of employees.

API Dependencies

Then, we will have to download the necessary dependencies to quickly make a web server and integrate with a MongoDB server. Specifically we'll utilize Express, Mongoose and Nodemon.

The first two we'll download as regular dependencies with the following command (make sure you're in the api directory and not in the project root):

npm i express mongoose --save

Then we'll install nodemon as a development dependency:

npm i nodemon --save-dev

Once you have your dependencies downloaded, make sure to add the 'nodemon' prefix to your npm start script. The "start" script in your package.json should look like this:

1"start": "nodemon src/index.js"

Getting Started On The API

First, let's build out the Employee Mongoose model. In the Employee.js file in the models folder, the model can be created like this:

Employee.js

const mongoose = require("mongoose");
const { Schema } = mongoose;

const EmployeeSchema = new Schema({
  name: String,
  title: String,
  department: String,
  joined: Date,
});

const Employee = mongoose.model("employee", EmployeeSchema);

module.exports = Employee;

The 'mongoose.model' function registers the model with Mongoose, as long as we require the file in our index.js.

Then, in our index.js file, we require the Employee model, create a basic Express server, and define our single route, GET /employees.

index.js

const express = require("express")
const mongoose = require("mongoose")
require("./models/Employee")
const Employee = mongoose.model("employee")
const PORT = process.env.PORT || 8080
const MONGO_URI = process.env.MONGO_URI || ""
const app = express()

app.get("/employees", async (req, res) => {
  const employees = await Employee.find()
  res.status(200).send({ employees })
})

mongoose.connect(MONGO_URI, {
  useNewUrlParser: true,
  useUnifiedTopology: true,
  useFindAndModify: true,
})

app.listen(PORT, () => {
  console.log(`MONGO URI ${MONGO_URI}`)
  console.log(`API listening on port ${PORT}`)
})

API Dockerfile


The API Dockerfile will look exactly the same as the Client Dockerfile, since we've updated the package.json file to abstract away the functionality the API needs.

Dockerfile.dev

FROM node:10-alpine
WORKDIR /app

COPY ./package.json ./
RUN npm update
RUN NODE_ENV=development npm install

COPY . .
CMD ["npm", "run", "start"]

NGINX

From the project root, create a folder named nginx, which will contain the configuration of an NGINX server that will route the requests either to the React application or the Nodejs API.

The following is the NGINX configuration, which you should name nginx.conf. It defines the upstream servers for the client and the API.

nginx.conf

upstream client {
    server client:3000;
}

upstream api {
    server api:8080;
}

server {
    listen 80;

    location / {
        proxy_pass http://client;
    }

    location /sockjs-node {
        proxy_pass http://client;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "Upgrade";
    }

    location /api {
        rewrite /api/(.*) /$1 break;
        proxy_pass http://api;
    }
}

The blocks for sockjs-node are there to allow for the websocket connection that CRA utilizes during development.

We also need to create a Dockerfile for the NGINX server that uses our config file to override the default. Make sure to create the Dockerfile in the same folder as the config file.

Dockerfile.dev

FROM nginx
COPY ./nginx.conf /etc/nginx/conf.d/default.conf

Docker Compose

We won't be going too deeply into how Compose works in this article, but suffice it to say that it ties together the individual containers we defined above.

docker-compose.yml

1version: "3"
2services:
3  client:
4    build:
5      context: "."
6      dockerfile: "Dockerfile.dev"
7    stdin_open: true # fixes the auto exit issue: https://github.com/facebook/create-react-app/issues/8688
8    volumes:
9      - ./src:/app/src
10  api:
11    build:
12      context: "./api"
13      dockerfile: "Dockerfile.dev"
14    volumes:
15      - ./api/src:/app/src
16    environment:
17      - MONGO_URI="mongodb://mongo:27017"
18  nginx:
19    restart: always
20    depends_on:
21      - api
22      - client
23    build:
24      context: ./nginx
25      dockerfile: Dockerfile
26    ports:
27      - "3050:80"
28  mongo:
29    image: "mongo:latest"
30    ports:
31      - "27017:27017"
32  dbseed:
33    build:
34      context: ./mongo
35      dockerfile: Dockerfile.dev
36    links:
37      - mongo

Towards the bottom of the docker-compose.yml file you'll see the services for the MongoDB database and the container that will seed it.

Now that we've finished defining the foundations of the application, we will move on to creating the mongo directory, where we will define the Dockerfile for the dbseed service and the scripts for generating data.


Data Generation

Before defining the database seeding container, first we'll focus on the actual data generation for development data.

The folder structure of the data generation script and DB seeding container will match the following:

mongo/
  node_modules/
  scripts/
    index.js
    employees.js
  Dockerfile.dev
  init.sh
  package.json
  package-lock.json

The scripts will output an array of JSON objects, which will be imported into the database. Then the bash file, init.sh, will handle importing the generated data into the running database.

Data Gen Dependencies

As part of the data generation scripts, we only utilize two NPM libraries: yargs and faker. They can be downloaded by running the following command:

npm i faker yargs --save

These libraries make it incredibly simple to generate fake data and to handle CLI arguments in JS files, respectively.

Our Data Generation Scripts

We will have two main files for the data generation: an index.js file, which will serve as our point of contact for the data generation, and employees.js, which will hold all of the data generation functions needed for employees.

index.js

const yargs = require("yargs")
const fs = require("fs")
const { generateEmployees } = require("./employees")
const argv = yargs
  .command("amount", "Decides the number of employees to generate", {
    amount: {
      description: "The amount to generate",
      alias: "a",
      type: "number",
    },
  })
  .help()
  .alias("help", "h").argv

if (argv.hasOwnProperty("amount")) {
  const amount = argv.amount
  const employees = generateEmployees(amount)

  const jsonObj = JSON.stringify(employees)
  fs.writeFileSync("employeedata.json", jsonObj)
}

employees.js

const faker = require("faker")
const localuser = require("../localuser.json")

const generateEmployees = (amount) => {
  let employees = []
  for (let x = 0; x < amount; x++) {
    employees.push(createEmployee())
  }
  employees.push(createEmployee(localuser))
  return employees
}

const createEmployee = (user) => {
  const companyDepartments = [
    "Marketing",
    "Finance",
    "Operations Management",
    "Human Resources",
    "IT",
  ]
  const employeeDepartment =
    companyDepartments[Math.floor(Math.random() * companyDepartments.length)]
  const employee = {
    name: faker.name.findName(),
    title: faker.name.jobTitle(),
    department: employeeDepartment,
    joined: faker.date.past(),
    ...user,
  }
  return employee
}

module.exports = {
  generateEmployees,
}

This script is then called by the aforementioned init.sh, a simple bash file that runs the mongoimport CLI command.

init.sh

#!/bin/sh
mongoimport --collection employees --file employeedata.json --jsonArray --uri "mongodb://mongo:27017"

Database Seeding Container

Now that we've defined the scripts to generate and import the data, we can define the Dockerfile that will be utilized by Docker Compose.

Specifically, we will utilize a multi-stage build to first generate the data, then copy it from the generator stage into a mongo container, which executes the init.sh bash script.

Dockerfile.dev

FROM node:10-alpine as generator
WORKDIR /data
COPY . .
RUN npm install
RUN node ./scripts/index.js --amount 10

FROM mongo:latest

COPY . .
COPY --from=generator ./data/ .
RUN ["chmod", "+x", "init.sh"]
CMD ./init.sh

Things To Keep In Mind When Generating Data

When generating development data for a MongoDB database there are three primary concerns that must be considered:

  1. DB Import Method
    1. For our case, mongoimport vs mongorestore
  2. Predefined Data vs Randomly Generated Data
  3. Inter-Collection Relationships
    1. (i.e., the unique IDs of one collection being utilized in another collection)

Within this article we only have to consider the first one, but we will cover and discuss the other two as well.

Importing Data Into MongoDB

There are two major methods to import data into a running MongoDB database via the CLI: mongoimport and mongorestore. The primary difference between these two import methods is the data types they work with and the metadata they preserve.

Specifically, mongorestore only works with BSON data, which allows it to run faster and preserve the metadata BSON provides.

This is possible because, unlike mongoimport, mongorestore doesn't have to convert the data from JSON into BSON.

That conversion process doesn't guarantee that the rich data types provided by BSON are maintained in the import process, which is why mongoimport isn't recommended for use in production systems.

Why not go with mongorestore?

Mongorestore is:

  1. Faster than mongoimport
  2. Able to preserve all of the metadata

But the reason I'd advise utilizing mongoimport for development data instead is the simplicity it provides.

Due to the flexibility of the data it can receive, mongoimport is significantly easier to use than mongorestore. Unlike its faster alternative, mongoimport can directly import both JSON and CSV.

This allows us to write a simple script that generates an array of JSON objects, which can be easily imported like so:

mongoimport --collection employees --file employeedata.json --jsonArray --uri "mongodb://mongo:27017"
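For contrast, mongorestore can't consume this JSON file directly; it works on a BSON dump produced by mongodump. A typical round trip would look something like this (the dump directory name is just an example):

# Dump an existing database to BSON files plus metadata
mongodump --uri "mongodb://mongo:27017" --out ./dump

# Restore that dump into another running instance
mongorestore --uri "mongodb://mongo:27017" ./dump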

Predefined Data Alongside Faked Data

There may be times when the generated data used for development should be related to developer-dependent information. For example, the developer has a specific logon (username and userId) and the generated data is user-specific.

Hence, in order for the developer to have data generated for their specific account, there should be an optional JSON file that is only defined locally.

We can achieve this by creating a JSON file in the same folder as the data generation scripts. For example:

localuser.json

{
  "_id": "<Unique Identifier>",
  "name": "<User Name>"
}

This can then be imported and used by the general data generation script like so:

const faker = require("faker")
const localuser = require("../localuser.json")

const generateEmployees = (amount) => {
  let employees = []
  for (let x = 0; x < amount; x++) {
    employees.push(createEmployee())
  }
  employees.push(createEmployee(localuser.name))
  return employees
}

const createEmployee = (name) => {
  const companyDepartments = [
    "Marketing",
    "Finance",
    "Operations Management",
    "Human Resources",
    "IT",
  ]
  const employeeDepartment =
    companyDepartments[Math.floor(Math.random() * companyDepartments.length)]
  const employee = {
    name: name ? name : faker.name.findName(),
    title: faker.name.jobTitle(),
    department: employeeDepartment,
    joined: faker.date.past(),
  }

  return employee
}

module.exports = {
  generateEmployees,
}

Here you can see how we can import the localuser and then create an employee based on the provided data. In this situation we could also use destructuring to provide an easier way to override the generated data with an arbitrary number of properties. Like this:

const createEmployee = (user) => {
  const companyDepartments = [
    "Marketing",
    "Finance",
    "Operations Management",
    "Human Resources",
    "IT",
  ]
  const employeeDepartment =
    companyDepartments[Math.floor(Math.random() * companyDepartments.length)]
  const employee = {
    name: faker.name.findName(),
    title: faker.name.jobTitle(),
    department: employeeDepartment,
    joined: faker.date.past(),
    ...user,
  }

  return employee
}

But do note that the JSON keys must match the properties defined in the 'employee' object. So to override the name and title properties, the localuser.json must look like this:

{
  "name": "Jane Doe",
  "title": "Senior Software Engineer"
}

Inter-collection relationships

Let's say that the company that all of our employees are a part of gives each employee a computer. In such a case we would want to keep track of each computer the company owns and the employee who currently has it.

Its schema would look a bit like this (ignore the overly simplistic example):

{
  computerName: String,
  employeeName: String
}

Hence, if we wanted to generate data for the computers the company owns we would have to utilize the names of the employees we generated.

This inter-collection example uses a computer schema that isn't how it would be done in real life; it would probably make more sense as an embedded document within an employee document. It's used here for simplicity's sake.

We can do this by simply passing down the array of employees generated to the function that generates the computers.

This would look roughly like this:

const yargs = require("yargs")
const fs = require("fs")
const { generateEmployees } = require("./employees")
const { generateComputers } = require("./computers")

const argv = yargs
  .command("amount", "Decides the number of employees to generate", {
    amount: {
      description: "The amount to generate",
      alias: "a",
      type: "number",
    },
  })
  .help()
  .alias("help", "h").argv

if (argv.hasOwnProperty("amount")) {
  const amount = argv.amount
  const employees = generateEmployees(amount)
  const computers = generateComputers(amount, employees)

  const jsonObj = JSON.stringify(employees)
  fs.writeFileSync("employeedata.json", jsonObj)
  const computerObj = JSON.stringify(computers)
  fs.writeFileSync("computerdata.json", computerObj)
}

Where generateComputers is a function similar to generateEmployees, but takes an extra parameter holding the data that belongs to a separate collection.
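The article doesn't include computers.js itself, but a minimal sketch might look like the following (the computerName values and the round-robin pairing with employees are assumptions made for illustration):

const faker = require("faker")

// Hypothetical computers.js: generate the requested amount of computers,
// assigning each one to a previously generated employee by name.
const generateComputers = (amount, employees) => {
  const computers = []
  for (let x = 0; x < amount; x++) {
    const employee = employees[x % employees.length]
    computers.push({
      computerName: faker.commerce.productName(),
      employeeName: employee.name,
    })
  }
  return computers
}

module.exports = {
  generateComputers,
}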


Conclusion

Congrats!! Now everything has been hooked together, and the data you need should be in the database.
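If you haven't already, bring the whole stack up from the project root:

docker-compose up --build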

You can go to localhost:3050 and should see something like this:

[Screenshot: a column of employee cards rendered from the generated data]

With all of the names, titles, departments, etc (except for the one specified in localuser.json) being randomly generated.

The final big-picture folder structure of the application should look kinda like this:

api/
client/
mongo/
nginx/
docker-compose.yml

You can check out the GitHub repository to double-check against your version if you're having any issues.

Future Steps

  • Integrate the scripts with the TypeScript types used by the Mongoose schema
  • Have the script export BSON instead of JSON