Luis Osta

Lead Developer @ Valencian Digital · @LuisOsta · June 20, 2020

Seeding Data Into MongoDB Using Docker

Learn how to seed data into a running Docker container in a simple and flexible way.

Modern applications are, at least to some extent, data-rich. This means applications will often have features like the Twitter feed, aggregated statistics, friends/followers and many others that rely on complex, inter-related data.

It's this data that provides the vast majority of the application's value. Twitter would be quite useless if the only thing you could do was post, and you could only see a handful of others' tweets.

The biggest pitfall most developers can fall into is re-using the production database for development

Due to the complexity of this data, during the early development process of an application, it's tempting to use the same database for production as for development.

The thinking is, "if we need the most realistic data, what's more realistic than actual user-generated data?"

There are a few serious reasons why you should strongly consider not going that route:

  1. Even on a one-person team, re-using the same database means any glitch or bug you create during development spreads to production. Nobody thinks they'd ever accidentally nuke production until it happens to them.
  2. In larger teams, the shared state (due to database re-use) becomes an even bigger issue. Since every developer can affect the underlying data the application uses, it becomes harder to debug and performance-test specific queries and features, because the application state can change at any moment without you knowing.
  3. It ties your development to the availability and uptime of an external system. Even if your database is running on a powerful server with plenty of resources, there's no reason to add to your monthly bill or risk a developer accidentally DDoSing the database.

How can generated data solve those problems?

The combination of using generated data with a locally running database will prevent any of those aforementioned problems from causing significant issues.

Even if you do nuke the database or DDoS yourself, it's a trivial task to refresh your development environment or re-generate the data you need.

By reducing the number of external dependencies during development, we increase system consistency, which makes debugging and isolating issues easier.

But the additional value gained from data generation will depend on two major factors:

  1. The quantity of data generated determines the realism of the dataset and which issues may stay hidden during development.
  2. How the data is generated affects the initial start time of the database and the long-term maintainability and modifiability of the data generation process.
    1. (i.e., if the scripts are a mess and difficult to use, they'll cause more problems than they're worth)

So what should we as developers do instead?

To have a setup that maximizes your chances of catching bugs and testing the real quality of the software being developed, the data powering the application must follow these rules:

  1. The data should be generated on the developer's computer and stored in a locally running database. This prevents an individual developer's bugs and problems from rippling out to other developers' machines.
  2. The amount of generated data should be large enough to make performance and UI issues clear, but not so large as to significantly slow down development.
  3. The generated data should only be available at runtime, and should not be committed to the actual codebase.

Our Requirements

  1. Node & NPM
  2. Docker - You can install it by following the official documentation if you're on Mac or Linux. If you're on Windows, I went in-depth into your options for getting Docker in the previous article.
  3. Basic knowledge of React, Node and MongoDB. The files are provided if you're not familiar with React or Node (or both); the focus will be on the data generation for MongoDB.

Foundations

For this article, we will create a simple React application that will simply render a list of employees. Specifically it will display:

  1. The employee name
  2. The employee title
  3. The department the employee works in
  4. The date they joined the company

Since the front-end needs to display the list of employees, the API will return an array of objects. Each object will have the aforementioned properties for one of the employees stored in the DB.

Since we're focusing on the database side, we'll breeze through the rest of the application, but in a future article I'll dive deeper into how to build complex orchestrations with Docker Compose.

Client

For our front-end client, we'll be utilizing React to display the data stored on the MongoDB database. The client will make requests using axios to the Node API.

To get started, we will use Create React App to set up a baseline application that we'll make a few changes to.

You can create a CRA application with the following command from the project root:

npx create-react-app client

Client Dependencies

Then, we will have to download the dependencies that we'll need for the React application. For our purposes we're only going to need axios and material-ui.

You can download them both with the following command (make sure you're in the client directory, not the project root):

npm i axios @material-ui/core --save

Getting Started On The Client

For our purposes we will only be making changes to the App.js file, which in the starter project is the main component that displays content.

This is what that file should look like at the start:

import React from 'react';
import logo from './logo.svg';
import './App.css';

function App() {
  return (
    <div className="App">
      <header className="App-header">
        <img src={logo} className="App-logo" alt="logo" />
        <p>
          Edit <code>src/App.js</code> and save to reload.
        </p>
        <a
          className="App-link"
          href="https://reactjs.org"
          target="_blank"
          rel="noopener noreferrer"
        >
          Learn React
        </a>
      </header>
    </div>
  );
}

export default App;

The changes we will make to this file are in order:

  1. Remove all of the children of the <header> HTML tag
  2. Create a custom React Hook to call the API endpoint /api/employees and return the appropriate array of data
  3. Add a component within the <header> tag that will render each element of the array as a card

After the three steps, your App.js file will look something like this:

import React, { useState, useEffect } from "react";
import { Card, Grid, Typography, makeStyles } from "@material-ui/core";
import axios from "axios";
import "./App.css";

const useEmployees = () => {
  const [employees, setEmployees] = useState([]);

  useEffect(() => {
    const handleAPI = async () => {
      const { data } = await axios.get("/api/employees");
      const newEmployees = data.employees || [];
      setEmployees(newEmployees);
    };

    handleAPI();
  }, []);
  return employees;
};

const useStyles = makeStyles((theme) => ({
  card: {
    padding: theme.spacing(5),
  },
}));

function App() {
  const employees = useEmployees();
  const classes = useStyles();
  return (
    <div className="App">
      <header className="App-header">
        <Grid container direction="column" spacing={2} alignItems="center">
          {employees.map((value, index) => {
            const { name, title, department, joined } = value;
            const key = `${name}-${index}`;
            return (
              <Grid item key={key}>
                <Card raised className={classes.card}>
                  <Typography variant="h4">{name}</Typography>
                  <Typography variant="subtitle1" align="center">
                    {title}{department}
                  </Typography>
                  <Typography variant="body1">
                    {name} has been at the company since {joined}
                  </Typography>
                </Card>
              </Grid>
            );
          })}
        </Grid>
      </header>
    </div>
  );
}

export default App;

The styling and components we used will result in the cards looking like this (note that the black background is the CRA default background, not the actual card):

You'll be able to see it for yourself once we have wired up the API and implemented the data generation.

Client Dockerfile

The last step needed to finish the client-side portion of our small application is to create the Dockerfile.dev file that Docker Compose will use to run the React application.

Here it is. We just have to install the necessary dependencies into the image and then run the development server as normal:

FROM node:10-alpine
WORKDIR /app

COPY package.json .
RUN npm update
RUN NODE_ENV=development npm install

COPY . .

CMD ["npm", "run", "start"]

API

On the API, we'll have a single unauthenticated route named /employees which will return an array of objects containing the properties we defined above.

The folder structure for the api will ultimately end up looking like this:

api/
 node_modules/
 src/
   models/
     Employee.js
   index.js
 Dockerfile.dev
 package-lock.json
 package.json

The Employee.js file will contain a simple Mongoose model which we'll use to interface with the database when querying for the list of employees.

API Dependencies

Then, we will have to download the necessary dependencies to quickly make a web server and integrate with a MongoDB server. Specifically we'll utilize Express, Mongoose and Nodemon.

The first two we'll download as regular dependencies with the following command (make sure you're in the api directory and not in the project root):

npm i express mongoose --save

Then we'll install nodemon as a development dependency:

npm i nodemon --save-dev

Once you have your dependencies downloaded make sure to add the 'nodemon' prefix to your npm start script. Your "start" script in the package.json should look like this:

"start": "nodemon src/index.js"

Getting Started On The API

First, let's build out the Employee mongoose model. In the Employee.js file in the models folder, the model can be created like this:

Employee.js

const mongoose = require("mongoose");
const { Schema } = mongoose;

const EmployeeSchema = new Schema({
  name: String,
  title: String,
  department: String,
  joined: Date,
});

const Employee = mongoose.model("employee", EmployeeSchema);

module.exports = Employee;

Here the 'mongoose.model' function registers the model with mongoose, as long as we require the file in our index.js.

Then, in our index.js file, we require the Employee model, create a basic express server and define our single route, GET /employees.

index.js

const express = require("express");
const mongoose = require("mongoose");
require("./models/Employee");
const Employee = mongoose.model("employee");
const PORT = process.env.PORT || 8080;
const MONGO_URI = process.env.MONGO_URI || "";
const app = express();

app.get("/employees", async (req, res) => {
  const employees = await Employee.find();
  res.status(200).send({ employees });
});

mongoose.connect(MONGO_URI, {
  useNewUrlParser: true,
  useUnifiedTopology: true,
  useFindAndModify: true,
});

app.listen(PORT, () => {
  console.log(`MONGO URI ${MONGO_URI}`);
  console.log(`API listening on port ${PORT}`);
});
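
A small caution on reading configuration from the environment: `||` returns its first truthy operand, so the environment variable has to come first for an override to take effect. The sketch below uses a hypothetical helper, `resolvePort`, which is not part of the API code above; it only illustrates the operand order.

```javascript
// Hypothetical helper (not part of the article's API): resolves the port
// from an env-like object, falling back to a default when it is unset.
const resolvePort = (env, fallback = 8080) => {
  const parsed = Number.parseInt(env.PORT, 10);
  // Number.isNaN guards against a missing or non-numeric PORT value.
  return Number.isNaN(parsed) ? fallback : parsed;
};

// Literal first: the env value can never win, because 8080 is always truthy.
const literalFirst = (env) => 8080 || env.PORT;

console.log(resolvePort({ PORT: "3001" }));  // 3001, the override is used
console.log(resolvePort({}));                // 8080, the default
console.log(literalFirst({ PORT: "3001" })); // 8080, the override is ignored
```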

API Dockerfile


The API Dockerfile will look exactly the same as the Client Dockerfile, since we've updated the package.json file to abstract away the functionality the API needs.

Dockerfile.dev

FROM node:10-alpine
WORKDIR /app

COPY ./package.json ./
RUN npm update
RUN NODE_ENV=development npm install

COPY . .
CMD ["npm", "run", "start"]

NGINX

From the project root, create a folder named nginx, which will contain the configuration of an NGINX server that routes requests either to the React application or to the Node.js API.

The following is the NGINX configuration, which you should name nginx.conf. It defines the upstream servers for the client and the API.

nginx.conf

upstream client {
    server client:3000;
}

upstream api {
    server api:8080;
}

server {
    listen 80;

    location / {
        proxy_pass http://client;
    }

    location /sockjs-node {
        proxy_pass http://client;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "Upgrade";
    }

    location /api {
        rewrite /api/(.*) /$1 break;
        proxy_pass http://api;
    }
}

The blocks for sockjs-node are there to allow for the websocket connection that CRA utilizes during development.
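
To make the /api block concrete: the rewrite rule strips the /api prefix before the request is proxied, so the client's /api/employees reaches the API as /employees. The sketch below mimics that single rule with a JavaScript regular expression. `rewriteApiPath` is a hypothetical name used for illustration only, and this is not how NGINX itself evaluates rewrites.

```javascript
// Mimics `rewrite /api/(.*) /$1 break;` for illustration only.
const rewriteApiPath = (path) => {
  const match = path.match(/\/api\/(.*)/);
  // When the pattern does not match, the URI is left unchanged.
  return match ? `/${match[1]}` : path;
};

console.log(rewriteApiPath("/api/employees")); // "/employees"
console.log(rewriteApiPath("/api"));           // "/api" (no trailing segment to capture)
```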

We also need to create a Dockerfile for the NGINX server that uses our config file to override the default. Make sure to create the Dockerfile in the same folder as the config file.

Dockerfile.dev

FROM nginx
COPY ./nginx.conf /etc/nginx/conf.d/default.conf

Docker Compose

We won't be going too deeply into how Compose works in this article, but suffice it to say that it ties together the individual containers we defined above.

docker-compose.yml

version: "3"
services:
  client:
    build:
      context: "./client"
      dockerfile: "Dockerfile.dev"
    stdin_open: true # fixes the auto exit issue: https://github.com/facebook/create-react-app/issues/8688
    volumes:
      - ./client/src:/app/src
  api:
    build:
      context: "./api"
      dockerfile: "Dockerfile.dev"
    volumes:
      - ./api/src:/app/src
    environment:
      - MONGO_URI=mongodb://mongo:27017
  nginx:
    restart: always
    depends_on:
      - api
      - client
    build:
      context: ./nginx
      dockerfile: Dockerfile.dev
    ports:
      - "3050:80"
  mongo:
    image: "mongo:latest"
    ports:
      - "27017:27017"
  dbseed:
    build:
      context: ./mongo
      dockerfile: Dockerfile.dev
    links:
      - mongo

Towards the bottom of the docker-compose.yml file you'll see the services for the MongoDB database and for the container that will seed it.

Now that we've finished defining the foundations of the application, we will move on to creating the mongo directory, where we will define the Dockerfile for the dbseed service and the scripts for generating data.


Data Generation

Before defining the database seeding container, first we'll focus on the actual data generation for development data.

The folder structure of the data generation script and DB seeding container will match the following:

mongo/
 node_modules/
 scripts/
  index.js
  employees.js
 Dockerfile.dev
 init.sh
 package.json
 package-lock.json

The scripts will output an array of JSON objects, which will be imported into the database. Then the shell script, init.sh, will handle importing the generated data into the running database.

Data Gen Dependencies

The data generation scripts only rely on two NPM libraries: yargs and faker. They can be installed by running the following command:

npm i faker yargs --save

These libraries make it incredibly simple to generate fake data (faker) and to handle CLI arguments in JS files (yargs).

Our Data Generation Scripts

We'll need two main files for the data generation: an index.js file, which serves as the entry point, and employees.js, which holds all of the data generation functions needed for employees.

index.js

const yargs = require("yargs");
const fs = require("fs");
const { generateEmployees } = require("./employees");
const argv = yargs
  .command("amount", "Decides the number of employees to generate", {
    amount: {
      description: "The amount to generate",
      alias: "a",
      type: "number",
    },
  })
  .help()
  .alias("help", "h").argv;

if (argv.hasOwnProperty("amount")) {
  const amount = argv.amount;
  const employees = generateEmployees(amount);

  const jsonObj = JSON.stringify(employees);
  fs.writeFileSync("employeedata.json", jsonObj);
}

employees.js

const faker = require("faker");
const localuser = require("../localuser.json");
const generateEmployees = (amount) => {
  const employees = [];
  // Generate the requested number of fully random employees
  for (let x = 0; x < amount; x++) {
    employees.push(createEmployee());
  }
  // Add one extra employee based on the developer's local overrides
  employees.push(createEmployee(localuser));
  return employees;
};

const createEmployee = (user) => {
  const companyDepartments = [
    "Marketing",
    "Finance",
    "Operations Management",
    "Human Resources",
    "IT",
  ];
  // Pick a random department from the list
  const employeeDepartment =
    companyDepartments[Math.floor(Math.random() * companyDepartments.length)];
  const employee = {
    name: faker.name.findName(),
    title: faker.name.jobTitle(),
    department: employeeDepartment,
    joined: faker.date.past(),
    // Any properties provided in `user` override the generated values
    ...user,
  };
  return employee;
};

module.exports = {
  generateEmployees,
};

The output of this script is then imported by the aforementioned init.sh, a simple shell script that runs the mongoimport CLI command.

init.sh

#!/bin/sh
mongoimport --collection employees --file employeedata.json --jsonArray --uri "mongodb://mongo:27017"
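
Because mongoimport is run with --jsonArray, employeedata.json has to contain a single top-level JSON array. A quick sanity check before importing might look like the sketch below. `validateSeedData` and the list of required fields are assumptions made for this example, based on the four properties the employee documents use; they are not part of the scripts above.

```javascript
// Hypothetical pre-import check; not part of the article's scripts.
const REQUIRED_FIELDS = ["name", "title", "department", "joined"];

const validateSeedData = (jsonText) => {
  const parsed = JSON.parse(jsonText); // throws on malformed JSON
  if (!Array.isArray(parsed)) {
    throw new Error("Seed file must contain a top-level JSON array");
  }
  parsed.forEach((doc, index) => {
    if (doc === null || typeof doc !== "object") {
      throw new Error(`Document ${index} is not an object`);
    }
    const missing = REQUIRED_FIELDS.filter((field) => !(field in doc));
    if (missing.length > 0) {
      throw new Error(`Document ${index} is missing: ${missing.join(", ")}`);
    }
  });
  // Return how many documents passed the check
  return parsed.length;
};

const sample = JSON.stringify([
  { name: "Jane Doe", title: "Engineer", department: "IT", joined: "2020-06-20T00:00:00.000Z" },
]);
console.log(validateSeedData(sample)); // 1
```

In practice the text would come from reading employeedata.json with fs.readFileSync before init.sh runs.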

Database Seeding Container

Now that we've defined the scripts to generate and import the data, we can define the Dockerfile that will be utilized by Docker Compose.

Specifically, we will use a multi-stage build that first generates the data in a generator stage, then copies that data into a mongo stage, which executes the init.sh script.

Dockerfile.dev

FROM node:10-alpine as generator
WORKDIR /data
COPY . .
RUN npm install
RUN node ./scripts/index.js --amount 10

FROM mongo:latest

COPY . .
COPY --from=generator ./data/ .
RUN ["chmod", "+x", "init.sh"]
CMD ./init.sh

Things To Keep In Mind When Generating Data

When generating development data for a MongoDB database there are three primary concerns that must be considered:

  1. DB Import Method
    1. For our case mongoimport vs mongorestore
  2. Predefined Data Vs Randomly Generated
  3. Inter-collection relationships
    1. Like the unique IDs of one collection being used in another collection

Within this article we only have to deal with the first one, but we will cover and discuss the other two as well.

Importing Data Into MongoDB

There are two major CLI tools for importing data into a running MongoDB database: mongoimport and mongorestore. The primary difference between them is the data types they work with and the metadata they preserve.

Specifically, mongorestore only works with BSON data (such as the dumps produced by mongodump), which allows it to run faster and preserve the metadata BSON provides.

This is possible because, unlike mongoimport, mongorestore doesn't have to convert the data from JSON into BSON.

That conversion doesn't guarantee that the rich data types provided by BSON are maintained during the import, which is why mongoimport isn't recommended for use in production systems.
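
As a concrete instance of the type loss described above: JSON has no date type, so JSON.stringify writes a Date as an ISO string, and a plain string in the input file is imported as a string. MongoDB's Extended JSON represents dates as {"$date": "..."}, a form mongoimport understands. The sketch below shows both forms. `toExtendedJson` is a hypothetical helper written for this example, and it only handles top-level Date values.

```javascript
// Hypothetical helper: wraps top-level Date values in Extended JSON form.
const toExtendedJson = (doc) =>
  Object.keys(doc).reduce((result, key) => {
    const value = doc[key];
    result[key] = value instanceof Date ? { $date: value.toISOString() } : value;
    return result;
  }, {});

const employee = { name: "Jane Doe", joined: new Date("2020-06-20T00:00:00.000Z") };

// Plain JSON round-trip: the Date comes back as a string.
const plain = JSON.parse(JSON.stringify(employee));
console.log(typeof plain.joined); // "string"

// Extended JSON keeps the information that the value is a date.
console.log(JSON.stringify(toExtendedJson(employee)));
// {"name":"Jane Doe","joined":{"$date":"2020-06-20T00:00:00.000Z"}}
```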

Why not go with mongorestore

Mongorestore:

  1. Is faster than mongoimport
  2. Preserves all of the metadata

But the reason I'd advise using mongoimport for development data instead is the simplicity it provides.

Due to the flexibility of the data it can receive, mongoimport is significantly easier to use than mongorestore. Unlike its faster alternative, mongoimport can directly import both JSON and CSV.

This allows us to write a simple script that generates an array of JSON objects, which can then be easily imported like so:

mongoimport --collection employees --file employeedata.json --jsonArray --uri "mongodb://mongo:27017"

Predefined Data Alongside Faked Data

There may be times when the generated data used for development should be tied to developer-specific information. For example, the developer has a specific login (username and userId) and the generated data is user-specific.

Hence, in order for the developer to have data generated for their specific account, there should be an optional JSON file that is only defined locally.

We can achieve this by creating a JSON file in the mongo folder, one level above the data generation scripts (matching the "../localuser.json" path they require). For example:

localuser.json

{
  "_id": "<Unique Identifier>",
  "name": "<User Name>"
}

Which can then be imported and used by the general data generation script as such:

const faker = require("faker");
const localuser = require("../localuser.json");
const generateEmployees = (amount) => {
  let employees = [];
  for (let x = 0; x < amount; x++) {
    employees.push(createEmployee());
  }
  employees.push(createEmployee(localuser.name));
  return employees;
};

const createEmployee = (name) => {
  const companyDepartments = [
    "Marketing",
    "Finance",
    "Operations Management",
    "Human Resources",
    "IT",
  ];
  const employeeDepartment =
    companyDepartments[Math.floor(Math.random() * companyDepartments.length)];
  const employee = {
    name: name ? name : faker.name.findName(),
    title: faker.name.jobTitle(),
    department: employeeDepartment,
    joined: faker.date.past(),
  };

  return employee;
};

module.exports = {
  generateEmployees,
};

Here you can see how we can import the localuser and then create an employee based on the provided data. In this situation we could also use the object spread syntax to make it easier to override the generated data with an arbitrary number of properties, like this:

const createEmployee = (user) => {
  const companyDepartments = [
    "Marketing",
    "Finance",
    "Operations Management",
    "Human Resources",
    "IT",
  ];
  const employeeDepartment =
    companyDepartments[Math.floor(Math.random() * companyDepartments.length)];
  const employee = {
    name: faker.name.findName(),
    title: faker.name.jobTitle(),
    department: employeeDepartment,
    joined: faker.date.past(),
    ...user,
  };

  return employee;
};

But do note that the JSON keys must match the properties defined in the 'employee' object. So to override the title and name properties, the localuser.json must look like this:

{
  "name": "Jane Doe",
  "title": "Senior Software Engineer"
}
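
Since the local file is meant to be optional, one way to load it without failing when it is absent is sketched below, together with the spread-based override. `loadLocalUser` and `applyOverrides` are hypothetical names introduced for this example, and a temporary file stands in for localuser.json.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: returns the parsed overrides, or {} when the
// optional file does not exist on this developer's machine.
const loadLocalUser = (filePath) => {
  if (!fs.existsSync(filePath)) {
    return {};
  }
  return JSON.parse(fs.readFileSync(filePath, "utf8"));
};

// Later spreads win, so keys from `overrides` replace the generated ones.
const applyOverrides = (generated, overrides) => ({ ...generated, ...overrides });

// Demo with a temporary file standing in for localuser.json.
const tempFile = path.join(os.tmpdir(), `localuser-${process.pid}.json`);
fs.writeFileSync(
  tempFile,
  JSON.stringify({ name: "Jane Doe", title: "Senior Software Engineer" })
);
const missingFile = path.join(os.tmpdir(), `no-localuser-${process.pid}.json`);

const generated = { name: "Random Name", title: "Random Title", department: "IT" };
const withOverrides = applyOverrides(generated, loadLocalUser(tempFile));
const withoutOverrides = applyOverrides(generated, loadLocalUser(missingFile));
fs.unlinkSync(tempFile);

console.log(withOverrides);    // name and title come from the local file
console.log(withoutOverrides); // identical to the generated employee
```

The department is untouched in both cases because the local file does not define it.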

Inter-collection relationships

Let's say that the company that all of our employees are a part of gives each employee a computer. In such a case we would want to keep track of each computer the company owns and the employee who currently has it.

Its schema would look a bit like this (forgive the overly simplistic example):

{
computerName: String,
employeeName: String
}

Hence, if we wanted to generate data for the computers the company owns we would have to utilize the names of the employees we generated.

This inter-collection example uses a computer schema that isn't how it would actually be done in real life. This would probably make more sense as an embedded document within an employee document. This example is just used for simplicity's sake.

We can do this by simply passing down the array of employees generated to the function that generates the computers.

This would look roughly like this:

const yargs = require("yargs");
const fs = require("fs");
const { generateEmployees } = require("./employees");
const { generateComputers } = require("./computers");

const argv = yargs
  .command("amount", "Decides the number of employees to generate", {
    amount: {
      description: "The amount to generate",
      alias: "a",
      type: "number",
    },
  })
  .help()
  .alias("help", "h").argv;

if (argv.hasOwnProperty("amount")) {
  const amount = argv.amount;
  const employees = generateEmployees(amount);
  const computers = generateComputers(amount, employees);

  const jsonObj = JSON.stringify(employees);
  fs.writeFileSync("employeedata.json", jsonObj);
  const computerObj = JSON.stringify(computers);
  fs.writeFileSync("computerdata.json", computerObj);
}

Where generateComputers is a function similar to generateEmployees but takes an extra parameter that holds the data that belongs to a separate collection.
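
A minimal sketch of what such a function could look like follows. The computer naming scheme and the round-robin assignment of computers to employees are assumptions made for illustration; the article does not define computers.js.

```javascript
// Hypothetical computers.js: the naming scheme and round-robin assignment
// are assumptions for illustration only.
const generateComputers = (amount, employees) => {
  const computers = [];
  if (employees.length === 0) {
    return computers; // nobody to assign a computer to
  }
  for (let i = 0; i < amount; i++) {
    // Reuse a name from the employees collection so the two datasets agree.
    const owner = employees[i % employees.length];
    computers.push({
      computerName: `WORKSTATION-${String(i + 1).padStart(3, "0")}`,
      employeeName: owner.name,
    });
  }
  return computers;
};

const sampleEmployees = [{ name: "Jane Doe" }, { name: "John Smith" }];
console.log(generateComputers(3, sampleEmployees));
```

Every employeeName in the output is guaranteed to exist in the employees array, which is the property that matters when one collection references another.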


Conclusion

Congrats!! Now everything you need has been hooked together and the data you need should be in the database.

You can go to localhost:3050 and should see something like this:

With all of the names, titles, departments, etc. (except for the ones specified in localuser.json) being randomly generated.

The final big-picture folder structure of the application should look kinda like this:

api/
client/
mongo/
nginx/
docker-compose.yml

You can check out the GitHub repository to double-check against your version if you're having any issues.

Future Steps

  • Integrate the scripts with the TypeScript types used by the Mongoose schema
  • Have the script export BSON instead of JSON
