Continuous Integration is a best practice of modern software development. All professional software developers worth their salt practice continuous integration. Enterprise data management system implementations rarely practice continuous integration, however, because data management systems lag in adoption of most modern software development practices.
Poor quality, slow development cycles, and missed requirements are results of standard practices being a decade behind modern software engineering practices. An approach that includes continuous integration reduces all sorts of risk and speeds up development.
We have heard all the excuses. “Data management projects are not like regular software development,” is a common one. So is, “There is too much complexity to automate.” Whatever the excuse, it is just an excuse—it has been used (and overcome) before.
This is not an article about theory—this is a how-to. If you need more convincing that your team ought to be practicing CI, see some of the references at the end.
You have many options of how to implement principles of CI. This article will provide you with an inside look of processes and tools we use with HotQuant clients. Your final process could look quite different, but I hope this gives you an accelerated start.
The first section, Components, covers the parts you need in place to implement CI in data management systems. The second section, CI Process Steps, is about the process.
Source Code Repository
This one should be a no-brainer. You need a source control system to manage source code, checkins, branches, and centralization. Our favorites are GitHub and TFS. TFS is a more mature in the enterprise, but GitHub is gaining fast. Some data management platforms have their own source control built in. We regularly use RedGate SQL Source Control to bridge database development into the enterprise source control system.
A word on some not-so-obvious practices around source control.
. If your tool supports branching, clearly define a branching strategy
—but don’t get crazy. Each piece of a branching strategy should address a specific need.
Daily checkins. Culture or process should drive checkins daily at minimum—never once per week, per sprint, or per feature. Anything less than daily negates many of the benefits of CI.
Required comments. Allowing commits without comments seems so insignificant that it appears harmless. You will be grateful for the comments when you have to pore over hundreds of commits to trace an issue.
Database Virtualization (or alternative)
The new kid on the block is database virtualization. The old method required backups and restores, or scripting DDL and DML—manually or from source control. These are still valid, but time consuming (see above note on excuses). Database virtualization allows you to take bookmarks and restore database instances to a point in time, and fairly quickly.
Currently we are smitten with Delphix. This is not a trivial process, and Delphix has nailed it. What they do not provide is guidance on a database virtualization strategy. Things get out of control quickly when the team realizes how easy it it so make database copies. Enforce simplicity and consistency of process.
Alternatives are out there, and the old method is still valid. It just takes longer, reducing the number of builds you can fit into a time period. It is not the end of the world.
None of this will work if the only thing IT provides is a single QA/Test environment. You will need several environments, more than you might think. Sometimes approaches a ratio of 2:1 environments per developer.
Automated Test Scripts
Testing is at the core of CI. Without an automated test suite, CI loses value precipitously. Who cares if you can automate the build if you cannot verify it?
A note on testability. Some architectures are more “testable" than others. Mainstream software development architecture has gone through a revolution, driven by testing. Imagine that, the architecture of a system being changed to support testability! But such is the value of automated testing. As I stated earlier, data management implementation practices are a decade behind modern software development; many designs are less testable than others.
- Unit testing. We use tSQLt for database unit testing. RedGate has a nice plugin for it to SQL Server Management Studio. tSQLt is limited to testing database objects. To test data pipelines or processes, the heart of your data management system, we use hqTest.
- Integration and regression testing. Integration tests combine code from multiple teams. Regression tests make sure new code introduced in the build does not break existing code. For both, we use hqTest.
- UI-based testing. Some tests are really about UI functionality rather than data pipeline or database functionality. Selenium is a good open source framework for UI tests. Warning: limit UI testing, as it is the most complex, with the least value of automation.
This is fundamental. Good options are JetBrains TeamCity and Atlassian Bamboo. I will leave it to another day to compare the two. Microsoft TFS has CI capability as well, but we have only ever used it for .NET-based development projects.
Technically the CI server and Build server can be two different things, but are often combined into a single machine or set of machines.
The actual build of the code varies widely by the data management platform in play. The traditional approach is to script the build, but some platforms provide capability to export packages. That removes some of the scripting need, but you will still have to script the package export.
CI Process Steps
The CI server exercises the steps of the process linearly, either on a schedule (e.g. nightly) or on a triggered basis (e.g. on every checkin).
Often teams use a step-wise approach to implementing CI. First, define clear, specific steps in your build and CI processes. Then script each step. The scripts are what you plug in to the CI server for execution. You can implement the first scripted step while you are scripting (or defining) the remaining steps.
Create new build or packages
If the code fails to build or package properly, no need to continue the remaining steps. This is where your source control system and data management system’s packaging capabilities come into play.
Prepare an environment consistent with the test baseline
Your tests require a consistent starting point to ensure consistent testing results. Database virtualization is a godsend here if you have it available, a close second being tools with strong database source control. Alternatively, you will clear out database tables and other items containing state, and populate them with whatever contents are necessary to achieve the consistent starting point.
You must have an available environment to pull this off. Overwriting an active QA environment, accessible by testing individuals, is not an option.
Apply updated build or packages
This step applies the new build or packages to the environment prepped in the previous step. Note that the build or packages will be the same across all downstream environments. For example, you may have a process that performs CI on one environment, then manual testing in a QA environment, then deployment to production. Each of the environments will receive the same exact build.
Apply environment configuration parameters
This is a simple, but separate and important step. You must apply environment parameters such as connection strings, passwords, logging levels, etc. on a per-environment basis. These will be different across each downstream environment.
Run automated tests
Some tests take longer than others, and some tests are more valuable than others at different stages. More complex CI setups may have various stages, with earlier stages built to run faster and provide lower-level feedback; later stages running larger tests but running less frequently.
But for our purposes of this how-to, assume a single test suite that includes unit, integration, and regression testing. Run the test suite via plugins to the CI server or via command line, capture the test results, and fail the build if any tests fail.
Notify of failed builds and fix them ASAP
A core principle of CI is the mainline must be deployable at all times. When a build fails, notify someone using the built-in notification tools available in most CI servers. Have a development process that includes slack time to allow investigation and resolution of failed builds. Some CI servers can plug in to your ticketing or project management system to automatically create a task related to fixing the broken build.
The more frequent the builds, the more real-time the feedback will be—and the more likely a failed build will be due to something the developer is working on at the moment. The result is faster resolution.
Deploy successful builds to downstream environments
Very few mainstream software development teams employ fully automated, continuous delivery; likely no data management project team does so. The last step of the CI process is to push successful builds downstream for further verification by manual testing. The amount of manual testing remaining is a function of how much QA risk the CI process is able to cover through automated testing.