CI/CD for decision science: What is it, how does it work, and why does it matter?

CI/CD is how you ship higher quality code to production faster. But with decision science models, there’s a twist. We’ll look at the role of decision quality, specific testing tooling, and problem drift in CI/CD for decision science models.

Once upon a time, twenty years ago, shipping software to production took months for most, weeks if you were lucky, and years if you were unlucky. It involved individual developers delivering a series of change sets to release managers, who worked for weeks to coordinate and compile it all before CDs could get printed and sent through the mail. And if you had a bug fix? Well, that’d take some time.

Software development has come a long way in terms of frequently releasing higher quality software into production environments. Many have written about the rise of CI/CD (here’s one brief history), its impact on software development, and its place within the larger DevOps practice. But few have written about CI/CD within decision algorithm development and its place within the DecisionOps practice. 

While the common denominator across decision science algorithms and broader software is code, there are specific decision science considerations to account for in terms of how CI/CD practices get applied.  We’ve written about and shown what CI/CD looks like in this space, but haven’t yet taken a step back to simply look at what it is, how it works, and why it matters. Let’s dive in.

What is CI/CD for decision science?

CI/CD is the same for software development as it is for decision science, to a point. To understand CI/CD for decision science, let’s first start with what it is for general software development. There are loads of definitions out there, but I like this summary:

“CI is a software development practice in which developers merge their changes to the main branch many times per day. Each merge triggers an automated code build and test sequence, which ideally runs in less than 10 minutes. A successful CI build may lead to further stages of continuous delivery. If a build fails, the CI system blocks it from progressing to further stages. The team receives a report and repairs the build quickly, typically within minutes.” 

CI/CD enables development teams to ship high quality software to production environments faster and with fewer errors (and outages). It’s a positive feedback loop: the more you release, the better you get at releasing. The process relies on integrated testing to ensure new code changes don’t break any interactions upstream or downstream of your code, and then you’re off to the races. And it necessitates having the process, tooling, and infrastructure in place to make it all flow.

But when it comes to decision science applications, there’s a catch. When developing a decision model, you have to ask more than, “Does this decision service talk to this other application just as well as it did before?” or “Does this scale?” You also have to consider the decision quality the app outputs. You have to address the question, “Is this decision good for the business today?”

For example, let’s say I work on a vehicle routing decision app for a farm share company that delivers fresh goods to consumer homes. My production app accounts for business logic such as vehicle capacity and delivery windows, but we’re launching a new line of temperature-sensitive goods and I need to code in logic for matching those goods with refrigeration-ready vehicles. As I move forward, I immediately run into a classic optimization tradeoff. 

On the one hand, adding this new constraint is important to business growth and customer happiness (gotta keep that ice cream cold). On the other hand, adding this new constraint will increase my total route duration (which is what my app is trying to minimize) because not every vehicle can be used to service any stop and I may need to use more vehicles. 

How do I proceed? I test. I test that the new model “just works” with my production systems. Check. Standard CI/CD so far. But now things get interesting. I also need to test that it results in “good” decisions that make sense for the business today. For the latter, I use an acceptance test that’s kicked off as part of my CI/CD process. If my total route duration increases by more than a defined threshold of 2%, that prompts me to inspect other KPIs to see if a result outside that range is acceptable. In this case, I determine (along with my teammates) that the business benefits outweigh those derived from the previously set threshold. We move forward and update the acceptance test accordingly.
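As a sketch, such an acceptance test might look like the following. The function name, KPI values, and baseline are all hypothetical placeholders, not any particular platform’s API:

```python
# Acceptance test sketch: flag a CI run if total route duration
# regresses by more than a defined threshold versus the baseline run.
# Baseline and candidate durations are hypothetical KPI values pulled
# from historical and new model runs.

THRESHOLD = 0.02  # 2% allowed increase in total route duration

def acceptance_check(baseline_duration: float, candidate_duration: float,
                     threshold: float = THRESHOLD) -> bool:
    """Return True if the candidate's total route duration is within
    the accepted regression threshold relative to the baseline."""
    if baseline_duration <= 0:
        raise ValueError("baseline duration must be positive")
    relative_increase = (candidate_duration - baseline_duration) / baseline_duration
    return relative_increase <= threshold

# A 1.5% increase passes; a 3% increase prompts a KPI review.
assert acceptance_check(100.0, 101.5)
assert not acceptance_check(100.0, 103.0)
```

When the check fails, the point isn’t to block the change outright, but to trigger the human review described above: if the team decides the tradeoff is worth it, the threshold itself gets updated.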

CI/CD for decision science is defined by the need to ensure upstream and downstream cohesion (just like software development), but it must also account for the quality of the decisions it ships to operational environments. With decision science, the quality of the output from one service or system (e.g., a shift assignment model) has an impact on the next service (e.g., a vehicle routing model).

How does CI/CD for decision science work?

A few years ago, an engineering team wrote about their journey to CI/CD. They do a nice job capturing the benefits of the practice. Partway through the post, they ask why the practice wasn’t adopted within their company sooner. They wrote:

"Because it's hard. Not in the sense that CI/CD is complicated as a concept, but because the change involves a lot of work in different areas around process, tooling, and organizational alignment."

We can confidently say we're not alone in having experienced these roadblocks with decision science and operations research. There’s a whole ecosystem of amazing and incredibly varied decision optimization technology out there that is not consistently easy to deploy, test, and manage — and it’s often not super transparent.

We can also say that it’s more realistic than ever to overcome these challenges and stand up CI/CD workflows. Here are the pieces you need to consider:

Git-based code management and collaboration. Coding is a collaborative endeavor. Source control platforms such as GitHub, GitLab, and Bitbucket and CI/CD solutions like CircleCI, Travis CI, and Semaphore are the current standard for how modern code gets developed. These solutions offer greater visibility, automation, tracking, and collaboration features compared to circulating zip files among teammates. While the days of floppy discs are behind us (we hope), we’ve come across a number of teams who develop code using dated, inefficient approaches. 

Infrastructure to support decisions as microservices. A place your model can call home — maybe multiple homes across development, staging, and production. You need a way to get a set of API endpoints up and running quickly. This may be for operational workloads making real-world decisions or for test workloads that require production-ready infrastructure but provide insight into next steps. This may be in the cloud or self-managed.
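As a rough illustration of a decision model running as a microservice, here’s a minimal sketch using only Python’s standard library. The `solve` stub and the payload shape are hypothetical; a real app would call an actual optimization model:

```python
# Sketch: exposing a decision model as a small HTTP microservice.
# The routing logic is a stub; a real app would invoke your solver.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def solve(payload: dict) -> dict:
    """Stub decision function: assign every stop to the first vehicle.
    A real implementation would run the optimization model."""
    vehicles = payload.get("vehicles", [])
    stops = payload.get("stops", [])
    if not vehicles:
        return {"routes": [], "unassigned": stops}
    return {"routes": [{"vehicle": vehicles[0], "stops": stops}],
            "unassigned": []}

class DecisionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body and return the decision as JSON.
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        decision = solve(json.loads(body or b"{}"))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(decision).encode())

# To serve locally:
# HTTPServer(("localhost", 8080), DecisionHandler).serve_forever()
```

The same container can then be deployed to development, staging, and production environments, with only the endpoint URL changing between them.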

Testing tools. We think about decision model testing in two buckets: historical testing and online testing. Historical tests show what decisions would have been made and include batch tests, acceptance tests, scenario tests, and benchmarks. They use known, representative input sets to test across a set of KPIs. Online tests show what will happen now and include shadow tests and switchback tests. They use production data (representing current conditions) and out-of-sample inputs to derisk operational changes that could happen between development and launch.
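A historical batch test might be sketched like this: replay a set of known inputs through a baseline and a candidate model, then compare a KPI. The model callables and KPI accessor here are toy placeholders:

```python
# Batch test sketch: run historical inputs through a baseline and a
# candidate model, then compare a KPI (e.g., total route duration).
from statistics import mean

def batch_test(inputs, baseline_model, candidate_model, kpi):
    """Return the mean KPI for each model across all historical inputs."""
    baseline_scores = [kpi(baseline_model(i)) for i in inputs]
    candidate_scores = [kpi(candidate_model(i)) for i in inputs]
    return mean(baseline_scores), mean(candidate_scores)

# Toy example: each "model" returns a solution dict with a duration KPI.
inputs = [{"stops": 10}, {"stops": 20}]
baseline = lambda i: {"duration": i["stops"] * 5.0}
candidate = lambda i: {"duration": i["stops"] * 5.5}  # 10% slower
base_avg, cand_avg = batch_test(inputs, baseline, candidate,
                                lambda s: s["duration"])
```

An acceptance threshold can then be applied to the aggregate comparison, while online tests like shadow tests would run the candidate against live production inputs instead.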

Model management. In a similar way that GitHub provides concepts such as branches, repos, and Git tags for users to flexibly define their workflows, we think about managing decision models through the concepts of apps, versions, and instances: switching a new version to production, reviewing model run history, checking on run status, rolling back a version, and so on.
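To make those concepts concrete, here’s a hypothetical in-memory sketch of version and instance management with promotion and rollback. This is illustrative only, not a real platform’s API:

```python
# Sketch: minimal registry illustrating version/instance concepts.
# Promote a version to an instance (like "prod"), then roll it back.

class ModelRegistry:
    def __init__(self):
        self.versions = {}   # version id -> model artifact
        self.instances = {}  # instance name (e.g., "prod") -> version id
        self.history = []    # audit trail of (instance, old, new) promotions

    def add_version(self, version_id, artifact):
        self.versions[version_id] = artifact

    def promote(self, instance, version_id):
        """Point an instance at a version, recording the change."""
        self.history.append((instance, self.instances.get(instance), version_id))
        self.instances[instance] = version_id

    def rollback(self, instance):
        """Restore the previous version the instance pointed at."""
        for inst, old, _ in reversed(self.history):
            if inst == instance and old is not None:
                self.instances[instance] = old
                return old
        raise ValueError(f"no earlier version for {instance}")
```

The same audit trail also answers the “which version is running in prod?” question that comes up constantly across stakeholders.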

Collaboration features. Decision modeling is a team sport. There are many stakeholders involved: algorithm developers, product managers, operators, engineering leads, etc. Buy-in is key and questions are inevitable. “How are these changes to the model better?” “Why did the model make that decision?”  “How did our solution values improve over time?” “Which version is running in prod and what’s the run status from yesterday?” A shared UI/console accelerates building common understanding through transparency, sharing, and access. The ability to share test results, review routes and schedules, reference model instances and run history, and so on helps shortcut confusion.  

What does this look like in practice? Here are a few visuals to give insight into how we’ve gone about it. 

Pull request for decision model change and GitHub action that kicks off an acceptance test.
Testing workflow in the DecisionOps context.
Slack notification reporting benchmark results to the team.

Every time a developer makes a change to our codebase, they open up a PR that automatically kicks off an experiment for testing the impact of that change on our solutions. The results get returned and reported so that we're all aligned on what's merging into our stable branch. If the results fall outside of what's expected or accepted, the on-call dev investigates.

Why does CI/CD for decision science matter?

At this point, I hope it’s clear that if you’re actively and frequently developing new or existing decision models, CI/CD is a natural catalyst for shipping higher quality changes faster. You can introduce new product offerings, adapt to new SLAs, scale to new regions, and so on. While these are great benefits, they’re not the only ones. 

Let’s say you’re part of a decision science team that does not make frequent decision algorithm changes, say on the order of every six months to a year. Does CI/CD have a role to play? Yes, we think so. 

Code changes may happen at different rates. But while your code isn’t changing, your business and your data are, because the world has a tendency to change. What worked a year ago probably won’t work today. There could be a new worker payment model or buyer behavior changes. Many know this by the term problem drift. 

The danger here is that decision services start making poor decisions or no decisions at all — and possibly fail silently. Your website is showing 0-minute ETAs. Customer experience could be declining rapidly and you might not know it in time. And when a code change is needed, it’s hard to nimbly and confidently adapt when the development cadence is oriented around a months-long development cycle. 
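One lightweight guard against this failure mode is a sanity check on model outputs before they reach customers, e.g. flagging degenerate values like 0-minute ETAs. This is a hypothetical sketch with an illustrative solution shape:

```python
# Sanity-check sketch: flag degenerate decision outputs (such as
# zero-minute ETAs or empty route plans) before they reach production.

def sanity_check(solution: dict) -> list:
    """Return a list of human-readable issues; empty means the output passes."""
    issues = []
    if not solution.get("routes"):
        issues.append("no routes produced")
    for route in solution.get("routes", []):
        for stop in route.get("stops", []):
            if stop.get("eta_minutes", 0) <= 0:
                issues.append(f"non-positive ETA at stop {stop.get('id')}")
    return issues

ok = {"routes": [{"stops": [{"id": "s1", "eta_minutes": 12}]}]}
bad = {"routes": [{"stops": [{"id": "s1", "eta_minutes": 0}]}]}
```

Wiring a check like this into the deployment pipeline (or as a runtime alert) turns a silent failure into a loud one.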

Another “why” worth discussing is visibility and transparency across stakeholders. CI/CD for decision science models can improve the interaction and engagement with software teams deploying other backend services. It provides a standard mechanism for communicating changes and their impact to other departments in the organization. 

Lastly, and probably well-established at this point, CI/CD for decision models delivers higher quality code faster. This is better for the teams developing it and the customers impacted by it. When decision services work well, you don’t notice. But when they fail (and you accidentally take down a food delivery service for an entire country during lunchtime 😬), people notice (...because they’re hungry). CI/CD helps you ship with more confidence the first time and adapt quickly if an error still manages to slip by. 

Where to go from here

We’re big proponents of CI/CD practices for decision model development. While operations research and decision science have been around for a long time, their applications have only become more visible and talked about within the broader industry in the last 10 years. As more businesses look to benefit from decision models, we think it’s important to work with decisions as code — decision models can and should be developed like any other modern software.

We’re interested in hearing from others about their experiences in this space — successes, flops, observations, and protips — as we delve deeper in our own work.
