The road to production is paved with testing and experimentation

Do you launch decision model A or B? What happens if an operator makes a manual override? How does a new optimization model perform against real-world data? Testing and experimentation have the answers, but getting them has traditionally been a challenge.

Business decisions (and the decision support systems behind them) don't exist in a vacuum. For example, where to place an item on a shelf or how long it is available can affect its price and other products around it. Even decisions about which warehouse to store inventory in can affect shopper satisfaction with delivery times or the price of the item. We all know that the real world is complex, and implementing systems that rise to its challenges requires lots of iteration and testing.

I've spent a lot of time talking with teams about testing. (I've written about it too.) Just last week, our team was at INFORMS Analytics 2023, where optimization model testing resonated as a topic with many operations researchers, data scientists, and technology leaders trying to sustainably implement testing techniques in their organizations. In those conversations, common threads have emerged about where teams get stuck with testing for decision optimization.

First, there are business questions they grapple with: What KPI are we trying to impact? What are potential solutions to test? How should we test? What kind of test will give us the confidence to go live? How do we communicate test results effectively? Do the results require any changes to how we do business or how we train workers?

Second, there are technical questions to tackle: How do I know if all my constraints are being met? How does my value function change over time? How does the value function change from one solution to the next?

Finally, even if these teams are lucky enough to have some or all of the answers, there's the question of how to make the process behind them repeatable, consistent, and resilient. This is what we're looking to solve with Nextmv.

Optimization testing, that “someday” project

Early in my career, I worked as an engineer and analyst designing and testing naval systems using intricate simulations. These missile systems were complicated to test, and we couldn't deploy them until we had the necessary confidence of engineers, operators, and other military stakeholders. To achieve that, we had countless design reviews where we had to show progress on the same KPIs and charts, using the same process each time even when initiatives or features differed. 

The key to succeeding in that process: consistency. As humans, we're pretty good pattern matchers and interpreters, but if we present stakeholders with different KPIs and different charts using different processes each time, it slows things down and introduces the possibility of every engineer's dreaded state: failure to launch.

Achieving the right level of consistency starts with codifying experiments in the same way we like to think about codifying decisions. That might be through standardizing test sets or managing a test script so that you can present results to your stakeholders using a consistent framework for them to adopt and understand. Yet, setting up, implementing, and maintaining such a framework often requires a bespoke, home-grown system that frequently gets relegated to an aspirational “someday” project or becomes hard to maintain.
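To make that concrete, here's a minimal sketch of what codifying an experiment might look like: the test set, the KPI list, and the report format are pinned in code so every review follows the same process. The file layout, the run_model function, and the KPI names here are hypothetical placeholders, not a real API.

```python
# A minimal sketch of a codified experiment: the inputs, KPIs, and report
# format are fixed in code so every review uses the same process.
import json
from pathlib import Path

TEST_SET = sorted(Path("test-sets/high-volume").glob("*.json"))  # standardized inputs
KPIS = ["unassigned_stops", "total_travel_time", "vehicles_used"]  # fixed KPI list

def run_model(input_path: Path) -> dict:
    """Placeholder: invoke your decision model and return its statistics."""
    raise NotImplementedError  # swap in your own model runner

def run_experiment() -> list:
    results = []
    for path in TEST_SET:
        stats = run_model(path)
        results.append({"input": path.name, **{k: stats.get(k) for k in KPIS}})
    return results

if __name__ == "__main__":
    print(json.dumps(run_experiment(), indent=2))
```

Because the test set and KPI list live in version control, stakeholders see the same numbers presented the same way in every review, whatever the feature under test.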

At Nextmv, we’re working to bridge the gaps between product managers, data scientists, and data analysts working with decision optimization and the business operations impacted. To do so, we’re building out a cohesive decision testing and experimentation suite with collaboration in mind.

The pieces of the testing puzzle

Figuring out where to start with a testing framework is often the hardest part. We've spoken with a lot of operations researchers, data analysts, product managers, and decision engineers about how they think about the model testing process, and we've paired that input with our own experiences.

We decided to start with the simple yet foundational pieces: exposing run history, managing sets of input data from prior runs, and sharing apps and run views between teams. These are the components upon which we can engineer increasingly sophisticated model testing and experimentation techniques. From there, we've grouped the universe of optimization model testing into three main categories: local testing, batch testing, and production testing.

Local testing for decision models

“I want to ensure that the service time constraint I added to my delivery model works as expected”

  • Used to debug, develop, and iterate on new constraints, input/output, or the value function (KPI)
  • Involves small sample files that represent a few cases you care about
  • Includes debugging, unit testing, system testing, and more (see the sketch below)
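For example, here's a minimal sketch of a local unit test for the service time constraint from the quote above, assuming a hypothetical solve function that returns routes with per-stop arrival and departure times. The module name and input shape are illustrative, not a real interface.

```python
# A minimal sketch of a local unit test for a service time constraint.
import unittest

from my_delivery_model import solve  # hypothetical model under test

class ServiceTimeConstraintTest(unittest.TestCase):
    def test_departure_respects_service_time(self):
        # Small, hand-built input covering the one case we care about.
        input_data = {
            "stops": [{"id": "s1", "service_time": 300}],  # seconds
            "vehicles": [{"id": "v1"}],
        }
        output = solve(input_data)
        for route in output["routes"]:
            for stop in route["stops"]:
                served = stop["departure"] - stop["arrival"]
                self.assertGreaterEqual(
                    served, 300,
                    f"stop {stop['id']} served for {served}s, expected >= 300s",
                )

if __name__ == "__main__":
    unittest.main()
```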

Batch testing for decision models

“I am satisfied that adding a service time constraint to my model locally works as expected; now I want to understand how it performs on a dataset with larger order volumes”

  • Used to validate that our new model is ready for production tests and is likely to make a business impact
  • Involves collecting or generating a batch of input files (e.g., all high-volume days from the past 3 months or synthetically generated 10x volume) that represent a distribution of the scenarios you will see in real operations
  • Includes acceptance, regression, and compliance testing, benchmarking, scenario planning, simulation, and more (see the sketch below)
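As a sketch of what a batch test harness can look like, the following compares a baseline and a candidate model over the same set of input files and summarizes the distribution of a single KPI. The solve_baseline and solve_candidate imports, the input directory, and the KPI name are hypothetical placeholders.

```python
# A minimal sketch of a batch comparison: run baseline and candidate models
# over the same inputs and compare the distribution of one KPI.
import statistics
from pathlib import Path

from my_delivery_model import solve_baseline, solve_candidate  # hypothetical

def kpi(output: dict) -> float:
    return output["statistics"]["total_travel_time"]  # illustrative KPI

def batch_compare(input_dir: str) -> None:
    deltas = []
    for path in sorted(Path(input_dir).glob("*.json")):
        base, cand = solve_baseline(path), solve_candidate(path)
        deltas.append(kpi(cand) - kpi(base))  # negative = candidate is better
    print(f"mean delta:       {statistics.mean(deltas):.1f}")
    print(f"median delta:     {statistics.median(deltas):.1f}")
    print(f"worst regression: {max(deltas):.1f}")

if __name__ == "__main__":
    batch_compare("inputs/high-volume-last-3-months")
```

Summarizing the whole distribution (not just the mean) matters here: a candidate that wins on average but regresses badly on a few high-volume days may still fail an acceptance test.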

Production testing for decision models

“My batch tests even with large order volumes met business requirements and got stakeholders excited. Let’s see if my model makes the business impact on driver satisfaction I expected”

  • Used as go/no-go criteria for a full rollout of a decision model
  • Involves using the app live in a production environment
  • Includes shadow, switchback, and A/B testing, as well as blue/green deployments, and more (see the sketch below)
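As one illustration of production testing, here's a minimal sketch of a deterministic A/B split: a stable share of traffic is routed to the candidate model by hashing a request key, so the same region or customer always sees the same variant. The 10% share and the key format are illustrative assumptions.

```python
# A minimal sketch of a deterministic A/B split for production testing.
import hashlib

CANDIDATE_SHARE = 0.10  # illustrative fraction of traffic for the candidate

def variant(key: str) -> str:
    """Deterministically assign a request key to 'baseline' or 'candidate'."""
    digest = hashlib.sha256(key.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "candidate" if bucket < CANDIDATE_SHARE else "baseline"

# Usage: pick the model per request and log the variant alongside the KPIs
# so results can be compared downstream.
print(variant("region-42"))  # stable across calls for the same key
```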

Note that not every new model (or change to an existing model) requires all of these steps, but together they build confidence across the org and facilitate collaboration with very little context switching. The higher the risk of the change, the more testing is warranted.

With the ability to test during development (e.g., what’s the impact of this new constraint on model results?) and in production environments (e.g., how would my on-the-ground, real-time operations change with this new constraint?), you can build trust with stakeholders and have more confidence in the decisions being made. 

Collaborating builds confidence, confidence leads to launch, and launch leads to better user experiences. Get started with Nextmv's current testing capabilities by creating a free account. Stay tuned for a series of blogs and videos that will dive deeper into testing and for a future live tech talk on the topic!

May your solutions be ever improving 🖖

Editor's note (May 16, 2023): The banner image of this post was updated to change the term "Base experiments" to "Batch experiments."
