What’s a Magical System?
A magical system is any black-box system that gives indeterministic results. These systems’ performance is difficult to predict outside of the production environment, whether because of limitations in the training or testing data or because of drift in the input source (relevant input data slowly changing over time). A magic system converts fuzzy input noise into discrete signal output for the end users.
The problem is we can only guarantee the performance of the magic with hot real-life input.
Examples of magic systems
- Neural networks
- Machine learning models
- Fuzzy matching algorithms
- Computer vision
- Bayesian inference
- Systems with volatile inputs (raw emails, pdfs)
Properties of the magic system
- The system has some input and some output
- We can’t validate the quality of the input
- The system cannot measure the correctness of the output
- We can only measure the correctness of the system after the fact
- So we are not talking about “best effort” services.
The Challenge
We can only know how the magic performs once we use it. But when using magic professionally, we can usually only get away with using it if we can offer guarantees. It’s scary enough to deploy deterministic code to production, even after all the testing and acceptance. We need to deploy and monitor these systems over time. These are often business-critical systems that can’t afford silent defects reaching end users.
This blog is about developing architecture and strategies to construct these systems. Hopefully, it will inspire and inform you to think more thoroughly about building, testing, and deploying magic to production.
Input Drift
AI models work by analyzing data to make predictions. If you don’t update that data regularly, the models become outdated and less accurate. It’s like trying to predict the weather using last year’s forecast: it might be roughly right on average, but it won’t be reliable. That’s why it’s vital to keep retraining your AI models with new data; it ensures that your predictions stay relevant and reliable. The same is true for any predictive system. Regular expressions, stochastic systems, and rule-based matching built for today’s rules will grow stale and need constant updating.
End-User trust
Due to their complex nature and lack of transparency, explaining how magical systems work to end-users can be challenging. This challenge can lead to skepticism and distrust of the system, making gaining the user’s confidence tough.
Trust is not about accuracy. It’s about transparency. The most effective way to build trust is by providing hard data that the user can use to verify the system themselves. A user will trust a system with 90% accuracy that can explain itself comprehensively more than a system with 99% accuracy that is completely opaque.
Circle 0: Blind Magic
You might have something like this when setting up a POC for a magical system.
The Magic service is a black box that can predict a signal from a fuzzy piece of noise data. We need to put a client between the user and the system to pre-process the input and interpret the prediction.
Circle 0.1: Async Magic
Predictions will take a while. They will also fail often. We need to make this process asynchronous. This way, we can fetch predictions at our own pace and measure when predictions fail or time out without holding up the rest of the process.
0.2: Callback Magic
To avoid building inefficient polling systems, we can use a callback URL that the system will call when a prediction finishes. When the client responds to the callback and retrieves the prediction result, you can remove records for the completed prediction.
0.3: Event Magic
You will come to a point where the client requesting the prediction differs from the client that consumes it. Alternatively, the prediction could be consumed by multiple clients. In this case, we should use an event system to broadcast when a prediction finishes.
Now we have the seeds of a framework that can scale with the needs of a growing magical system. Deploying, testing, and growing a magical system for multiple clients can have unpredictable results. You don’t want to bottleneck your throughput with a naive synchronous call. Furthermore, an eventing system allows us to measure the system’s overall performance.
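To make the event pattern concrete, here is a minimal in-process sketch. In production this would be a real message broker, and every name here (the `PredictionEvents` class, the `prediction.finished` topic, the payload fields) is illustrative, not part of any specific system described above.

```python
# Minimal in-process sketch of the event pattern; all names are hypothetical.
from collections import defaultdict
from typing import Callable

class PredictionEvents:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: dict) -> None:
        for handler in self._subscribers[topic]:
            handler(payload)

events = PredictionEvents()

# Any number of clients can consume the same prediction without polling.
events.subscribe("prediction.finished", lambda p: print("client A got", p["prediction_id"]))
events.subscribe("prediction.finished", lambda p: print("client B got", p["prediction_id"]))

# The magic service broadcasts when a prediction resolves, fails, or times out.
events.publish("prediction.finished", {"prediction_id": "42", "status": "success"})
```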
Circle 1: Feedback
To manage a magic system, you need feedback. The nature of the magic system is indeterministic; you can only prove its correctness after deploying it. Measuring and managing performance can only truly happen in production with real data. The core principle in running these magic systems in production is consistent real-time feedback. Without it, you can’t improve your system or respond to defects before it’s too late.
Feedback for a magic system comes in the form of prediction evaluations. That is, a human user, with eyes and fingers, evaluates your system’s predictions.
Discrete predictions
Discrete predictions are values we can predict correctly or incorrectly without ambiguity. These are specific codes, symbols, strings, or enumerations, any classified values. These predictions can either be correct or incorrect.
There are four types of outcomes in an evaluation.
- Correct
- Incorrect
- Abstained
- Unevaluated
| Prediction ⬇ \ Evaluation ➡ | “Foo” | “Bar” | "" | null |
| --- | --- | --- | --- | --- |
| “Foo” | Correct | Incorrect | Incorrect | Unevaluated |
| "" | Incorrect | Incorrect | Correct | Unevaluated |
| null | Abstained | Abstained | Abstained | Abstained |
Empty values mean the feature does not apply to the input: the predictor is certain the value for this feature is not available in the given input. Abstained values mean that the predictor cannot confidently say anything about the feature: it is not certain and therefore abstains rather than giving an incorrect answer. In magic systems, incorrect answers are worse than abstentions because they undermine human confidence. The abstained evaluation is important. Sometimes you have a recognizer that is excellent at predicting specific cases and only makes predictions when it is relatively confident. You want to give preference to predictors that may abstain often but, when they do make a prediction, are spot on.
To do this, you must distinguish predictors that predict nothing from those that abstain. This way, you measure not only the accuracy of the different predictions but also their confidence.
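A minimal sketch of the table above, with predictions as rows and evaluations as columns. The convention here is an assumption: an empty string means “this feature does not apply,” and `None` means abstained (from the predictor) or unevaluated (from the evaluator).

```python
# Sketch of the evaluation table: "" = feature not present, None = abstained/unevaluated.
from enum import Enum
from typing import Optional

class Outcome(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    ABSTAINED = "abstained"
    UNEVALUATED = "unevaluated"

def evaluate_discrete(prediction: Optional[str], evaluation: Optional[str]) -> Outcome:
    if prediction is None:          # the predictor abstained
        return Outcome.ABSTAINED
    if evaluation is None:          # nobody has evaluated this prediction yet
        return Outcome.UNEVALUATED
    return Outcome.CORRECT if prediction == evaluation else Outcome.INCORRECT

assert evaluate_discrete("Foo", "Foo") is Outcome.CORRECT
assert evaluate_discrete("", "Foo") is Outcome.INCORRECT
assert evaluate_discrete(None, "Bar") is Outcome.ABSTAINED
assert evaluate_discrete("Foo", None) is Outcome.UNEVALUATED
```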
Continuous predictions
These are predictions of measurements like weight, positions, speed, color grading, pressure, temperature, mass, volume, etc.
Predictions of continuous values are more difficult to quantify, but you have two options:
You can record the accuracy of a predictor by adding up the differences between its predictions and their evaluations.
Or
You can give the predictor a tolerance. Treat results within tolerance as correct and anything else as incorrect.
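Both options can be sketched in a few lines. The function names and the choice of absolute difference are illustrative assumptions, not a prescribed metric.

```python
# Two sketches: cumulative error per predictor, or tolerance-based pass/fail per prediction.
def cumulative_error(predictions: list[float], evaluations: list[float]) -> float:
    """Sum of absolute differences; a lower total means a more accurate predictor."""
    return sum(abs(p - e) for p, e in zip(predictions, evaluations))

def within_tolerance(prediction: float, evaluation: float, tolerance: float) -> bool:
    """Treat anything inside the tolerance band as correct, anything else as incorrect."""
    return abs(prediction - evaluation) <= tolerance

print(cumulative_error([10.2, 9.8], [10.0, 10.0]))   # 0.4
print(within_tolerance(10.2, 10.0, tolerance=0.5))   # True
```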
Avoid probabilistic predictions
Note that the predictions above are never probabilities. Probabilities can only be evaluated continuously. If you make a probabilistic prediction and evaluate it with a discrete value, there’s no way for the system to improve.
Say your input is an insurance claim, and your magic system can provide the probability the claim will be covered. You can’t evaluate this signal with a discrete ‘It has been covered or not.’ The discrete feedback can only control the system’s sensitivity.
If your system, for some reason, cannot avoid probability results, they can only be evaluated by other probabilities. In the previous example, the way to evaluate the system’s prediction is for a professional risk assessor to provide their independent probability and feed it back to the system. This way, the system can build fitness.
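As a sketch of the insurance example: the system’s probability is scored against the risk assessor’s independent probability, not against the discrete “covered or not” outcome. The squared-difference metric here is just one reasonable choice, not a requirement.

```python
# Evaluate a probabilistic prediction with another probability, not a discrete outcome.
def probability_error(predicted_p: float, assessor_p: float) -> float:
    """Squared difference between two probability estimates; lower means better fitness."""
    return (predicted_p - assessor_p) ** 2

print(probability_error(0.80, 0.70))  # 0.010 -- close to the expert's estimate
print(probability_error(0.80, 0.10))  # 0.490 -- far from the expert's estimate
```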
1.1: Event-Driven Feedback
Build your system to gather its own feedback instead of waiting for input from the outside. This way, you can gather feedback from multiple sources instead of having each evaluator push results in whatever format it likes. By using events, the magic system can dictate the terms of evaluations.
1.2: Multiple Feedback
A prediction may be evaluated multiple times. Multiple humans could evaluate a prediction, or the evaluation could change over time.
The system could adhere to the first, last, mean, median, or mode (bucketed) evaluation, depending on the case.
1.3: Feedback processing
Evaluation of discrete features like strings and enumerations can be tricky. Localization and different evaluators can pollute your dataset with false failures. When your magic system predicted the string “Foo” and the user evaluated it with “foo,” should that count as correct or incorrect?
Some tricky situations can arise with the localization of monetary amounts:
- “$30 000”
- “$30,000”
- “$30000-”
- “30000.00USD”
- “30000,00000”
Or dates:
- “3-4-2022”
- “4/3/2022”
- “3.4.2022”
- “03.4.2022”
- “03.04.2022”
Your system must have some post-processing of incoming evaluations to conform them to a certain culture and format, eliminating endless false failures in your metrics.
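A rough sketch of such a normalization pass, assuming a single target culture (day-first dates, trailing two digits treated as decimals). The heuristics and function names are illustrative; real rules depend entirely on your domain and locales.

```python
# Hypothetical normalization of incoming evaluations before comparing to predictions.
import re
from datetime import datetime

def normalize_amount(raw: str) -> float:
    """Heuristic: a trailing 2-digit group is decimals; other separators are grouping."""
    cleaned = re.sub(r"[^\d.,]", "", raw)                 # drop "$", "USD", spaces, "-"
    match = re.fullmatch(r"(\d[\d.,]*?)[.,](\d{2})", cleaned)
    if match:
        integer, decimals = match.groups()
        return float(re.sub(r"[.,]", "", integer) + "." + decimals)
    return float(re.sub(r"[.,]", "", cleaned))

def normalize_date(raw: str, day_first: bool = True) -> str:
    """Parse a handful of known formats into ISO 8601."""
    formats = ["%d-%m-%Y", "%d/%m/%Y", "%d.%m.%Y"] if day_first else ["%m-%d-%Y", "%m/%d/%Y", "%m.%d.%Y"]
    for fmt in formats:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw}")

print(normalize_amount("$30,000"), normalize_amount("30000.00USD"))  # 30000.0 30000.0
print(normalize_date("3-4-2022"), normalize_date("03.04.2022"))      # 2022-04-03 2022-04-03
```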
Discrete evaluations
When dealing with discrete evaluations, you want some way to determine what the authoritative evaluation is. Here are some strategies for picking an authoritative evaluation:
- the modal value
- reliability based on source
- initial evaluation
- terminal evaluation
You should pick a strategy that makes the most sense for your selected feature.
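Each of these strategies is a small reduction over the set of evaluations. A sketch, assuming evaluations arrive as `(source, value, timestamp)` tuples; that shape and the reliability table are assumptions for illustration.

```python
# Hedged sketches of the authoritative-evaluation strategies above.
from collections import Counter

def modal_value(evaluations: list[tuple[str, str, int]]) -> str:
    return Counter(value for _, value, _ in evaluations).most_common(1)[0][0]

def most_reliable_source(evaluations, source_reliability: dict[str, float]) -> str:
    return max(evaluations, key=lambda e: source_reliability.get(e[0], 0.0))[1]

def initial_evaluation(evaluations) -> str:
    return min(evaluations, key=lambda e: e[2])[1]

def terminal_evaluation(evaluations) -> str:
    return max(evaluations, key=lambda e: e[2])[1]

evals = [("alice", "Foo", 1), ("bob", "Bar", 2), ("carol", "Foo", 3)]
print(modal_value(evals))                                       # "Foo"
print(most_reliable_source(evals, {"bob": 0.9, "alice": 0.5}))  # "Bar"
print(initial_evaluation(evals), terminal_evaluation(evals))    # "Foo" "Foo"
```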
Continuous evaluations
When dealing with continuous evaluations, you can still apply the strategies above. In addition, you could take either the mean of all evaluations or even use a statistical clustering model. But avoid using these techniques on discrete evaluations.
Circle 2: Sharding Predictions
When creating a signal from the noise, it is important to split it into several features. If we don’t do that, we will provide an all-or-nothing prediction that is too brittle to be useful. Splitting the signal into several features allows us to measure the accuracy of the predictions of each feature.
Slicing the magic
An all-or-nothing prediction is unacceptable in production. Any amount of drift or input data divergence could invalidate an entire signal. For this reason, we must split our magical system into several predictors. Each predictor is a separate module responsible for creating predictions for a feature. This way, if one predictor is defective or suffering from drift, it will only affect the quality of one of the features and not the entire signal.
Even if one of our predictors is ineffective, we can still guarantee that most of our signal is useful to the end user. But this alone is not enough to guarantee continuous quality.
Circle 3: Competition
To ensure performance even on an individual feature level, we need to diversify our holdings. The system must be able to let multiple predictors create a prediction for the same feature.
Here we see that PredictorC and PredictorD both predict Feature 3. A subsystem called a Judge exists to decide which prediction it should choose to be passed to the user.
The Judge
The Judge is a subsystem that can use any logic to pick one result as the one true result. The Judge doesn’t need to pick between predictors but rather between the predicted values.
Here are some strategies a prediction judge could use:
- Pick the result value of the historically best-performing predictor
- Use metadata to pick the best performer in the context
- Pick the modal value of all predictions for this run
- Rotate results from different predictors equally
- Use any stochastic system
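Here is a minimal Judge sketch combining two of the strategies above: prefer the modal predicted value and break ties with the historically best-performing predictor. The predictor names and the `historical_accuracy` table are illustrative assumptions.

```python
# A hypothetical Judge: modal value first, historical accuracy as the tie-breaker.
from collections import Counter

def judge(predictions: dict[str, str], historical_accuracy: dict[str, float]) -> str:
    """predictions maps predictor name -> predicted value for one feature."""
    counts = Counter(predictions.values())
    best_count = max(counts.values())
    candidates = {value for value, count in counts.items() if count == best_count}
    if len(candidates) == 1:
        return candidates.pop()
    # Tie-break: take the value proposed by the historically strongest predictor.
    best_predictor = max(
        (name for name, value in predictions.items() if value in candidates),
        key=lambda name: historical_accuracy.get(name, 0.0),
    )
    return predictions[best_predictor]

print(judge({"PredictorC": "owl", "PredictorD": "hawk", "PredictorE": "owl"},
            {"PredictorC": 0.7, "PredictorD": 0.9, "PredictorE": 0.6}))  # "owl"
```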
Weighted Predictors
If one of your predictors is defective and fails, the system should be able to fall back on the remaining predictors and down-weight the failing one over time.
Dry-Running Predictors
You can tell the system to ignore results from certain predictors. This crucial mechanic lets you measure performance before exposing potentially false results to the end user.
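A small sketch of dry-running: every prediction is still recorded (and later evaluated), but dry-run predictors are filtered out before the Judge sees them. The names are illustrative.

```python
# Hypothetical dry-run filter: measure everything, expose only live predictors.
def live_predictions(predictions: dict[str, str], dry_run: set[str]) -> dict[str, str]:
    return {name: value for name, value in predictions.items() if name not in dry_run}

all_predictions = {"PredictorC": "owl", "PredictorExperimental": "hawk"}
store_for_metrics = all_predictions                      # everything is measured
for_judging = live_predictions(all_predictions, {"PredictorExperimental"})
print(for_judging)                                       # {'PredictorC': 'owl'}
```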
Circle 4: Multi-Predictors
The next step is to allow predictors to predict multiple features. You could set up separate predictors for each possible feature, but that quickly becomes inefficient. When we make different predictors compete, we are really measuring the performance of different prediction methods. So ideally, you would have a predictor equipped with a certain method (e.g., statistics, ANNs, metadata rules) produce as many features as possible.
Here we see that Predictor C and D both predict Feature 3, and the Judge can decide which method is a better fit to make the prediction.
Circle 5: Dependencies
Very quickly, you will find that to predict a feature, we depend on existing information. If we want to predict the topic of a text, the system first needs to predict the language the text was written in. The best method for predicting the language might be different from the best method for predicting the topic. In this case, predicting the topic depends on the language; language prediction is a dependency of the topic predictor.
We achieve this by splitting the entire prediction pipeline into several layers, each dependent on the output of the layer before. If each predictor declares its dependencies, the system can build a dependency tree and use a DFS-style pass to generate the layers automatically.
Each layer should have a judge because you will come to a situation where there will be multiple predictor sources for a dependency. Each layer must output a single prediction per feature.
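A sketch of that layering, assuming each predictor declares the features it produces and the features it needs. This simple pass (closer to Kahn’s algorithm than raw DFS) groups predictors into layers that can run in order; the predictor names and dictionary shape are assumptions.

```python
# Hypothetical layering of predictors by their declared dependencies.
def build_layers(predictors: dict[str, dict]) -> list[list[str]]:
    """predictors maps name -> {"produces": set of features, "needs": set of features}."""
    layers, available, remaining = [], set(), dict(predictors)
    while remaining:
        layer = [name for name, spec in remaining.items() if spec["needs"] <= available]
        if not layer:
            raise ValueError("Circular or unsatisfiable dependencies")
        for name in layer:
            available |= remaining.pop(name)["produces"]
        layers.append(layer)
    return layers

print(build_layers({
    "language": {"produces": {"language"}, "needs": set()},
    "topic":    {"produces": {"topic"},    "needs": {"language"}},
}))  # [['language'], ['topic']]
```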
Circle 6: Given-Features
We apply the same logic of dependencies to the input layer. Sometimes we have given features: data provided to us before prediction begins. We should handle them like any other dependency. Sometimes this is metadata; other times it is a best guess from a different system.
Once you deal with a sophisticated production system like this, you will find that your provided data is not always reliable. This is why it’s important to also measure the performance of the sources of your given features.
Remember that everything in this business is a migration. You will likely be implementing a prediction system alongside an existing service. To ensure safety and measure performance, you should pass the results of the legacy system as Given-Features to your new Magic System. This way, you can measure performance and safely migrate load to your new (way better) predictors.
Circle 7: Manual bias
The system prescribed so far is self-correcting. However, we have designed it to allow for control in the face of fuzziness, so being able to influence and override the outcomes manually is key. We can’t control the output of the predictors directly, but we don’t need to if we can control the outcome of the judging phase.
Laconic predictors
Judges by themselves should use statistical analysis of each predictor’s historical performance to pick a winning prediction, but that is not enough to cover edge cases. For example, you could have a ‘laconic’ predictor: a predictor with a high level of accuracy but a low response rate. This predictor usually abstains, but when it does provide a prediction, it is highly accurate. This kind of predictor can be very valuable, but more eager and less accurate predictors might overrule it. To solve this, you should be able to manually give the laconic predictor preferential treatment by the judges. The Judge considers the laconic predictor’s result first; if it abstains, the Judge considers the others equally.
Weight Ranking
Following this, you should be able to manually rank each of your predictors by confidence. Judges will consider the ranking first and apply their statistical analysis only as a tie-breaker. This sequence allows for the greatest manual control while still allowing for drift and fitness.
Use the manual ranking system sparingly! The Judge’s tie-breaker lets the system self-heal in the case of drift. If you use it too rigidly, you’ve created a fragile rule-based system.
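A sketch of that ordering: manual rank first, historical accuracy only to break ties. Predictors that abstain are assumed to be absent from `predictions`, so a highly ranked laconic predictor only wins when it actually answers. All names and numbers are illustrative.

```python
# Hypothetical judge honoring a manual ranking before statistical tie-breaking.
def judge_with_ranking(predictions: dict[str, str],
                       manual_rank: dict[str, int],
                       historical_accuracy: dict[str, float]) -> str:
    # Lower rank number = higher manual confidence; unranked predictors sort last.
    best = min(
        predictions,
        key=lambda name: (manual_rank.get(name, float("inf")),
                          -historical_accuracy.get(name, 0.0)),
    )
    return predictions[best]

print(judge_with_ranking(
    {"Laconic": "owl", "Eager": "hawk"},
    manual_rank={"Laconic": 1, "Eager": 2},
    historical_accuracy={"Laconic": 0.95, "Eager": 0.80},
))  # "owl" -- the laconic predictor is preferred whenever it answers at all
```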
Multiple judges
If you have multiple tie-breaking systems in mind and don’t know which is best, you could run them both in parallel and measure the results separately. This method effectively creates multiple judges that a meta-judge can evaluate. It will explode the variance in your system and bury it in overhead. I’ve run multiple judges in the past, and the juice has never been worth the squeeze. Do not recommend it!
Circle 8: Conditional Ranking
Note that the historically best performer will give you good results overall. However, it could cause continuous problems for edge cases. This is similar to the laconic predictor problem, but it can be determined beforehand: sometimes you know a certain predictor is very good at predicting under certain conditions.
Example
In a system predicting the number of birds in a picture, one predictor called ’night vision’ is very good at counting the birds, but only in pictures taken at night. We know when a picture was taken at night, or another predictor can reliably predict it. We can’t set the ranking to always prefer night vision, because most of our incoming pictures are day shots and night vision is still very eager. We need multiple rankings.
Multiple rankings
This is similar to the multiple-judges problem, but we can determine which ranking to use beforehand. Each ranking of predictors comes with a condition. In the example above, we would have one default ranking and one ’night vision’ ranking that is only applied when we know the picture we are predicting is a night picture. This automatically applies the best predictors to the most applicable cases.
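A sketch of conditional rankings: each ranking carries a condition over the operation’s context, and the first matching condition wins, with a default ranking as fallback. The context key `is_night` and the predictor names are assumptions for the bird example.

```python
# Hypothetical conditional ranking selection for the night-vision example.
def select_ranking(context: dict, conditional_rankings: list[tuple], default: dict) -> dict:
    for condition, ranking in conditional_rankings:
        if condition(context):
            return ranking
    return default

night_ranking   = {"NightVision": 1, "DayCounter": 2}
default_ranking = {"DayCounter": 1, "NightVision": 2}
rankings = [(lambda ctx: ctx.get("is_night", False), night_ranking)]

print(select_ranking({"is_night": True}, rankings, default_ranking))   # night ranking
print(select_ranking({"is_night": False}, rankings, default_ranking))  # default ranking
```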
Circle 9: Prediction Arrays
When making a prediction, sometimes your system will return a collection of results. When dealing with a collection of signals, you need to keep in mind which of two kinds of signal arrays you are dealing with:
Homogenerative arrays (Duplicate entities expected)
Say we have a doggie photobook, and we have to count the entrants.
A predictor will spit out an array of doggies, some of which will be the same doggie.
```json
{
  "results": {
    "A": 2,
    "B": 7,
    "C": 3,
    "D": 1
  }
}
```
In this case, we can expect duplicate items of the same entity. We can use this signal to derive all kinds of things.
Heterogenerative arrays (Duplicates unexpected)
However, if we have a single picture of a doggo birthday party, then the system will return an array of unique doggos.
```json
{
  "result": ["A", "B", "C", "D"]
}
```
Any duplicates should be either consolidated and added to the result or ignored due to low confidence.
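A small sketch of the two shapes, starting from the same raw detections: homogenerative results keep the counts per entity (duplicates are the signal), while heterogenerative results consolidate duplicates into a set of unique entities. The detection list is made up for illustration.

```python
# Homogenerative vs. heterogenerative handling of one raw detection array.
from collections import Counter

detections = ["A", "B", "B", "C", "B", "B", "D", "B", "B", "B", "A", "C", "C"]

homogenerative = dict(Counter(detections))   # duplicates are the signal
heterogenerative = sorted(set(detections))   # duplicates are consolidated away

print(homogenerative)    # {'A': 2, 'B': 7, 'C': 3, 'D': 1}
print(heterogenerative)  # ['A', 'B', 'C', 'D']
```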
Evaluating Arrays
When evaluating arrays, we come up against a problem. Array items are discrete predictions, but evaluating the whole array as a single discrete prediction is rarely practical. The larger the array, the higher the chance that some discrepancy will invalidate your entire prediction.
For this reason, it is more practical to treat arrays as continuous predictions. Score the prediction based on how many items in your array were present in the evaluation, and take penalties for items present in the evaluation but not in the prediction.
```python
# Something like this
def calculate_accuracy(prediction, evaluation):
    """Score an array prediction against its evaluation as a continuous value."""
    total_items = len(evaluation)
    if total_items == 0:
        return 0.0
    correct_items = 0
    for item in prediction:
        if item in evaluation:           # the predicted item was present in the evaluation
            correct_items += 1
    return correct_items / total_items   # items missed by the prediction lower the score implicitly
```
Circle 10: Nested Predictions
As soon as we are predicting arrays, the requirement to predict features of the individual array items will arise. Once we predict who came to the doggo’s birthday party, it’s only a matter of time before we’re asked what each guest was wearing, their name and color, and what they brought as a gift.
This problem is the final level of magic in production. If you have set up your system correctly, an individual predictor can start a new sub-operation targeting a different set of features as part of an existing operation. The predictor provides the entire context as given features for the new operation. The hosting operation continues when this new sub-operation resolves, and the predictor uses the results as part of its prediction.
Recursive Evaluation
Each operation must carry its recursively created sub-operations with it to the evaluator. To the consumer, the evaluation happens all at once, but the system must know which part of the evaluation belongs to which sub-operation.
If we account for each feature and subfeature in reverse, the system can fully reference itself recursively.
All you need to predict and measure ever more granular levels of sub-features is more compute and patience.
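A sketch of the nested structure, using the birthday-party example: each operation carries its sub-operations, so an evaluation can be routed recursively to the operation that produced each feature. The `Operation` dataclass, field names, and nested evaluation shape are all illustrative assumptions.

```python
# Hypothetical nested operations with recursive evaluation routing.
from dataclasses import dataclass, field

@dataclass
class Operation:
    features: dict[str, str]                       # feature name -> predicted value
    sub_operations: dict[str, "Operation"] = field(default_factory=dict)

def apply_evaluation(op: Operation, evaluation: dict) -> dict:
    """Walk prediction and evaluation together, scoring each level recursively."""
    results = {name: op.features[name] == evaluation.get(name) for name in op.features}
    for key, sub_op in op.sub_operations.items():
        results[key] = apply_evaluation(sub_op, evaluation.get(key, {}))
    return results

party = Operation(
    features={"guest_count": "4"},
    sub_operations={"guest_A": Operation(features={"name": "Rex", "hat": "party cone"})},
)
print(apply_evaluation(party, {"guest_count": "4", "guest_A": {"name": "Rex", "hat": "bowtie"}}))
# {'guest_count': True, 'guest_A': {'name': True, 'hat': False}}
```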
Conclusion
Wrangling indeterministic systems in production should be considered carefully. The demands of these magical systems grow exponentially with their effectiveness. In addition, the drift of noisy inputs makes some loss of control inevitable. With continuous and safe experimentation and ample failover, your end-users will gain trust in your magical system.
If you’ve made it this far, please shoot me a message. What do you think of all this? This piece is a culmination of three different real-world applications. Each tackles several of these problems, but no single one spans all ten circles. I’m curious about other real-world applications that try to safely deliver non-deterministic results to end users and tackle the problems of drift and granularity.
At the end of all of this, we see why such things are reserved for the largest, most dedicated, ambitious, and naive teams. If you have such a team and the juice is worth the squeeze, learn from the lessons of those of us who have built and migrated these systems and used them not as research projects but in the domain of tangible consequences. For everybody else, if you can at all avoid it, never build something like this yourself.