What’s a Magical System?
A magical system is any black-box system that gives indeterministic results. These systems’ performance is difficult to predict outside of the production environment, whether because of limitations in the training or testing data or because of drift in the input source (relevant input data slowly changing over time). A magic system converts fuzzy input noise into discrete signal output for the end users.
The problem is we can only guarantee the performance of the magic with hot real-life input.
Examples of magic systems
- Neural networks
- Machine learning models
- Fuzzy matching algorithms
- Computer vision
- Bayesian inference
- Systems with volatile inputs (raw emails, pdfs)
Properties of the magic system
- The system has some input and some output
- We can’t validate the quality of the input
- The system cannot measure the correctness of the output
- We can only measure the correctness of the system after the fact
- So we are not talking about “best effort” services.
The Challenge
We can only know how the magic performs once we use it. But when using magic professionally, we can usually only get away with using it if we can offer guarantees. It’s scary enough to deploy deterministic code to production, even after all the testing and acceptance. We need to deploy and monitor these systems over time. These are often business-critical systems that can’t afford silent defects reaching end users.
This blog is about developing architecture and strategies to construct these systems. Hopefully, it will inspire and inform you to think more thoroughly about building, testing, and deploying magic to production.
Input Drift
AI models work by analyzing data to make predictions. If you don’t update that data regularly, the models become outdated and less accurate. It’s like trying to predict the weather using last year’s forecast: it might be roughly right on average, but it won’t be reliable. That’s why it’s vital to keep retraining your AI models with new data; it ensures that your predictions stay relevant and reliable. The same is true for any predictive system. Regular expressions, stochastic systems, and rule-based matching built for today’s rules will grow stale and need constant updating.
End-User trust
Due to their complex nature and lack of transparency, explaining how magical systems work to end-users can be challenging. This challenge can lead to skepticism and distrust of the system, making gaining the user’s confidence tough.
Trust is not about accuracy. It’s about transparency. The most effective way to build trust is by providing hard data that the user can use to verify the system themselves. A user will trust a system with 90% accuracy that can explain itself comprehensively more than a system with 99% accuracy that is completely opaque.
Circle 0: Blind Magic
You might have something like this when setting up a POC for a magical system.
The Magic service is a black box that can predict a signal from a fuzzy piece of noise data. We need to put a client between the user and the system to pre-process the input and interpret the prediction.
Circle 0.1: Async Magic
Predictions will take a while. They will also fail often. We need to make this process asynchronous. This way, we can fetch predictions at our own pace and measure when predictions fail or time out without holding up the rest of the process.
0.2: Callback Magic
To avoid building inefficient polling systems, we can use a callback URL that the system will call when a prediction finishes. When the client responds to the callback and retrieves the prediction result, you can remove records for the completed prediction.
0.3: Event Magic
You will come to a point where the client requesting the prediction differs from the client that consumes it. Alternatively, the prediction could be consumed by multiple clients. In this case, we should use an event system to broadcast when a prediction finishes.
Now we have the seeds of a framework that can scale with the needs of a growing magical system. Deploying, testing, and growing a magical system for multiple clients can have unpredictable results. You don’t want to bottleneck your throughput with a naive synchronous call. Furthermore, an eventing system allows us to measure the system’s overall performance.
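To make the event pattern concrete, here is a minimal in-process sketch. In production this would be a real message broker, and every name here (the `PredictionEvents` class, the `prediction.finished` topic, the payload fields) is illustrative, not part of any specific system described above.

```python
# Minimal in-process sketch of the event pattern; all names are hypothetical.
from collections import defaultdict
from typing import Callable

class PredictionEvents:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: dict) -> None:
        for handler in self._subscribers[topic]:
            handler(payload)

events = PredictionEvents()

# Any number of clients can consume the same prediction without polling.
events.subscribe("prediction.finished", lambda p: print("client A got", p["prediction_id"]))
events.subscribe("prediction.finished", lambda p: print("client B got", p["prediction_id"]))

# The magic service broadcasts when a prediction resolves, fails, or times out.
events.publish("prediction.finished", {"prediction_id": "42", "status": "success"})
```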
Circle 1: Feedback
To manage a magic system, you need feedback. The nature of the magic system is indeterministic; you can only prove its correctness after deploying it. Measuring and managing performance can only truly happen in production with real data. The core principle in running these magic systems in production is consistent real-time feedback. Without it, you can’t improve your system or respond to defects before it’s too late.
Feedback for a magic system comes in the form of prediction evaluations. That is, a human user, with eyes and fingers, evaluates your system’s predictions.
Discrete predictions
Discrete predictions are values we can predict correctly or incorrectly without ambiguity. These are specific codes, symbols, strings, or enumerations, any classified values. These predictions can either be correct or incorrect.
There are four types of outcomes in an evaluation.
- Correct
- Incorrect
- Abstained
- Unevaluated
| Prediction ⬇ \ Evaluation ➡ | “Foo” | “Bar” | "" | null |
| --- | --- | --- | --- | --- |
| “Foo” | Correct | Incorrect | Incorrect | Unevaluated |
| "" | Incorrect | Incorrect | Correct | Unevaluated |
| null | Abstained | Abstained | Abstained | Abstained |
Empty values mean the feature does not apply to the input: the predictor is certain the value for this feature is not available in the given input. Abstained values mean that the predictor cannot confidently say anything about the feature: it is not certain and therefore abstains rather than giving an incorrect answer. In magic systems, incorrect answers are worse than abstentions because they undermine human confidence. The abstained evaluation is important. Sometimes you have a recognizer that is excellent at predicting specific cases and only makes predictions when it is relatively confident. You want to give preference to predictors that may abstain often but, when they do make a prediction, are spot on.
To do this, you must distinguish predictors that predict nothing from those that abstain. This way, you measure not only the accuracy of the different predictions but also their confidence.
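A minimal sketch of the table above, with predictions as rows and evaluations as columns. The convention here is an assumption: an empty string means “this feature does not apply,” and `None` means abstained (from the predictor) or unevaluated (from the evaluator).

```python
# Sketch of the evaluation table: "" = feature not present, None = abstained/unevaluated.
from enum import Enum
from typing import Optional

class Outcome(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    ABSTAINED = "abstained"
    UNEVALUATED = "unevaluated"

def evaluate_discrete(prediction: Optional[str], evaluation: Optional[str]) -> Outcome:
    if prediction is None:          # the predictor abstained
        return Outcome.ABSTAINED
    if evaluation is None:          # nobody has evaluated this prediction yet
        return Outcome.UNEVALUATED
    return Outcome.CORRECT if prediction == evaluation else Outcome.INCORRECT

assert evaluate_discrete("Foo", "Foo") is Outcome.CORRECT
assert evaluate_discrete("", "Foo") is Outcome.INCORRECT
assert evaluate_discrete(None, "Bar") is Outcome.ABSTAINED
assert evaluate_discrete("Foo", None) is Outcome.UNEVALUATED
```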
Continuous predictions
These are predictions of measurements like weight, positions, speed, color grading, pressure, temperature, mass, volume, etc.
Predictions of continuous values are more difficult to quantify, but you have two options:
You can record the accuracy of a predictor by adding up the differences between its predictions and their evaluations.
Or
You can give the predictor a tolerance. Treat results within tolerance as correct and anything else as incorrect.
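Both options can be sketched in a few lines. The function names and the choice of absolute difference are illustrative assumptions, not a prescribed metric.

```python
# Two sketches: cumulative error per predictor, or tolerance-based pass/fail per prediction.
def cumulative_error(predictions: list[float], evaluations: list[float]) -> float:
    """Sum of absolute differences; a lower total means a more accurate predictor."""
    return sum(abs(p - e) for p, e in zip(predictions, evaluations))

def within_tolerance(prediction: float, evaluation: float, tolerance: float) -> bool:
    """Treat anything inside the tolerance band as correct, anything else as incorrect."""
    return abs(prediction - evaluation) <= tolerance

print(cumulative_error([10.2, 9.8], [10.0, 10.0]))   # 0.4
print(within_tolerance(10.2, 10.0, tolerance=0.5))   # True
```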
Avoid probabilistic predictions
Note that the predictions above are never probabilities. Probabilities can only be evaluated continuously. If you make a probabilistic prediction and evaluate it with a discrete value, there’s no way for the system to improve.
Say your input is an insurance claim, and your magic system can provide the probability the claim will be covered. You can’t evaluate this signal with a discrete ‘It has been covered or not.’ The discrete feedback can only control the system’s sensitivity.
If your system, for some reason, cannot avoid probability results, they can only be evaluated by other probabilities. In the previous example, the way to evaluate the system’s prediction is for a professional risk assessor to provide their independent probability and feed it back to the system. This way, the system can build fitness.
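As a sketch of the insurance example: the system’s probability is scored against the risk assessor’s independent probability, not against the discrete “covered or not” outcome. The squared-difference metric here is just one reasonable choice, not a requirement.

```python
# Evaluate a probabilistic prediction with another probability, not a discrete outcome.
def probability_error(predicted_p: float, assessor_p: float) -> float:
    """Squared difference between two probability estimates; lower means better fitness."""
    return (predicted_p - assessor_p) ** 2

print(probability_error(0.80, 0.70))  # 0.010 -- close to the expert's estimate
print(probability_error(0.80, 0.10))  # 0.490 -- far from the expert's estimate
```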
1.1: Event-Driven Feedback
Build your system to gather its own feedback instead of waiting for input from the outside. This way, you can gather feedback from multiple sources instead of having each evaluator push results in whatever format it likes. By using events, the magic system can dictate the terms of evaluations.
1.2: Multiple Feedback
A prediction may be evaluated multiple times. Multiple humans could evaluate a prediction, or the evaluation could change over time.
The system could adhere to the first, last, mean, median, or mode (bucketed) evaluation, depending on the case.
1.3: Feedback processing
Evaluation of discrete features like strings and enumerations can be tricky. Localization and different evaluators can pollute your dataset with false failures. When your magic system predicted the string “Foo” and the user evaluated it with “foo,” should that count as correct or incorrect?
Some tricky situations can arise with the localization of monetary amounts:
- “$30 000”
- “$30,000”
- “$30000-”
- “30000.00USD”
- “30000,00000”
Or dates:
- “3-4-2022”
- “4/3/2022”
- “3.4.2022”
- “03.4.2022”
- “03.04.2022”
Your system must have some post-processing of incoming evaluations to conform them to a certain culture and format, eliminating endless false failures in your metrics.
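A rough sketch of such a normalization pass, assuming a single target culture (day-first dates, trailing two digits treated as decimals). The heuristics and function names are illustrative; real rules depend entirely on your domain and locales.

```python
# Hypothetical normalization of incoming evaluations before comparing to predictions.
import re
from datetime import datetime

def normalize_amount(raw: str) -> float:
    """Heuristic: a trailing 2-digit group is decimals; other separators are grouping."""
    cleaned = re.sub(r"[^\d.,]", "", raw)                 # drop "$", "USD", spaces, "-"
    match = re.fullmatch(r"(\d[\d.,]*?)[.,](\d{2})", cleaned)
    if match:
        integer, decimals = match.groups()
        return float(re.sub(r"[.,]", "", integer) + "." + decimals)
    return float(re.sub(r"[.,]", "", cleaned))

def normalize_date(raw: str, day_first: bool = True) -> str:
    """Parse a handful of known formats into ISO 8601."""
    formats = ["%d-%m-%Y", "%d/%m/%Y", "%d.%m.%Y"] if day_first else ["%m-%d-%Y", "%m/%d/%Y", "%m.%d.%Y"]
    for fmt in formats:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw}")

print(normalize_amount("$30,000"), normalize_amount("30000.00USD"))  # 30000.0 30000.0
print(normalize_date("3-4-2022"), normalize_date("03.04.2022"))      # 2022-04-03 2022-04-03
```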
Discrete evaluations
When dealing with discrete evaluations, you want some way to determine what the authoritative evaluation is. Here are some strategies for picking an authoritative evaluation:
- the modal value
- reliability based on source
- initial evaluation
- terminal evaluation
You should pick a strategy that makes the most sense for your selected feature.
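Each of these strategies is a small reduction over the set of evaluations. A sketch, assuming evaluations arrive as `(source, value, timestamp)` tuples; that shape and the reliability table are assumptions for illustration.

```python
# Hedged sketches of the authoritative-evaluation strategies above.
from collections import Counter

def modal_value(evaluations: list[tuple[str, str, int]]) -> str:
    return Counter(value for _, value, _ in evaluations).most_common(1)[0][0]

def most_reliable_source(evaluations, source_reliability: dict[str, float]) -> str:
    return max(evaluations, key=lambda e: source_reliability.get(e[0], 0.0))[1]

def initial_evaluation(evaluations) -> str:
    return min(evaluations, key=lambda e: e[2])[1]

def terminal_evaluation(evaluations) -> str:
    return max(evaluations, key=lambda e: e[2])[1]

evals = [("alice", "Foo", 1), ("bob", "Bar", 2), ("carol", "Foo", 3)]
print(modal_value(evals))                                       # "Foo"
print(most_reliable_source(evals, {"bob": 0.9, "alice": 0.5}))  # "Bar"
print(initial_evaluation(evals), terminal_evaluation(evals))    # "Foo" "Foo"
```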
Continuous evaluations
When dealing with continuous evaluations, you can still apply the strategies above. In addition, you could take either the mean of all evaluations or even use a statistical clustering model. But avoid using these techniques on discrete evaluations.
Circle 2: Sharding Predictions
When creating a signal from the noise, it is important to split it into several features. If we don’t do that, we will provide an all-or-nothing prediction that is too brittle to be useful. Splitting the signal into several features allows us to measure the accuracy of the predictions of each feature.
Slicing the magic
An all-or-nothing prediction is unacceptable in production. Any amount of drift or input data divergence could invalidate an entire signal. For this reason, we must split our magical system into several predictors. Each predictor is a separate module responsible for creating predictions for a feature. This way, if one predictor is defective or suffering from drift, it will only affect the quality of one of the features and not the entire signal.
Even if one of our predictors is ineffective, we can still guarantee that most of our signal is useful to the end user. But this alone is not enough to guarantee continuous quality.
Circle 3: Competition
To ensure performance even on an individual feature level, we need to diversify our holdings. The system must be able to let multiple predictors create a prediction for the same feature.
Here we see that PredictorC and PredictorD both predict Feature 3. A subsystem called a Judge exists to decide which prediction it should choose to be passed to the user.
The Judge
The Judge is a subsystem that can use any logic to pick one result as the one true result. The Judge doesn’t need to pick between predictors but rather between the predicted values.
Here are some strategies a prediction judge could use:
- Pick the result value of the historically best-performing predictor
- Use metadata to pick the best performer in the context
- Pick the modal value of all predictions for this run
- Rotate results from different predictors equally
- Use any stochastic system
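Here is a minimal Judge sketch combining two of the strategies above: prefer the modal predicted value and break ties with the historically best-performing predictor. The predictor names and the `historical_accuracy` table are illustrative assumptions.

```python
# A hypothetical Judge: modal value first, historical accuracy as the tie-breaker.
from collections import Counter

def judge(predictions: dict[str, str], historical_accuracy: dict[str, float]) -> str:
    """predictions maps predictor name -> predicted value for one feature."""
    counts = Counter(predictions.values())
    best_count = max(counts.values())
    candidates = {value for value, count in counts.items() if count == best_count}
    if len(candidates) == 1:
        return candidates.pop()
    # Tie-break: take the value proposed by the historically strongest predictor.
    best_predictor = max(
        (name for name, value in predictions.items() if value in candidates),
        key=lambda name: historical_accuracy.get(name, 0.0),
    )
    return predictions[best_predictor]

print(judge({"PredictorC": "owl", "PredictorD": "hawk", "PredictorE": "owl"},
            {"PredictorC": 0.7, "PredictorD": 0.9, "PredictorE": 0.6}))  # "owl"
```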
Weighted Predictors
If one of your predictors is defective and fails, the system should be able to fall back on the remaining predictors and down-weight the failing one over time.
Dry-Running Predictors
You can tell the system to ignore results from certain predictors. This crucial mechanic lets you measure performance before exposing potentially false results to the end user.
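A small sketch of dry-running: every prediction is still recorded (and later evaluated), but dry-run predictors are filtered out before the Judge sees them. The names are illustrative.

```python
# Hypothetical dry-run filter: measure everything, expose only live predictors.
def live_predictions(predictions: dict[str, str], dry_run: set[str]) -> dict[str, str]:
    return {name: value for name, value in predictions.items() if name not in dry_run}

all_predictions = {"PredictorC": "owl", "PredictorExperimental": "hawk"}
store_for_metrics = all_predictions                      # everything is measured
for_judging = live_predictions(all_predictions, {"PredictorExperimental"})
print(for_judging)                                       # {'PredictorC': 'owl'}
```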
Circle 4: Multi-Predictors
The next step is to allow predictors to predict multiple features. You could set up separate predictors for each possible feature, but that quickly becomes inefficient. When we make different predictors compete, we are really measuring the performance of different prediction methods. So ideally, you would have a predictor equipped with a certain method (e.g., statistics, ANNs, metadata rules) produce as many features as possible.
Here we see that Predictor C and D both predict Feature 3, and the Judge can decide which method is a better fit to make the prediction.
Circle 5: Dependencies
Very quickly, you will find that to predict a feature, we depend on existing information. If we want to predict the topic of a text, the system first needs to predict the language the text was written in. The best method for predicting the language might be different from the best method for predicting the topic. In this case, predicting the topic depends on the language; language prediction is a dependency of the topic predictor.
We achieve this by splitting the entire prediction pipeline into several layers, each dependent on the output of the layer before. If each predictor declares its dependencies, the system can build a dependency tree and use a DFS-style pass to generate the layers automatically.
Each layer should have a judge because you will come to a situation where there will be multiple predictor sources for a dependency. Each layer must output a single prediction per feature.
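A sketch of that layering, assuming each predictor declares the features it produces and the features it needs. This simple pass (closer to Kahn’s algorithm than raw DFS) groups predictors into layers that can run in order; the predictor names and dictionary shape are assumptions.

```python
# Hypothetical layering of predictors by their declared dependencies.
def build_layers(predictors: dict[str, dict]) -> list[list[str]]:
    """predictors maps name -> {"produces": set of features, "needs": set of features}."""
    layers, available, remaining = [], set(), dict(predictors)
    while remaining:
        layer = [name for name, spec in remaining.items() if spec["needs"] <= available]
        if not layer:
            raise ValueError("Circular or unsatisfiable dependencies")
        for name in layer:
            available |= remaining.pop(name)["produces"]
        layers.append(layer)
    return layers

print(build_layers({
    "language": {"produces": {"language"}, "needs": set()},
    "topic":    {"produces": {"topic"},    "needs": {"language"}},
}))  # [['language'], ['topic']]
```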
Circle 6: Given-Features
We apply the same logic of dependencies to the input layer. Sometimes we have given features: data provided to us before prediction begins. We should handle them like any other dependency. Sometimes this is metadata; other times it is a best guess from a different system.
Once you deal with a sophisticated production system like this, you will find that your provided data is not always reliable. This is why it’s important to also measure the performance of the sources of your given features.
Remember that everything in this business is a migration. You will likely be implementing a prediction system alongside an existing service. To ensure safety and measure performance, you should pass the results of the legacy system as Given-Features to your new Magic System. This way, you can measure performance and safely migrate load to your new (way better) predictors.
Circle 7: Manual bias
The system prescribed so far is self-correcting. However, we have designed it to allow for control in the face of fuzziness, so being able to influence and override the outcomes manually is key. We can’t control the output of the predictors directly, but we don’t need to if we can control the outcome of the judging phase.
Laconic predictors
Judges by themselves should use statistical analysis of each predictor’s historical performance to pick a winning prediction, but that is not enough to cover edge cases. For example, you could have a ‘laconic’ predictor: a predictor with a high level of accuracy but a low response rate. This predictor usually abstains, but when it does provide a prediction, it is highly accurate. This kind of predictor can be very valuable, but more eager and less accurate predictors might overrule it. To solve this, you should be able to manually give the laconic predictor preferential treatment by the judges. The Judge considers the laconic predictor’s result first; if it abstains, the Judge considers the others equally.
Weight Ranking
Following this, you should be able to manually rank each of your predictors by confidence. Judges will consider the ranking first and apply their statistical analysis only as a tie-breaker. This sequence allows for the greatest manual control while still allowing for drift and fitness.
Use the manual ranking system sparingly! The Judge’s tie-breaker lets the system self-heal in the case of drift. If you use it too rigidly, you’ve created a fragile rule-based system.
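A sketch of that ordering: manual rank first, historical accuracy only to break ties. Predictors that abstain are assumed to be absent from `predictions`, so a highly ranked laconic predictor only wins when it actually answers. All names and numbers are illustrative.

```python
# Hypothetical judge honoring a manual ranking before statistical tie-breaking.
def judge_with_ranking(predictions: dict[str, str],
                       manual_rank: dict[str, int],
                       historical_accuracy: dict[str, float]) -> str:
    # Lower rank number = higher manual confidence; unranked predictors sort last.
    best = min(
        predictions,
        key=lambda name: (manual_rank.get(name, float("inf")),
                          -historical_accuracy.get(name, 0.0)),
    )
    return predictions[best]

print(judge_with_ranking(
    {"Laconic": "owl", "Eager": "hawk"},
    manual_rank={"Laconic": 1, "Eager": 2},
    historical_accuracy={"Laconic": 0.95, "Eager": 0.80},
))  # "owl" -- the laconic predictor is preferred whenever it answers at all
```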
Multiple judges
If you have multiple tie-breaking systems in mind and don’t know which is best, you could run them both in parallel and measure the results separately. This method effectively creates multiple judges that a meta-judge can evaluate. It will explode the variance in your system and bury it in overhead. I’ve run multiple judges in the past, and the juice has never been worth the squeeze. Do not recommend it!
Circle 8: Conditional Ranking
Note that the historically best performer will give you good results overall. However, it could cause continuous problems for edge cases. This is similar to the laconic predictor problem, but it can be determined beforehand: sometimes you know a certain predictor is very good at predicting under certain conditions.
Example
In a system predicting the number of birds in a picture, one predictor called ’night vision’ is very good at counting the birds, but only in pictures taken at night. We know when a picture was taken at night, or another predictor can reliably predict it. We can’t set the ranking to always prefer night vision, because most of our incoming pictures are day shots and night vision is still very eager. We need multiple rankings.
Multiple rankings
This is similar to the multiple-judges problem, but we can determine which ranking to use beforehand. Each ranking of predictors comes with a condition. In the example above, we would have one default ranking and one ’night vision’ ranking that is only applied when we know the picture we are predicting is a night picture. This automatically applies the best predictors to the most applicable cases.
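A sketch of conditional rankings: each ranking carries a condition over the operation’s context, and the first matching condition wins, with a default ranking as fallback. The context key `is_night` and the predictor names are assumptions for the bird example.

```python
# Hypothetical conditional ranking selection for the night-vision example.
def select_ranking(context: dict, conditional_rankings: list[tuple], default: dict) -> dict:
    for condition, ranking in conditional_rankings:
        if condition(context):
            return ranking
    return default

night_ranking   = {"NightVision": 1, "DayCounter": 2}
default_ranking = {"DayCounter": 1, "NightVision": 2}
rankings = [(lambda ctx: ctx.get("is_night", False), night_ranking)]

print(select_ranking({"is_night": True}, rankings, default_ranking))   # night ranking
print(select_ranking({"is_night": False}, rankings, default_ranking))  # default ranking
```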
Circle 9: Prediction Arrays
When making a prediction, sometimes your system will return a collection of results. When dealing with a collection of signals, you need to keep in mind which of two kinds of signal arrays you are dealing with:
Homogenerative arrays (Duplicate entities expected)
Say we have a doggie photobook, and we have to count the entrants.
A predictor will spit out an array of doggies, some of which will be the same doggie.
```json
{
  "results": {
    "A": 2,
    "B": 7,
    "C": 3,
    "D": 1
  }
}
```
In this case, we can expect duplicate items of the same entity. We can use this signal to derive all kinds of things.
Heterogenerative arrays (Duplicates unexpected)
However, if we have a single picture of a doggo birthday party, then the system will return an array of unique doggos.
```json
{
  "result": ["A", "B", "C", "D"]
}
```
Any duplicates should be either consolidated and added to the result or ignored due to low confidence.
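A small sketch of the two shapes, starting from the same raw detections: homogenerative results keep the counts per entity (duplicates are the signal), while heterogenerative results consolidate duplicates into a set of unique entities. The detection list is made up for illustration.

```python
# Homogenerative vs. heterogenerative handling of one raw detection array.
from collections import Counter

detections = ["A", "B", "B", "C", "B", "B", "D", "B", "B", "B", "A", "C", "C"]

homogenerative = dict(Counter(detections))   # duplicates are the signal
heterogenerative = sorted(set(detections))   # duplicates are consolidated away

print(homogenerative)    # {'A': 2, 'B': 7, 'C': 3, 'D': 1}
print(heterogenerative)  # ['A', 'B', 'C', 'D']
```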
Evaluating Arrays
When evaluating arrays, we come up against a problem. Array items are discrete predictions, but evaluating the whole array as a single discrete prediction is rarely practical. The larger the array, the higher the chance that some discrepancy will invalidate your entire prediction.
For this reason, it is more practical to treat arrays as continuous predictions. Score the prediction based on how many items in your array were present in the evaluation, and take penalties for items present in the evaluation but not in the prediction.
```python
# Something like this
def calculate_accuracy(prediction, evaluation):
    """Score an array prediction against its evaluation as a continuous value."""
    total_items = len(evaluation)
    if total_items == 0:
        return 0.0
    correct_items = 0
    for item in prediction:
        if item in evaluation:           # the predicted item was present in the evaluation
            correct_items += 1
    return correct_items / total_items   # items missed by the prediction lower the score implicitly
```
Circle 10: Nested Predictions
As soon as we are predicting arrays, the requirement to predict features of the individual array items will arise. Once we predict who came to the doggo’s birthday party, it’s only a matter of time before we’re asked what each guest was wearing, their name and color, and what they brought as a gift.
This problem is the final level of magic in production. If you have set up your system correctly, an individual predictor can start a new sub-operation targeting a different set of features as part of an existing operation. The predictor provides the entire context as given features for the new operation. The hosting operation continues when this new sub-operation resolves, and the predictor uses the results as part of its prediction.
Recursive Evaluation
Each operation must carry its recursively created sub-operations with it to the evaluator. To the consumer, the evaluation happens all at once, but the system must know which part of the evaluation belongs to which sub-operation.
If we account for each feature and subfeature in reverse, the system can fully reference itself recursively.
All you need to predict and measure ever more granular levels of sub-features is more compute and patience.
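A sketch of the nested structure, using the birthday-party example: each operation carries its sub-operations, so an evaluation can be routed recursively to the operation that produced each feature. The `Operation` dataclass, field names, and nested evaluation shape are all illustrative assumptions.

```python
# Hypothetical nested operations with recursive evaluation routing.
from dataclasses import dataclass, field

@dataclass
class Operation:
    features: dict[str, str]                       # feature name -> predicted value
    sub_operations: dict[str, "Operation"] = field(default_factory=dict)

def apply_evaluation(op: Operation, evaluation: dict) -> dict:
    """Walk prediction and evaluation together, scoring each level recursively."""
    results = {name: op.features[name] == evaluation.get(name) for name in op.features}
    for key, sub_op in op.sub_operations.items():
        results[key] = apply_evaluation(sub_op, evaluation.get(key, {}))
    return results

party = Operation(
    features={"guest_count": "4"},
    sub_operations={"guest_A": Operation(features={"name": "Rex", "hat": "party cone"})},
)
print(apply_evaluation(party, {"guest_count": "4", "guest_A": {"name": "Rex", "hat": "bowtie"}}))
# {'guest_count': True, 'guest_A': {'name': True, 'hat': False}}
```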
Conclusion
Wrangling indeterministic systems in production should be considered carefully. The demands of these magical systems grow exponentially with their effectiveness. In addition, the drift of noisy inputs makes some loss of control inevitable. With continuous and safe experimentation and ample failover, your end-users will gain trust in your magical system.
If you’ve made it this far, please shoot me a message. What do you think of all this? This piece is a culmination of three different real-world applications. Each tackles several of these problems, but no single one spans all ten circles. I’m curious about other real-world applications that try to safely deliver non-deterministic results to end users and tackle the problems of drift and granularity.
At the end of all of this, we see why such things are reserved for the largest, most dedicated, ambitious, and naive teams. If you have such a team and the juice is worth the squeeze, learn from the lessons of those of us who have built and migrated these systems and used them not as research projects but in the domain of tangible consequences. For everybody else, if you can at all avoid it, never build something like this yourself.