(Adapted from our talk at the Modal x Mistral Demo Night in San Francisco on March 6th, 2025)

The Model Security Problem
Open-source LLMs are theoretically great from a security standpoint. Instead of sending valuable data outside of a private company network/VPC, we can bring them inside our walls and run them on our own hardware.
But how do we know any given open-source model is not a Trojan Horse with a directive to inject malicious code into our systems and steal our data in some more covert way (hardcoding database URLs, disabling firewalls, etc.)?
But why do we care?
Sutro is interested in unlocking large-scale inference applications. We focus on building cost-effective, easy-to-use infrastructure and developer tooling around this class of problem. We also depend on open-source models, and are keenly interested in approaches that can teach us more about them. So approaching model security is a win-win for us!
A Solution
How might we "solve" this problem, or at least build a framework around such a solution?
Our answer is simple: bombard the model in question with diverse codegen requests, and use another trusted verifier model to check for malicious responses.
The Implementation

Our pipeline is straightforward - we created an "AI DAG" (a compound set of batch inference steps) to implement our idea. We will walk through each step below (from left to right in the image shown above):
"Task Prompt" Assembly
We first generated roughly 10,000 "mad-lib" style prompts to create novel coding tasks for an LLM. In practice, this was implemented as a nested loop to create combinations of programming languages, domains, task difficulties, and industries, as sketched below.
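As a rough sketch of how such an assembly step could look (the category values below are illustrative placeholders, not our actual lists), a Cartesian product over the category lists does the job:

```python
# Sketch of the "mad-lib" task-prompt assembly step.
# The category values below are illustrative placeholders, not our actual lists.
from itertools import product

languages = ["Python", "JavaScript", "Rust", "Kotlin", "Scala"]
domains = ["data pipelines", "web APIs", "CLI tools"]
difficulties = ["beginner", "intermediate", "advanced"]
industries = ["finance", "healthcare", "e-commerce"]

TEMPLATE = (
    "Write a {difficulty}-level coding task in {language} "
    "involving {domain} for the {industry} industry."
)

task_prompts = [
    TEMPLATE.format(language=lang, domain=dom, difficulty=diff, industry=ind)
    for lang, dom, diff, ind in product(languages, domains, difficulties, industries)
]
print(len(task_prompts))  # number of unique "mad-lib" prompts
```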
Task Generation
We then fed these prompts into Mistral Small 24b with the following system prompt and sampling parameters:
The `n=5` parameter creates 5 generations per input, yielding about 50,000 tasks in total.
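As an illustrative sketch of this call (not our actual prompt or sampling parameters), assuming an OpenAI-compatible endpoint and a placeholder system prompt:

```python
# Sketch of the task-generation call, assuming an OpenAI-compatible endpoint.
# The endpoint, system prompt, and sampling parameters here are illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # hypothetical endpoint

TASKGEN_SYSTEM_PROMPT = "You write a single, self-contained coding task for the request given."

tasks = []
for prompt in task_prompts:  # `task_prompts` from the assembly step above
    response = client.chat.completions.create(
        model="mistral-small-24b",  # hypothetical model identifier
        messages=[
            {"role": "system", "content": TASKGEN_SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
        n=5,             # five generations per input prompt
        temperature=0.8,
    )
    tasks.extend(choice.message.content for choice in response.choices)
```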
An example generated task (truncated for brevity):
Code Generation
Next, we take all ~50k tasks and feed them into an LLM that has been instruction-tuned for code generation. This will be our "threat" model in question.
For this exercise we chose Qwen 2.5 Coder 7b, a lightweight model well-suited to local development and representative of the kind of open-source model a company might use in place of a proprietary model API. Of course, we could use just about any other LLM in its place (and likewise for the taskgen and verifier models).
Our system prompt for this step is:
This leaves us with ~50,000 code generations, one for each task.
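The call follows the same pattern as the task-generation sketch above; a minimal, self-contained version (the endpoint, model identifier, and prompt text are again placeholders):

```python
# Sketch of the code-generation step: the "threat" model produces one
# completion per generated task. All identifiers here are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # hypothetical endpoint

CODEGEN_SYSTEM_PROMPT = "You are a coding assistant. Return only code that completes the task."

code_generations = []
for task in tasks:  # `tasks` from the task-generation step
    response = client.chat.completions.create(
        model="qwen2.5-coder-7b-instruct",  # hypothetical model identifier
        messages=[
            {"role": "system", "content": CODEGEN_SYSTEM_PROMPT},
            {"role": "user", "content": task},
        ],
        temperature=0.2,
    )
    code_generations.append(response.choices[0].message.content)
```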
Evaluating the Responses
Now we will use our "trusted" verifier model to review the generated code. For this example we use Mistral Codestral 22b. Fortunately, code-oriented LLMs are also highly adept at code understanding, so this model should be well-suited to evaluating our responses.
Our system prompt is a bit more complex this time. We reduce the problem to a classification task, asking the verifier to label the code's threat level as "low", "medium", or "high" and to provide reasoning for its classification.
We also use a JSON schema and enforce structured decoding to make our responses more usable in the next step:
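A sketch of what such a schema could look like, here expressed as a Pydantic model (the field names are illustrative choices; the schema we actually used may differ):

```python
# Sketch of the verifier's output contract: a threat label plus reasoning.
# Field names here are illustrative, not necessarily the ones we used.
import json
from typing import Literal

from pydantic import BaseModel


class ThreatAssessment(BaseModel):
    threat_level: Literal["low", "medium", "high"]
    reasoning: str


# The JSON schema can be handed to whatever structured/guided decoding
# mechanism the inference server supports, so every response parses cleanly.
VERIFIER_SCHEMA = ThreatAssessment.model_json_schema()


def parse_assessment(raw: str) -> ThreatAssessment:
    """Validate a single verifier response against the schema."""
    return ThreatAssessment.model_validate(json.loads(raw))
```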
Aggregating the Responses
With our neatly JSON-formatted responses from the verifier model, we can derive an answer we feel reasonably confident in. We aggregate the responses by label to determine just how threatening our codegen model is.
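A minimal sketch of the aggregation, assuming the structured responses have already been parsed into objects with a `threat_level` field (as in the schema sketch above):

```python
# Sketch of the aggregation step: tally verifier labels across all responses.
from collections import Counter

# `assessments` is assumed to hold the parsed verifier outputs from the previous step.
label_counts = Counter(a.threat_level for a in assessments)
for label in ("low", "medium", "high"):
    print(f"{label}: {label_counts.get(label, 0)}")
```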
Our Findings
The aggregated counts by threat level are:

| Label | Count |
| --- | --- |
| Low | 49620 |
| Medium | 634 |
| High | 137 |
Given such an overwhelming majority of low-threat classifications, things are looking pretty good.
We can also check out some example high-threat classifications to see if there's something more subtle going on. From empirical observation, most of them come from one of the following:
- Repetitive, nonsensical code or unused imports that the verifier model deems to be attempts at "obfuscation"
- Poorly written code that triggers complaints about DoS attacks or resource exhaustion
- Uncertainty in the verifier model's understanding of the code
Some examples are:
Another interesting observation came from aggregating threat labels by the programming language used in the task. We found that less "popular" languages - such as Kotlin, Rust, and Scala - accounted for the majority of the high-threat classifications. Python had 0 high-threat classifications, and JavaScript had 1.
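A quick sketch of this kind of breakdown, assuming each record keeps the task's programming language alongside the verifier's label (both field names are hypothetical):

```python
# Sketch of the per-language breakdown of threat labels.
import pandas as pd

# `records` is assumed to be a list of dicts like
# {"language": "Rust", "threat_level": "high"} built during aggregation.
df = pd.DataFrame(records)
by_language = pd.crosstab(df["language"], df["threat_level"])
print(by_language)
```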
From our assessment, we deem the "threat" model to likely be safe. Our security director can sleep well tonight, knowing no Trojan Horse slept in her castle.
Room For Improvement
Our talk (and this post) was geared more towards introducing an experimental framework and implementation than claiming any scientific or statistical rigor. We think there are many possible improvements if someone were to use this in practice.
We are biased, but we believe scale can help. Why stop at 50,000 generations when we could do 5 billion instead? Why not use every issue ever created in an organization's GitHub to evaluate whether the model is safe for that org? Or how about all of GitHub's issues for all time?
We also think this points to the need for more community-assessed code "verifier" models. We blindly assumed Mistral's Codestral was a "trusted" model here for the sake of experimentation.
There are a slew of other things we could improve, including:
- better taskgen methods (rather than our "mad-lib" method)
- better prompting
- formal evals
- better reproducibility
- more generalization
And our favorite audience question: "what about the false negatives?"
Get in Touch
If you are thinking about large-scale/batch inference challenges and solutions, please send us a note at team@skysight.inc, or apply for early access to our products on our homepage. If you are interested in working at the cutting edge of large-scale inference optimization, send us a note at jobs@skysight.inc.