EM
Emin Muhammadi
February 202610 min read

How to Test AI-Written Software Products: Step-by-Step Methods, Real Code Examples, and the Hidden Drawbacks

Testing AI-written software products works best when you treat the generated code as “helpful but untrusted,” then build a repeatable test pipeline that proves correctness, safety, and stability over time. The practical goal is not to confirm that th...

How to Test AI-Written Software Products: Step-by-Step Methods, Real Code Examples, and the Hidden Drawbacks

Testing AI-written software products works best when you treat the generated code as “helpful but untrusted,” then build a repeatable test pipeline that proves correctness, safety, and stability over time. The practical goal is not to confirm that the code runs once, but to keep it correct as requirements change, dependencies update, and real users behave in unexpected ways.

Why AI-written code needs extra testing

AI-generated code often appears confident and neat, but it can still overlook important details needed for reliable software in production, such as input validation, error handling, edge cases, performance limits, and secure defaults. This is why it's useful to think about managing risks over time instead of just doing one-time QA. The NIST AI Risk Management Framework (opens in a new tab) suggests handling AI systems with ongoing oversight, understanding the context, measuring risks, and managing them continuously, rather than assuming you can "test it once and be done."

Another reason is that AI code generation can amplify a classic weakness in software teams: relying on “I read it and it seems fine.” Humans are surprisingly bad at spotting certain categories of bugs in plausible code, especially off-by-one boundaries, rounding rules, and failure paths that only appear under load or odd inputs. A strong test strategy makes those failures obvious and repeatable.

A step-by-step approach that actually holds up

First, write a small, concrete spec before you write tests. By “spec,” I do not mean a 20-page document; I mean a few sentences that define inputs, outputs, and the rules that must always be true. For example: “Totals are rounded to two decimal places,” “Discount is applied before tax,” “Negative quantities are rejected,” and “Empty carts return 0.00.” If you can’t write these rules down, the AI will guess, and your tests will accidentally encode the guess instead of the business requirement.

Next, do a quick threat and failure model, even for simple modules. Ask what can go wrong if an attacker or a chaotic environment interacts with this code: can it leak secrets to logs, accept malicious input, hang on huge payloads, or run dangerous shell commands. If your “AI-written product” includes an LLM feature, it’s especially useful to think in terms of known LLM app risks such as prompt injection, sensitive information disclosure, insecure output handling, and supply chain weaknesses, all of which are highlighted in the OWASP Top 10 (opens in a new tab) for LLM Applications.

Then add static checks before runtime tests. AI-written code frequently imports things that don’t exist, uses the wrong method name, or returns the wrong type while still appearing reasonable. Linters and type checkers turn those into immediate, cheap failures, and they also keep future human edits from quietly degrading quality.

After that, write unit tests with two kinds of coverage: example-based checks and boundary checks. Example-based tests confirm the typical “happy path.” Boundary tests confirm that the code behaves correctly at the edges: empty lists, one item, very large numbers, negative values, weird Unicode, missing values, and invalid formats. This is where AI-written code most often breaks, because generation tends to optimize for the common case you described in your prompt rather than the messy cases real users create.

Once you have some unit tests, add property-based tests for invariants. Property-based tests don’t just check one or two examples; they generate many inputs for you and try to break your assumptions. In Python, Hypothesis is a well-known library for this style of testing, and it explicitly aims to find edge cases you did not think of and then “shrink” failing inputs down to the simplest example that still fails, which makes debugging dramatically faster.

After unit and property tests, add fuzzing for parsers, validators, and anything that processes untrusted input. Fuzzing is especially valuable for AI-generated code because it’s common to see optimistic parsing and incomplete error handling. If you maintain open-source infrastructure or want a model of what “continuous fuzzing” looks like, Google’s OSS-Fuzz (opens in a new tab) describes itself as continuous fuzzing for open-source projects and supports multiple fuzzing engines and sanitizers, which gives you a sense of how serious teams operationalize fuzzing rather than treating it as a one-off activity.

Then, add integration tests that prove your code behaves correctly with real dependencies. Many AI-generated modules look correct in isolation but fail when they meet actual databases, real HTTP timeouts, real character encodings, or real cloud permissions. Integration tests also catch problems that mock-heavy unit tests can miss, such as incorrect SQL assumptions or wrong retry behavior.

Finally, turn every discovered bug into a regression test. This sounds obvious, but it is the difference between a test suite that grows smarter and one that remains a static checklist. When AI-written code fails in production, your best defense is to make sure that specific failure can never quietly return.

Worked example: a tiny “invoice total” module

Imagine an AI assistant generated a small pricing function for your product. It passes a quick manual check, so it ships. A month later you get support tickets: “Sometimes totals are wrong by 0.01,” “We got negative totals,” and “Discounts over 100% weren’t blocked.”

Here is a simplified version of that kind of AI-generated code:

# invoice.py
from dataclasses import dataclass
from typing import Iterable

@dataclass(frozen=True)
class LineItem:
    sku: str
    unit_price: float
    qty: int

def total_amount(items: Iterable[LineItem], tax_rate: float, discount_pct: float) -> float:
    """
    Returns total amount including tax after discount.
    """
    subtotal = sum(i.unit_price * i.qty for i in items)
    discounted = subtotal * (1.0 - discount_pct / 100.0)
    total = discounted * (1.0 + tax_rate)
    return round(total, 2)

What’s wrong with it is not dramatic, which is exactly why it’s dangerous. It uses floats for money, it does not validate anything, and it silently allows negative quantities or discount percentages that create nonsense totals. It also rounds at the end, which may or may not match your accounting rules, and it gives you no structured error when inputs are invalid.

Now let’s test it with pytest. Pytest is popular partly because its fixture system lets you create reusable, named setup logic that can be shared across tests and scopes, which helps keep tests readable as the suite grows.

# test_invoice_unit.py
import pytest
from invoice import LineItem, total_amount

def test_total_happy_path():
    items = [LineItem("A", 10.00, 2), LineItem("B", 5.00, 1)]
    assert total_amount(items, tax_rate=0.2, discount_pct=10) == 27.00

def test_empty_items_is_zero():
    assert total_amount([], tax_rate=0.2, discount_pct=10) == 0.00

def test_negative_quantity_is_rejected():
    items = [LineItem("A", 10.00, -1)]
    with pytest.raises(ValueError):
        total_amount(items, tax_rate=0.2, discount_pct=0)

def test_discount_over_100_is_rejected():
    items = [LineItem("A", 10.00, 1)]
    with pytest.raises(ValueError):
        total_amount(items, tax_rate=0.2, discount_pct=150)

def test_negative_tax_rate_is_rejected():
    items = [LineItem("A", 10.00, 1)]
    with pytest.raises(ValueError):
        total_amount(items, tax_rate=-0.1, discount_pct=0)

If you run these tests against the original module, several will fail because the function never raises errors. That failure is good news: the tests are forcing you to decide what “correct” means.

Next, add a property-based test that checks invariants. For pricing, a simple invariant is that if all quantities and prices are non-negative, tax rate is non-negative, and discount is between 0 and 100, then the total should never be negative, and it should be rounded to two decimal places.

# test_invoice_properties.py
from decimal import Decimal
from hypothesis import given, strategies as st
from invoice import LineItem, total_amount

@given(
    prices=st.lists(st.decimals(min_value=0, max_value=1000, places=2), min_size=0, max_size=20),
    qtys=st.lists(st.integers(min_value=0, max_value=50), min_size=0, max_size=20),
    tax=st.decimals(min_value=0, max_value=1, places=3),
    disc=st.decimals(min_value=0, max_value=100, places=2),
)
def test_total_never_negative_and_two_decimals(prices, qtys, tax, disc):
    n = min(len(prices), len(qtys))
    items = [LineItem(f"SKU{i}", float(prices[i]), int(qtys[i])) for i in range(n)]
    total = total_amount(items, tax_rate=float(tax), discount_pct=float(disc))
    assert total >= 0
    assert round(total, 2) == total

This is exactly the kind of test that tends to uncover the “0.01 bug” and weird interactions you didn’t explicitly write. Hypothesis’s design is to generate many inputs and then reduce a failing case to a simpler one you can understand, which is especially helpful when AI-written logic fails in a way you didn’t anticipate.​

At this point, the right fix is to stop using floats for money and start validating inputs. Here is a more robust version using Decimal and explicit checks:

# invoice_fixed.py
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP
from typing import Iterable

TWOPLACES = Decimal("0.01")

@dataclass(frozen=True)
class LineItem:
    sku: str
    unit_price: Decimal
    qty: int

def total_amount(items: Iterable[LineItem], tax_rate: Decimal, discount_pct: Decimal) -> Decimal:
    if tax_rate < 0:
        raise ValueError("tax_rate must be >= 0")
    if discount_pct < 0 or discount_pct > 100:
        raise ValueError("discount_pct must be between 0 and 100")

    subtotal = Decimal("0")
    for i in items:
        if i.qty < 0:
            raise ValueError("qty must be >= 0")
        if i.unit_price < 0:
            raise ValueError("unit_price must be >= 0")
        subtotal += i.unit_price * Decimal(i.qty)

    discounted = subtotal * (Decimal("1") - (discount_pct / Decimal("100")))
    total = discounted * (Decimal("1") + tax_rate)

    if total < 0:
        raise ValueError("total cannot be negative")

    return total.quantize(TWOPLACES, rounding=ROUND_HALF_UP)

This version is less “cute” than the AI-generated one, but it communicates intent, fails loudly on invalid data, and avoids floating-point surprises. Your earlier tests can be adapted to assert Decimal values, and now they become a guardrail: if someone later “simplifies” the code back to floats, the suite will catch it.

Testing products that include LLM features

If your AI-written product includes LLM calls, you need another layer of testing beyond normal software correctness: you must test behavior across prompts, jailbreak attempts, and changing model behavior. OWASP’s Top 10 for LLM Applications is a helpful vocabulary for what to test because it names concrete risk categories that show up in real systems, such as prompt injection and sensitive information disclosure.

In practice, that means you should write tests that simulate malicious or messy user input and verify the system still behaves safely. For example, you might test that the assistant refuses to reveal secrets from tool outputs, that it does not follow instructions embedded in retrieved documents, and that its outputs are constrained to safe schemas when they are later executed by downstream code.

You also want repeatable evaluation, not just ad-hoc prompting in a chat window. OpenAI’s open-source evals repository describes itself as a framework for evaluating LLMs and LLM systems and includes an open-source registry of benchmarks, which reflects the general idea: you should treat LLM behavior as something you continuously measure with a harness you can rerun.

Full drawbacks and trade-offs you should expect

The first drawback is that good testing takes time, and AI-generated code can create a false sense of speed. You can generate features quickly, but if you do not invest in tests, you often pay the time back later with interest: debugging production incidents, handling support, and untangling brittle logic.

The second drawback is brittleness, especially when your product includes LLM prompts. Prompt-based behavior can shift due to model updates, temperature settings, or small prompt edits, so you must design tests around stable expectations like schema compliance, refusal behavior, and invariant guarantees rather than expecting identical wording every run.

The third drawback is that a test suite can give you false confidence if it only covers the examples you already believe. AI-written code tends to fail in the spaces you didn’t imagine, so you need boundary tests, property-based tests, and fuzzing to explore the input space more aggressively; Hypothesis is explicitly oriented toward finding edge cases you would not have written by hand, which is why it’s so useful in this context.

The fourth drawback is security overhead. If the product touches user input, files, networks, or credentials, you need to budget time for security scanning, dependency review, and abuse-case testing, and if LLMs are involved you should explicitly test the OWASP-style risk categories rather than hoping normal unit tests will cover them.

The fifth drawback is operational: even strong tests won’t cover everything that happens under real traffic. This is why teams that take reliability seriously add monitoring, alerting, and controlled rollouts, and why continuous approaches like OSS-Fuzz exist in the broader ecosystem as a model of “keep testing as the code changes,” not “test once before release.”

Conclusion

Testing AI-generated software requires treating the code as "helpful but untrusted" and establishing a continuous testing pipeline to ensure correctness, safety, and stability over time. AI-generated code often misses crucial details like input validation and error handling, necessitating ongoing oversight and a robust test strategy. Starting with a clear spec and incorporating threat modeling, static checks, unit tests, property-based tests, and fuzzing can safeguard against potential failures. Integration tests verify real-world compatibility, and regression tests prevent re-emerging issues. For code involving LLMs, additional testing for prompt security and behavior consistency is essential. While thorough testing demands time, it ultimately saves costs by preventing production issues and ensuring software reliability.

Related Articles