Detection Accuracy
We publish our accuracy data so you can make informed decisions. Every number on this page comes from automated testing against a labeled corpus of known human and AI-generated text.
Pro Tier — Neural Model
The Pro tier uses a fine-tuned neural model purpose-built for AI detection. It achieves near-perfect AI detection (99.9%) while correctly passing 82.1% of human text. The main source of false positives is formal academic and clinical writing, where human prose closely resembles AI output.
Pro vs Free Tier Comparison
| Metric | Pro (Neural) | Free (Heuristic) |
|---|---|---|
| AI correctly flagged | 99.9% | 23.1% |
| Human correctly passed | 82.1% | 68.9% |
| False positive rate (human text flagged) | 17.9% | 31.1% |
| Human avg score | 84.1 | 77.9 |
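The false positive rate in the table is simply the complement of the human pass rate. A quick sanity check (the function name is ours, not part of any published API):

```python
# False positive rate = 100% minus the human pass rate.
def false_positive_rate(human_pass_pct: float) -> float:
    return round(100.0 - human_pass_pct, 1)
```

For example, `false_positive_rate(82.1)` gives `17.9`, matching the Pro row above.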
Free Tier — Heuristic Scanner
The free tier uses a 5-signal heuristic scorer that checks burstiness, AI vocabulary, perplexity, structural patterns, and human texture markers. It provides a fast initial estimate but has limited accuracy — especially on formal writing where human and AI styles overlap. For reliable results, upgrade to the Pro neural model.
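The heuristic scorer combines its five signals additively. The sketch below is illustrative only: the signal names mirror the list above, but the sub-score ranges and equal weighting are our assumptions, not the published formula.

```python
# Hypothetical sketch of the 5-signal additive heuristic. Assumes each
# signal yields a 0-100 "human-likeness" sub-score, weighted equally.
# The real weights and signal internals are not published here.
EXPECTED_SIGNALS = {
    "burstiness",            # sentence length variance
    "ai_vocabulary",         # AI vocabulary density
    "perplexity_proxy",
    "structural_patterns",
    "human_texture",         # human texture markers
}

def heuristic_score(signals: dict[str, float]) -> float:
    if set(signals) != EXPECTED_SIGNALS:
        raise ValueError(f"expected signals: {sorted(EXPECTED_SIGNALS)}")
    # Additive combination: a simple mean of the five sub-scores.
    return sum(signals.values()) / len(signals)
```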
Human Text — By Source
How often does each type of human writing pass without being falsely flagged?
| Source | Samples | Pro (Neural) | Free (Heuristic) |
|---|---|---|---|
| Casual online writing | 447 | 99.3% | 69.4% |
| Academic papers | 226 | 63.7% | 51.8% |
| Non-native English | 196 | 95.4% | 77.6% |
| Clinical / medical | 178 | 34.3% | 75.8% |
| Creative fiction | 160 | 90.0% | 80.6% |
| Articles / essays | 61 | 100% | 50.8% |
Formal writing (academic papers, clinical text) has higher false positive rates across all AI detectors — formal prose shares structural features with AI output. Auto tone detection adjusts thresholds for detected domains (Academic, Medical, Legal, etc.) to reduce these false positives.
AI Text — By Model
How often does AI-generated text from each model get correctly flagged?
| AI Model | Samples | Pro (Neural) | Free (Heuristic) |
|---|---|---|---|
| AI Model A | 750 | 100% | 10.1% |
| AI Model B | 750 | 100% | 36.5% |
| AI Model C | 750 | 99.9% | 23.1% |
| AI Model D | 540 | 100% | 21.5% |
| AI (mixed models) | 75 | 98.7% | 29.3% |
Methodology
Corpus
4,133 labeled text samples: 1,268 known human and 2,865 known AI-generated. Human samples span casual writing, academic papers, clinical abstracts, creative fiction, and non-native English. AI samples are generated by multiple leading models across varied writing styles and topics.
Date filtering
All human samples are sourced from before January 2023 to ensure they predate widespread LLM usage. This gives us high confidence that "human" labels are accurate.
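The date filter described above reduces to a single comparison against the January 2023 cutoff (function and constant names are ours):

```python
from datetime import date

# Human samples must predate widespread LLM usage (see policy above).
LLM_CUTOFF = date(2023, 1, 1)

def is_valid_human_sample(published: date) -> bool:
    """Accept a human-labeled sample only if it predates the cutoff."""
    return published < LLM_CUTOFF
```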
Models
The Pro neural model is a fine-tuned transformer trained on a curated corpus of verified human and AI-generated text. The free heuristic scorer uses five additive signals: sentence length variance (burstiness), AI vocabulary density, perplexity proxy, structural patterns, and human texture markers.
Scoring
Each sample is scored on a 0–100 scale (higher = more human). Samples scoring above 75 are classified as "Human," 46–75 as "Mixed," and 45 or below as "AI Detected." A human sample is "correctly passed" only if it scores above 75. An AI sample is "correctly flagged" if it scores 75 or below; that is, both "Mixed" and "AI Detected" results count as flags.
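The scoring rules above translate directly into a small classifier. This sketch assumes the default thresholds (before any tone adjustment):

```python
def classify(score: float) -> str:
    """Map a 0-100 score (higher = more human) to a verdict."""
    if score > 75:
        return "Human"
    if score > 45:
        return "Mixed"
    return "AI Detected"

def correctly_passed(human_sample_score: float) -> bool:
    # A human sample passes only with a "Human" verdict.
    return classify(human_sample_score) == "Human"

def correctly_flagged(ai_sample_score: float) -> bool:
    # Any non-"Human" verdict (Mixed or AI Detected) counts as a flag.
    return classify(ai_sample_score) != "Human"
```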
Known limitations
The neural model struggles with formal academic and clinical writing, where human prose closely resembles AI output. The corpus skews toward English-language text. AI writing styles evolve; we re-run this benchmark periodically and retrain the model as needed.
This data is generated automatically and updates when we retrain or expand our test corpus. We believe publishing accuracy data — including where we fall short — builds more trust than marketing claims.