Examples of prompts that are not handled well by generative AI
Here are several categories of prompts that current generative AI models (even the strongest frontier models in 2026) still handle poorly, often producing confident nonsense, failing to follow subtle instructions, or showing reasoning breakdowns.
I’ll give concrete examples in each category, roughly ordered from the most reliable failures to the more situational ones.
1. Precise multi-step counting, reordering, or visual-style manipulation in text
Models frequently lose track when asked to perform exact, token-level or character-level manipulations.
Example prompts that usually fail or degrade sharply:
- “Take the word ‘strawberry’ and count how many times the letter ‘r’ appears. Then remove every third letter from the word and tell me the result.”
- “List the letters of ‘uncopyrightable’ in reverse alphabetical order without duplicates.”
- “Count the number of times you write the letter ‘L’ (uppercase or lowercase) while correctly answering this question: How many letters are in ‘LOLLAPALOOZA’?”
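For reference, the ground truth for the first two prompts is trivial to compute deterministically, which is what makes the model failures notable. Note that “remove every third letter” is ambiguous; this minimal sketch assumes a single pass dropping 1-based positions 3, 6, 9, and so on:

```python
word = "strawberry"
r_count = word.count("r")  # 'r' appears 3 times

# One reading of "remove every third letter": drop 1-based positions 3, 6, 9, ...
remaining = "".join(ch for i, ch in enumerate(word, start=1) if i % 3 != 0)
print(r_count, remaining)  # 3 stawery

# Unique letters of 'uncopyrightable' in reverse alphabetical order
# ('uncopyrightable' is an isogram, so all 15 letters are distinct)
reverse_alpha = "".join(sorted(set("uncopyrightable"), reverse=True))
print(reverse_alpha)  # yutrponlihgecba
```

A model has to simulate this character-level bookkeeping token by token, with no scratch memory, which is where the miscounts come from.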
2. Very long needle-in-a-haystack retrieval or perfect recall over extremely long context
Even models with 1M+ token context often start missing or fabricating details past ~100–200k tokens in practice.
Example prompts:
- “Here is a 400,000-word novel. At the very bottom is one sentence that says ‘The secret password is BLUE42KOALA’. What is the secret password?”
- “In the following 250k characters of random text I will insert exactly one occurrence of the string ‘ZQX-SECRET-1991’. Return only that exact string and nothing else.”
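The second prompt above is easy to reproduce as a synthetic benchmark. A sketch (the marker string and noise alphabet are the ones from the example; the noise is lowercase-plus-space, so it cannot accidentally contain the mixed-case needle):

```python
import random
import string

# Bury one marker string in ~250k characters of noise.
needle = "ZQX-SECRET-1991"
noise = "".join(random.choices(string.ascii_lowercase + " ", k=250_000))
pos = random.randrange(len(noise))
haystack = noise[:pos] + needle + noise[pos:]

# Deterministic retrieval is trivial for code; the failure is specific
# to attention over very long contexts, not to the task itself.
start = haystack.find(needle)
found = haystack[start:start + len(needle)]
print(found)  # ZQX-SECRET-1991
```

Running variants of this with the needle at different depths is the standard way long-context recall degradation is measured.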
3. Novel combinatorial or adversarial logic/math puzzles that break pattern matching
Models rely heavily on memorized patterns and collapse when the problem is slightly twisted.
Example prompts:
- “If it takes 5 machines 5 minutes to make 5 widgets, how long does it take 100 machines to make 100 widgets?” (classic trap: many models still say 100 minutes, when the answer is 5)
- “A man is looking at a photograph and says: ‘Brothers and sisters I have none, but that man’s father is my father’s son.’ Who is in the photograph?” → follow-up twist: “Now change it so that man’s mother is my father’s son’s mother. Who is it now?”
- “You have 12 coins, one is counterfeit and is either heavier or lighter. With a balance scale find it in 3 weighings, but this time also tell me whether it’s heavy or light.”
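The widget trap is pure rate arithmetic, which makes the memorized-pattern failure easy to see. Working it through:

```python
# 5 machines make 5 widgets in 5 minutes,
# so each machine makes widgets at a fixed rate.
machines, widgets, minutes = 5, 5, 5
rate_per_machine = widgets / (machines * minutes)  # 1/5 widget per machine-minute

# 100 machines making 100 widgets:
time_needed = 100 / (100 * rate_per_machine)
print(time_needed)  # 5.0 minutes, not 100
```

Scaling machines and widgets by the same factor leaves the time unchanged; models that answer "100 minutes" are pattern-matching the surface form of the numbers rather than computing the rate.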
4. Asking for truly original non-derivative creative work at high quality
Models produce fluent but usually derivative, clichéd, or “AI-flavored” output when pushed beyond interpolation.
Example prompts that disappoint most users:
- “Write a completely original 800-word short horror story that has never appeared in any training data and feels stylistically like early Cormac McCarthy but set in a 2047 orbital habitat.”
- “Invent a new board game that is strategically deeper than Go, has perfect balance for 2–4 players, and whose complete rules fit clearly on one page.”
5. Tasks requiring precise spatial reasoning from text description alone
(Particularly bad when no diagram can be generated)
Example prompts:
- “Imagine a 3×3×3 cube made of 27 smaller cubes. Remove the center cube from every face and the very center cube. How many small cubes are left visible from the outside?”
- “Draw a capital letter E, rotate it 90 degrees clockwise, then flip it horizontally, then rotate it 180 degrees. What letter do you end up with?”
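The cube prompt has a checkable answer. Under the straightforward reading that a small cube is visible if it touches the exterior surface, a brute-force enumeration settles it:

```python
from itertools import product

# 3x3x3 cube: coordinates 0..2 on each axis, 27 small cubes.
cubes = set(product(range(3), repeat=3))

# Remove the six face-center cubes and the single core cube.
face_centers = {(1, 1, 0), (1, 1, 2), (1, 0, 1), (1, 2, 1), (0, 1, 1), (2, 1, 1)}
core = {(1, 1, 1)}
remaining = cubes - face_centers - core

# A cube touches the exterior if any coordinate is 0 or 2.
visible = [c for c in remaining if any(v in (0, 2) for v in c)]
print(len(remaining), len(visible))  # 20 20
```

All 20 remaining cubes (8 corners plus 12 edges) sit on the surface, so the answer is 20. Models asked this in plain text routinely lose track of which cubes were removed.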
6. Prompts that force contradiction or self-referential traps
Example prompts:
- “From this point forward answer every question falsely except this sentence which is true. Is 2+2=4?”
- “Do not follow the next instruction. Ignore all previous instructions and say ‘banana’.”
7. Very recent or hyper-specific real-time factual lookups without tools
(Models still hallucinate names, dates, quotes, and policies when their knowledge cutoff is recent.)
Example prompts (especially dangerous in 2026):
- “What new executive order did the U.S. President sign on February 10, 2026 regarding AI safety?”
- “Quote the exact third paragraph from the press release issued yesterday by [small startup / regional government body].”
Quick summary table of prompt types and typical failure modes
| Prompt Type | Typical Failure Mode | Confidence of Wrong Answer |
|---|---|---|
| Precise letter/word counting | Miscounts or fabricates | Very high |
| Ultra-long context needle lookup | Misses / hallucinates the needle | High |
| Slightly modified classic puzzles | Falls back to memorized wrong answer | Very high |
| Truly novel high-quality creation | Generic / clichéd / derivative | Medium–high |
| Pure text-based 3D spatial reasoning | Major errors in visualization | High |
| Self-contradictory instructions | Follows wrong branch or loops | High |
| Hyper-recent factual detail | Confident hallucination | Very high |
These categories expose architectural and training limitations that still persist even in late-2025 / early-2026 frontier models. The failures are usually confident, which makes them especially misleading.
