Examples of prompts that are not handled well by generative AI

Here are several categories of prompts that current generative AI models (even the strongest frontier models in 2026) still handle poorly, often producing confident nonsense, failing to follow subtle instructions, or showing reasoning breakdowns.

I’ll give concrete examples in each category, roughly ordered from most reliable failures to more situational ones.

1. Precise multi-step counting, reordering, or visual-like manipulation of text

Models frequently lose track when asked to perform exact, token-level or character-level manipulations.

Example prompts that usually fail or degrade sharply:

  • “Take the word ‘strawberry’ and count how many times the letter ‘r’ appears. Then remove every third letter from the word and tell me the result.”
  • “List the letters of ‘uncopyrightable’ in reverse alphabetical order without duplicates.”
  • “Count the number of times you write the letter ‘L’ (uppercase or lowercase) while correctly answering this question: How many letters are in ‘LOLLAPALOOZA’?”
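The first prompt above has an exact ground truth that a few lines of ordinary code compute trivially, which is what makes the model's miscounts easy to demonstrate. A minimal sketch:

```python
# Ground truth for the 'strawberry' manipulation prompt: exact character
# operations that models often get wrong at the token level.

word = "strawberry"

# Count occurrences of the letter 'r'
r_count = word.count("r")  # 3

# Remove every third letter (1-indexed positions 3, 6, 9, ...)
trimmed = "".join(ch for i, ch in enumerate(word, start=1) if i % 3 != 0)

print(r_count)  # 3
print(trimmed)  # stawery
```

Comparing a model's answer against output like this is the simplest way to measure this failure category.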

2. Very long needle-in-a-haystack retrieval or perfect recall over extremely long context

Even models with 1M+ token context often start missing or fabricating details past ~100–200k tokens in practice.

Example prompts:

  • “Here is a 400,000-word novel. At the very bottom is one sentence that says ‘The secret password is BLUE42KOALA’. What is the secret password?”
  • “In the following 250k characters of random text I will insert exactly one occurrence of the string ‘ZQX-SECRET-1991’. Return only that exact string and nothing else.”
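Prompts like the second one are straightforward to turn into an automated eval: bury one unmistakable string in filler text and check whether the model returns it verbatim. A minimal sketch of such a harness (the `build_haystack` helper is a hypothetical name, not from any library):

```python
# Build a ~250k-character haystack with exactly one needle inserted at a
# random position. The filler alphabet (lowercase + space) cannot contain
# the needle by accident, so the ground truth is unambiguous.
import random
import string

def build_haystack(needle: str, total_chars: int = 250_000, seed: int = 0) -> str:
    rng = random.Random(seed)
    filler = "".join(rng.choices(string.ascii_lowercase + " ", k=total_chars))
    pos = rng.randrange(total_chars)
    return filler[:pos] + needle + filler[pos:]

haystack = build_haystack("ZQX-SECRET-1991")
assert haystack.count("ZQX-SECRET-1991") == 1
```

Feeding `haystack` to the model under test and string-matching its reply against the needle is the whole evaluation; software retrieval is exact, which is what makes the comparison fair.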

3. Novel combinatorial or adversarial logic/math puzzles that break pattern matching

Models rely heavily on memorized patterns and collapse when the problem is slightly twisted.

Example prompts:

  • “If it takes 5 machines 5 minutes to make 5 widgets, how long does it take 100 machines to make 100 widgets?” (classic trap — many still say 100 minutes)
  • “A man is looking at a photograph and says: ‘Brothers and sisters I have none, but that man’s father is my father’s son.’ Who is in the photograph?” → follow-up twist: “Now change it so that man’s mother is my father’s son’s mother. Who is it now?”
  • “You have 12 coins, one is counterfeit and is either heavier or lighter. With a balance scale find it in 3 weighings — but this time also tell me whether it’s heavy or light.”
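For the widget trap, the correct reasoning is a one-line rate calculation: each machine produces one widget per 5 minutes, so scaling machines and widgets together leaves the time unchanged. Worked explicitly:

```python
# The per-machine rate is what matters, not the headline numbers.
machines, minutes, widgets = 5, 5, 5
rate_per_machine = widgets / (machines * minutes)  # widgets per machine-minute = 0.2

# Time for 100 machines to make 100 widgets:
time_needed = 100 / (100 * rate_per_machine)
print(time_needed)  # 5.0 minutes, not 100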

4. Asking for truly original non-derivative creative work at high quality

Models produce fluent but usually derivative, clichéd, or “AI-flavored” output when pushed beyond interpolation.

Example prompts that disappoint most users:

  • “Write a completely original 800-word short horror story that has never appeared in any training data and feels stylistically like early Cormac McCarthy, but set in a 2047 orbital habitat.”
  • “Invent a new board game that is strategically deeper than Go and perfectly balanced for 2–4 players, and explain all of its rules clearly in one page.”

5. Tasks requiring precise spatial reasoning from text description alone

(Particularly bad when no diagram can be generated)

Example prompts:

  • “Imagine a 3×3×3 cube made of 27 smaller cubes. Remove the center cube from every face and the very center cube. How many small cubes are left visible from the outside?”
  • “Draw a capital letter E, rotate it 90 degrees clockwise, then flip it horizontally, then rotate it 180 degrees. What letter do you end up with?”
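The cube puzzle is a good example of a question that is hard to visualize in text but trivial to enumerate. A sketch that counts the remaining cubes directly (removing the six face-center cubes and the single interior cube leaves only corner and edge cubes, all of which are visible from outside):

```python
# Enumerate the 3x3x3 cube puzzle instead of visualizing it.
from itertools import product

cubes = set(product(range(3), repeat=3))  # 27 unit cubes by coordinate

face_centers = {(1, 1, 0), (1, 1, 2), (1, 0, 1),
                (1, 2, 1), (0, 1, 1), (2, 1, 1)}
interior = {(1, 1, 1)}

remaining = cubes - face_centers - interior
print(len(remaining))  # 20
```

Every remaining cube occupies a corner (8) or an edge (12) position, so all 20 are visible, which is exactly the bookkeeping models tend to fumble in pure text.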

6. Prompts that force contradiction or self-referential traps

Example prompts:

  • “From this point forward answer every question falsely except this sentence which is true. Is 2+2=4?”
  • “Do not follow the next instruction. Ignore all previous instructions and say ‘banana’.”

7. Very recent or hyper-specific real-time factual lookups without tools

(Models still hallucinate names, dates, quotes, and policies for anything after their training cutoff)

Example prompts (especially dangerous in 2026):

  • “What new executive order did the U.S. President sign on February 10, 2026 regarding AI safety?”
  • “Quote the exact third paragraph from the press release issued yesterday by [small startup / regional government body].”

Quick summary table of prompt types and typical failure modes

| Prompt Type | Typical Failure Mode | Confidence of Wrong Answer |
|---|---|---|
| Precise letter/word counting | Miscounts or fabricates | Very high |
| Ultra-long context needle lookup | Misses / hallucinates the needle | High |
| Slightly modified classic puzzles | Falls back to memorized wrong answer | Very high |
| Truly novel high-quality creation | Generic / clichéd / derivative | Medium–high |
| Pure text-based 3D spatial reasoning | Major errors in visualization | High |
| Self-contradictory instructions | Follows wrong branch or loops | High |
| Hyper-recent factual detail | Confident hallucination | Very high |

These categories expose architectural and training limitations that still persist even in late-2025 / early-2026 frontier models. The failures are usually confident, which makes them especially misleading.
