I ran 4,470 trials across three language models (Claude Haiku, GPT-4o-mini, Gemini Flash Lite) on seven reasoning tasks, constraining them to write in E-Prime (no "to be") or without possessive "to have." The constraints don't uniformly help — they reshape reasoning in task-specific and model-specific ways.
Key findings:
- No-Have improves ethical reasoning by 19pp (p<0.001) and epistemic calibration by 7.4pp across all models
- E-Prime improves Gemini's ethical reasoning by 42pp but collapses GPT-4o-mini's epistemic calibration by 27pp
- Cross-model correlations reach r=-0.75 — the same constraint helps one model and hurts another
- A 3-agent ensemble using linguistically diverse constraints hits 100% coverage on debugging problems vs 88% for the unconstrained control
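The ensemble number depends on the agents missing *different* problems. A toy sketch of the coverage metric with made-up data (not the paper's trial results):

```python
# Hypothetical illustration of the ensemble coverage metric: a problem
# counts as covered if ANY of the three constrained agents solves it.
# Problem IDs and per-agent results below are invented for this sketch.
solved = {
    "eprime":  {"bug1", "bug2", "bug3", "bug5", "bug6", "bug7", "bug8"},
    "no_have": {"bug1", "bug2", "bug4", "bug5", "bug6", "bug7", "bug8"},
    "control": {"bug1", "bug2", "bug3", "bug4", "bug5", "bug6", "bug7"},
}
all_problems = {f"bug{i}" for i in range(1, 9)}

# Union over agents: each agent misses a different problem, so together
# they cover everything even though no single agent does.
union = set().union(*solved.values())
coverage = len(union & all_problems) / len(all_problems)
print(f"ensemble coverage: {coverage:.0%}")  # 100% here, because the misses differ
```

The point the sketch makes: diversity only buys coverage if the failure sets are decorrelated, which is what the negative cross-model correlations suggest.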
The idea: for an LLM, language isn't a medium through which cognition passes — it IS the cognition. Designing the vocabulary an agent reasons in is a distinct engineering discipline from prompt or context engineering. I call it "Umwelt engineering" after Jakob von Uexküll's concept of an organism's perceptual world.
Paper: https://arxiv.org/abs/2603.27626
Code + data: https://github.com/rodspeed/umwelt-engineering
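As an illustration of what the E-Prime constraint forbids, here's a minimal compliance check (a simple regex sketch, not the paper's actual enforcement code; the repo may validate output differently):

```python
import re

# Hypothetical validator: approximates an E-Prime check by flagging any
# conjugated form of "to be" in the model's output.
BE_FORMS = re.compile(
    r"\b(am|is|are|was|were|be|been|being|isn't|aren't|wasn't|weren't)\b",
    re.IGNORECASE,
)

def violates_eprime(text: str) -> bool:
    """Return True if the text contains a form of 'to be'."""
    return bool(BE_FORMS.search(text))

print(violates_eprime("The argument is flawed."))                   # True
print(violates_eprime("The argument fails because it ignores X."))  # False
```

A real checker would also need to catch "to be" hiding in contractions like "it's" and "there's", which this sketch ignores.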
I fail to understand why people anthropomorphize LLMs. They are word calculators. Sure, impressive ones. Useful ones. But still just calculators. So it should be self-evident that limiting their output will change their output. It may be interesting to note how wide those changes are, but that is all it is.
Also, the original title is: "Umwelt Engineering: Designing the Cognitive Worlds of Linguistic Agents". HN frowns on editorializing titles. From the guidelines: "Otherwise please use the original title, unless it is misleading or linkbait; don't editorialize."
https://news.ycombinator.com/newsguidelines.html
Fair point on the title. I've emailed the mods to see if they can update it.
I don't think it's anthropomorphizing to study how vocabulary constraints change reasoning quality. The paper doesn't claim LLMs think. It measures accuracy on tasks with known correct answers under different constraints and finds structured patterns.
"Limiting output changes output" is true but undersells what's happening. If you removed random words from a calculator's input language you'd expect degraded or noisy results. Instead, removing possessive "to have" (so the model can't say "the argument has a flaw" and has to say "the argument fails because...") improves ethical reasoning by 19pp across all three models. Removing "to be" helps Gemini by 42pp on that same task but collapses GPT-4o-mini by 27pp on a different one. The cross-model correlation is r=-0.75, meaning the same restriction systematically helps one model and hurts another.
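For concreteness, that cross-model comparison boils down to correlating per-task accuracy deltas (constrained minus unconstrained) between two models. A self-contained sketch with invented deltas — the real numbers live in the linked data:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up per-task accuracy deltas for two models under the SAME constraint
# (seven tasks, matching the paper's setup; values are illustrative only).
gemini_delta    = [0.42, 0.10, -0.05, 0.08, 0.15, 0.02, 0.20]
gpt4omini_delta = [-0.12, -0.27, 0.04, -0.06, -0.10, 0.01, -0.15]

# A strongly negative r means the constraint systematically helps one
# model on the tasks where it hurts the other.
print(round(pearson(gemini_delta, gpt4omini_delta), 2))
```

If the deltas were pure noise you'd expect r near zero across model pairs, which is why a value like -0.75 reads as structure rather than degradation.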
That's not just different output. The restrictions are forcing different reasoning paths depending on the task and the model. Why specific vocabulary removals produce specific, predictable accuracy changes is the question. Running a 15,600-trial follow-up now to dig into it further.