
Testing if "bash is all you need"

Models inherit shell fluency from coding-heavy training data

Setting up the eval

Initial results

Sophisticated shell scripting that didn't translate to accuracy

Debugging the results

The hybrid approach

Key learnings

The hybrid approach matched SQL on accuracy while adding self-verification

What this means for agent design

Run your own benchmarks
