Hi HN, I've been spending some time lately building Reinforcement Learning environments and training small language models, and I wanted to share a little course I put together based on my experiments.
Over the past year, we've seen a shift in LLM Post-Training. Previously, Supervised Fine-Tuning was the most important part: making models imitate curated Question-Answer pairs. Now with RLVR and GRPO, we can make models learn through trial and error in dynamic environments. These environments are software artifacts that someone has to design and build.
But how do you build RL environments effectively?
In the repo, I cover:
- Mapping core RL concepts (Agents, Environments) to the LLM domain.
- Using the Verifiers open-source library to construct single-turn, multi-turn, and tool-use environments.
- Hands-on: taking a small language model (LiquidAI's LFM2-2.6B) and turning it into a Tic-Tac-Toe master that beats GPT-5-mini. Build the game Environment, use it to generate synthetic data for SFT warm-up, then train with Group-based Reinforcement Learning.
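To give a flavor of what an environment boils down to, here is a minimal sketch in plain Python. It is not the course's code and it does not use the Verifiers API: the names `play_episode` and `group_advantages`, the random opponent, and the reward values (win 1.0, draw 0.5, loss 0.0, illegal move -1.0) are all assumptions made for illustration. It shows a game loop that returns a verifiable reward, plus the group-relative normalization that group-based methods such as GRPO apply to the rewards of several rollouts of the same prompt.

```python
import random
import statistics

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player completed a line, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def play_episode(policy, rng):
    """Play one game: `policy` plays X, a random opponent plays O.

    `policy` is any callable mapping a board (list of 9 cells) to a cell index.
    In a real setup this would wrap a model call and parse its answer.
    Reward values are illustrative assumptions.
    """
    board = [" "] * 9
    player = "X"
    while winner(board) is None and " " in board:
        free = [i for i, cell in enumerate(board) if cell == " "]
        if player == "X":
            move = policy(list(board))
            if move not in free:
                return -1.0  # illegal move ends the episode with a penalty
        else:
            move = rng.choice(free)
        board[move] = player
        player = "O" if player == "X" else "X"
    result = winner(board)
    if result == "X":
        return 1.0
    return 0.5 if result is None else 0.0


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts (mean 0, unit scale)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def first_free(board):
    """Toy stand-in for a model: always pick the first empty cell."""
    return board.index(" ")


if __name__ == "__main__":
    rng = random.Random(0)
    rewards = [play_episode(first_free, rng) for _ in range(8)]
    print(rewards)
    print(group_advantages(rewards))
```

The point of the sketch is that the reward comes from the game rules rather than from a labeled answer, and that each rollout is scored relative to its siblings in the same group, so no separate value model is needed.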
---
Links
Course: https://github.com/anakin87/llm-rl-environments-lil-course
Video walkthrough: https://www.youtube.com/watch?v=71V3fTaUp2Q
Play against the trained model: https://huggingface.co/spaces/anakin87/LFM2-2.6B-mr-tictacto...
Datasets and Models on HF: https://huggingface.co/collections/anakin87/lfm2-26b-mr-tic-...
---
I'm fascinated by the idea of building these "little worlds" where LLMs can learn, so I hope it's useful.
Feel free to share opinions...