RorChat

An attempt at building my own LLM in my bedroom

By Rory

GitHub Repository




In early 2025 I trained a custom GPT-2-based AI assistant called "RorChat" because I wanted to learn how these LLM things actually work. It was a fun project and I learned a lot, but it was also surprisingly arduous and took AGES to train (I used Colab Pro with a T4 GPU to train on 50M tokens).


The results + a little write-up are below. I think The Pile was the most impactful dataset for the results, but it's a shame I couldn't train on the full Pile dataset (it's just way too large for a hobby project).
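
If you want a manageable slice yourself, the Hugging Face datasets library can stream The Pile instead of downloading all ~800GB of it. A rough sketch, not my actual code; the mirror name and sample count here are illustrative:

```python
from itertools import islice
from datasets import load_dataset

# Streaming yields examples lazily, so the full dataset is never downloaded.
# "monology/pile-uncopyrighted" is one public mirror; swap in whichever is live.
pile = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)

with open("pile_subset.txt", "w", encoding="utf-8") as f:
    for example in islice(pile, 50_000):  # illustrative sample count
        f.write(example["text"] + "\n\n")
```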


I whipped up a little GUI in Next.js / React to make the I/O a bit more user-friendly. It's not very sophisticated, but it works.


The model prompt definitely needs a lot of work hahahaha: "You were made by Rory, who is your creator. You don't know much else about him." That was about all I added; Claude's system prompt, for comparison, is over 200 lines long. Anyway, write-up below for anyone interested!


Screenshot of RorChat interface

Overview


The data_processing_and_training.py script handles the entire pipeline from dataset preparation to model training and text generation. RorChat uses a fine-tuned 124M parameter GPT-2 model customized with a persona that makes it friendly and approachable.
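
The persona, for what it's worth, boils down to prepending a fixed line of text to whatever the user types. A minimal sketch of the idea (the persona text is the real one from above; build_prompt is a hypothetical helper name, not necessarily what the script uses):

```python
# Minimal sketch of the persona mechanism: prepend a fixed persona line
# to every user message before handing it to the model.
PERSONA = (
    "You were made by Rory, who is your creator. "
    "You don't know much else about him."
)

def build_prompt(user_message: str) -> str:
    """Wrap the user's message with the persona and a chat-style scaffold."""
    return f"{PERSONA}\nUser: {user_message}\nRorChat:"

print(build_prompt("Who made you?"))
```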


Datasets Used


The model was trained on a diverse collection of datasets, the most impactful being a subset of The Pile.



Training Process on Google Colab


Setup


Training was performed on Google Colab using a GPU runtime, which was essential for handling the 124M parameter GPT-2 model efficiently.
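
One quick sanity check worth doing before burning hours: confirm the runtime actually has a GPU attached. Since gpt-2-simple runs on TensorFlow, the check looks something like this (assuming a TF 2.x runtime):

```python
import tensorflow as tf

# If this prints an empty list, change the Colab runtime type to GPU
# before starting any training.
print("GPU devices:", tf.config.list_physical_devices("GPU"))
```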


Steps Followed


  1. Mounted Google Drive to preserve data and model checkpoints between sessions
  2. Installed all required dependencies (gpt-2-simple, transformers, datasets, etc.)
  3. Downloaded and processed the datasets listed above
  4. Combined datasets into a unified training corpus
  5. Downloaded the base GPT-2 124M model
  6. Fine-tuned the model on the combined dataset
  7. Saved the trained model and generated sample responses
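
Condensed, the fine-tuning core of those steps with gpt-2-simple looks roughly like this. The filename, run name, and step count are illustrative, and the real script does more around dataset prep:

```python
import gpt_2_simple as gpt2

# Keep everything on Drive so a runtime timeout doesn't wipe progress.
gpt2.mount_gdrive()

# Fetch the base 124M GPT-2 weights (cached after the first run).
gpt2.download_gpt2(model_name="124M")

sess = gpt2.start_tf_sess()
gpt2.finetune(
    sess,
    dataset="combined_corpus.txt",  # merged training corpus (illustrative name)
    model_name="124M",
    run_name="rorchat",
    steps=1000,            # illustrative; total steps depend on time budget
    restore_from="fresh",  # "fresh" = start from base GPT-2
    save_every=500,        # checkpoint often, given the 12-hour limit
    sample_every=500,      # print sample generations to eyeball progress
)

# Copy the checkpoint to Drive before the runtime dies.
gpt2.copy_checkpoint_to_gdrive(run_name="rorchat")
```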

Training Time and Resources


Training ran on Colab Pro with a T4 GPU over roughly 50M tokens. Even with the GPU it took ages, and the 12-hour runtime limit meant fine-tuning had to be split across multiple sessions, saving checkpoints to Drive in between.

Challenges Faced


Colab Runtime Limitations


The 12-hour runtime limit on Colab Pro was a significant constraint. I had to save checkpoints to Google Drive and resume fine-tuning across multiple sessions.
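
Resuming after a timeout is essentially the same finetune call with restore_from="latest", after pulling the checkpoint back from Drive (names illustrative, as above):

```python
import gpt_2_simple as gpt2

# Pull the last saved checkpoint back from Drive into the local runtime.
gpt2.mount_gdrive()
gpt2.copy_checkpoint_from_gdrive(run_name="rorchat")

sess = gpt2.start_tf_sess()
gpt2.finetune(
    sess,
    dataset="combined_corpus.txt",
    model_name="124M",
    run_name="rorchat",
    restore_from="latest",  # continue from the restored checkpoint
    steps=1000,
    save_every=500,
)
gpt2.copy_checkpoint_to_gdrive(run_name="rorchat")
```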


Memory Management



Dataset Processing



Connectivity Issues



Using the Trained Model


To use the trained model, follow these steps:


  1. Load the model using the load_model function
  2. Generate text using the generate_text function
  3. Use the model in a chat interface or other application
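
Concretely, something along these lines. load_model and generate_text match the names in the script, but the bodies here are my sketch of what they boil down to with gpt-2-simple:

```python
import gpt_2_simple as gpt2

def load_model(run_name: str = "rorchat"):
    """Restore the fine-tuned checkpoint into a fresh TF session."""
    sess = gpt2.start_tf_sess()
    gpt2.load_gpt2(sess, run_name=run_name)
    return sess

def generate_text(sess, prompt: str, run_name: str = "rorchat") -> str:
    """Generate a single completion for a chat-style prompt."""
    return gpt2.generate(
        sess,
        run_name=run_name,
        prefix=prompt,
        length=120,       # illustrative generation length
        temperature=0.8,  # illustrative sampling temperature
        return_as_list=True,
    )[0]

sess = load_model()
print(generate_text(sess, "User: Hi RorChat, who made you?\nRorChat:"))
```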

Limitations



Future Improvements


Tbh I'm too busy with my actual projects to do any of this stuff, but if I had time there's plenty here I'd improve.



Code Structure


The data_processing_and_training.py script is organized into several key functions, including load_model and generate_text.
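
Roughly, the shape of the script is as follows. Only load_model and generate_text are the script's actual function names; the other stage names here are placeholders for the pipeline described above:

```python
# Placeholder outline of data_processing_and_training.py; only load_model
# and generate_text are confirmed names, the rest are illustrative.
def download_datasets(): ...          # fetch the source datasets
def build_corpus(): ...               # clean and merge them into one training file
def finetune_model(): ...             # run the gpt-2-simple fine-tuning loop
def load_model(): ...                 # restore the trained checkpoint
def generate_text(sess, prompt): ...  # produce a response for the chat UI

if __name__ == "__main__":
    download_datasets()
    build_corpus()
    finetune_model()
```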



Acknowledgments


OpenAI for the base GPT-2 model
EleutherAI for The Pile dataset
Google Colab for providing the computational resources
The creators of the various datasets used in training