Wink - AI-native innovation, loyal to users, a personalized intelligent experience

Recently on Reddit, I came across a developer sharing an LLM training tool called 'cleanai', written in C. Its main selling point is simplicity: you can start training a model with just two commands.

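```bash
# Step 1: interactively generate a configuration file
cleanai --init-config config.json
# Step 2: create a new model, then run pretraining and fine-tuning
cleanai --new --config config.json --pretrain --train
```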

The first command walks users through generating a configuration file and explains the purpose of each parameter along the way. The second command creates a new model and runs both pretraining and fine-tuning. The tool is implemented entirely in C and does not depend on any machine learning libraries.

![CLI interface example](https://via.placeholder.com/600x300)

*The tool's command line interface*

Some commenters asked whether the tool handles dataset acquisition and tokenization. The author responded that users must prepare the datasets themselves, but tokenization is built in. In the configuration builder, users only need to provide the file paths of the pretraining and fine-tuning datasets; the program handles tokenization and the rest of the processing automatically.

Pretraining data can be unstructured text, such as a Wikipedia dump, and teaches the model to understand language. Fine-tuning data must follow the JSON format specified in the README and trains the model for specific tasks.
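Since the datasets have to be prepared by the user, a small preflight check before launching a long training run can catch a wrong path early. The following is a minimal bash sketch, not part of cleanai; the function name `check_dataset` and the file names in the usage line are hypothetical, and it only checks that each file exists and is non-empty (it does not validate the JSON format from the README).

```bash
# Hypothetical helper, not part of cleanai: verify that a dataset file
# exists and is non-empty before starting a training run.
check_dataset() {
    local label="$1" path="$2"
    if [ ! -f "$path" ]; then
        echo "$label dataset not found: $path" >&2
        return 1
    fi
    if [ ! -s "$path" ]; then
        echo "$label dataset is empty: $path" >&2
        return 1
    fi
    return 0
}
```

A possible use, with placeholder file names: `check_dataset pretraining wiki.txt && check_dataset fine-tuning tasks.json && cleanai --new --config config.json --pretrain --train`.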

The tool also provides a checkpoint after every epoch: training pauses and gives users 30 seconds to decide whether to stop training, test the model, or adjust parameters. If no action is taken, training continues automatically.
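The pause-with-timeout pattern described here can be illustrated in a few lines of bash. This is only a sketch of the general idea, not the tool's actual C implementation; the function name `epoch_pause` and the key letters are assumptions made for illustration. It relies on the `-t` option of bash's `read` builtin, which gives up after the given number of seconds.

```bash
# Hypothetical sketch of a timed prompt between epochs (not cleanai's code).
# Prints one of: continue, stop, test, adjust.
epoch_pause() {
    local timeout="${1:-30}" answer=""
    # read -t returns non-zero on timeout or end of input;
    # both are treated as "no action taken".
    if ! read -r -t "$timeout" -p "[s]top, [t]est, [a]djust, or wait to continue: " answer; then
        echo "continue"
        return 0
    fi
    case "$answer" in
        s|S) echo "stop" ;;
        t|T) echo "test" ;;
        a|A) echo "adjust" ;;
        *)   echo "continue" ;;   # unrecognized input: keep training
    esac
}
```

A training loop could call `choice="$(epoch_pause 30)"` after each epoch and branch on the result; defaulting to `continue` means an unattended run is never blocked.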

The project is open source on GitHub (https://github.com/willmil11/cleanai-c), with installation scripts provided. Installation requires the fish shell, which can be installed through the package managers of major Linux distributions and on macOS. The author also notes that if `gcc` on the system is actually an alias for clang, the installation script will detect this and ask for the command that invokes the real GCC.
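One common way to detect this situation is to inspect the compiler's `--version` output for the word "clang". The sketch below shows that idea in bash; it is an assumption about how such a check could work, not the project's installation script, and the function names `looks_like_clang` and `compiler_is_clang` are hypothetical.

```bash
# Hypothetical sketch, not taken from the cleanai-c installer.
# Reads compiler version text on stdin; succeeds if it mentions clang.
looks_like_clang() {
    grep -qi 'clang'
}

# Succeeds if the given compiler command (default: gcc) reports itself as clang.
compiler_is_clang() {
    local cc="${1:-gcc}"
    "$cc" --version 2>/dev/null | looks_like_clang
}
```

For example, `if compiler_is_clang gcc; then echo "gcc is really clang; please provide the real GCC command"; fi` would prompt the user only when the alias is present.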

The concept is an interesting one: reimplementing the LLM training process in low-level C and avoiding the overhead of frameworks like PyTorch. It may not be as feature-complete as mature frameworks, but it is a valuable reference for anyone who wants to understand how LLM training works under the hood or who wants to get as much performance as possible out of their hardware.

Wink Pings

I Built an LLM Training Tool in C Language with Just 2 Commands, No PyTorch Required