LLMs for dummies: a walkthrough guide and glossary
A small glossary for anyone just starting out
Transformer — More than meets the eye… A type of model used in machine learning, especially for handling sequences of data like text or audio. It’s good at understanding context in sentences and can be used for translating languages, summarising text, or generating chatbot responses.
Large Language Model (LLM) — An AI model trained on huge amounts of text. Think of it as having absorbed a vast store of language knowledge, which lets it write articles, answer questions, or create realistic dialogues.
In short: a Transformer is a technique for processing sequences, while an LLM is a big AI model for language tasks, often built using the Transformer technique.
Interface — The part of a computer system or software that allows users to interact with it. Think of it as the front-end of a program where you type in your question or command, and the program responds.
Inference — In AI, this means using a trained model to make predictions or decisions. For example, after training a model to recognise cats in pictures, inference is when the model looks at a new picture and decides whether there’s a cat in it.
Supervised Learning — A way of training machines where you give the model examples with answers. Like showing a program lots of pictures of cats and telling it ‘this is a cat’ so it learns what cats look like.
Unsupervised Learning (heeeyo) — Training a machine without giving it the answers. The model looks at data and tries to find patterns or groups on its own. For example, it might sort different types of music into genres without being told the genre names.
Reinforcement Learning — Teaching machines through trial and error. The machine makes choices in a situation and gets rewards or penalties based on whether its choices are good or bad, learning over time to make better decisions (or to become resentful and secretive).
Neural Network — A computing system designed to work a bit like a human brain. It consists of lots of small units (like brain cells) that work together to process information and solve problems.
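If seeing those learning styles side by side in code helps, here’s a tiny sketch using scikit-learn and its built-in iris flower dataset (the library and data are just convenient examples, not the only way to do this):

```python
# Supervised vs unsupervised learning in miniature.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # flower measurements (X) and species labels (y)

# Supervised: we hand over the answers (y), so the model learns to predict them.
classifier = LogisticRegression(max_iter=1000).fit(X, y)
print(classifier.predict(X[:3]))   # predicted species for the first three flowers

# Unsupervised: no answers given; the model groups similar flowers on its own.
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)
print(clusters[:3])                # cluster IDs the model invented itself
```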
Creating an LLM
Gathering Your Data
Start by collecting a wide variety of text data. This could include books, online articles, or data from databases. The more diverse your data, the better your LLM will be at understanding different aspects of language.
Kaggle has great datasets for ML and data science projects. Check out Australian local and Kaggle grandmaster Jeremy Howard.
GitHub often hosts datasets published by researchers and developers. Good place to search.
Also worth mentioning: Google Scholar for datasets tied to published papers, and government open-data sites.
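Once you’ve picked a source, libraries like Hugging Face `datasets` can fetch many public corpora in a single line. A minimal sketch (the dataset name here is just an example; swap in whatever you found):

```python
from datasets import load_dataset

# "wikitext" is a small, freely available language-modelling corpus.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

print(dataset)              # row count and column names
print(dataset[:5]["text"])  # peek at the first few examples
```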
Preprocessing Data
Now, clean this data. This step is about fixing errors, removing parts that aren’t useful, and organising it so your AI can learn from it effectively.
Considerations
How will you handle missing values, fix formatting issues, and deal with duplicate data?
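To make those considerations concrete, here’s a minimal cleaning sketch with pandas. The file name and the "text" column are placeholders for your own data:

```python
import pandas as pd

df = pd.read_csv("raw_corpus.csv")

df = df.dropna(subset=["text"])           # handle missing values
df["text"] = df["text"].str.strip()       # fix stray whitespace and formatting
df = df[df["text"].str.len() > 0]         # drop rows that are now empty
df = df.drop_duplicates(subset=["text"])  # deal with duplicate data

df.to_csv("clean_corpus.csv", index=False)
```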
Choosing a Model Architecture
Model architecture is essentially the design or structure of the model, acting as the blueprint guiding how the AI processes information.
Transformer architecture is particularly well suited to sequential data like text, because it focuses on understanding the context within the data. We’ll stick with that for today.
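You rarely design an architecture from a blank page; libraries like Hugging Face `transformers` let you declare the blueprint in a few lines. A sketch with deliberately tiny, illustrative sizes (real LLMs use far larger values):

```python
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    n_positions=512,  # maximum sequence length
    n_embd=256,       # embedding dimension
    n_layer=4,        # number of Transformer blocks
    n_head=4,         # attention heads per block
)

model = GPT2LMHeadModel(config)  # randomly initialised, not yet trained
print(f"{model.num_parameters():,} parameters")
```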
Training the Model
Feed the prepared data into your AI model. This is where your AI starts learning the intricacies of language. Training can be time- and resource-intensive, especially with lots of data. (This is where I’d like to mention my buddies at Unsloth, podcast coming soon.)
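A hedged sketch of what this step can look like with the Hugging Face Trainer. Everything here (the dataset, the tiny model, one epoch) is placeholder-small; a real LLM needs vastly more data and compute:

```python
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2Config,
                          GPT2LMHeadModel, GPT2TokenizerFast, Trainer,
                          TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

model = GPT2LMHeadModel(GPT2Config(n_embd=256, n_layer=4, n_head=4))  # tiny, as above

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dataset = dataset.filter(lambda row: len(row["text"].strip()) > 0)  # drop empty lines
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
    # mlm=False means plain next-token (causal) language modelling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

trainer.save_model("my-tiny-lm")        # placeholder output path
tokenizer.save_pretrained("my-tiny-lm")
```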
Testing and Refining
After the training, evaluate how well your AI understands and generates language. Depending on the results, you might need to adjust and retrain to enhance its performance.
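One common sanity check is perplexity, roughly "how surprised the model is by text it hasn’t seen" (lower is better). A minimal sketch, assuming the model was saved to the placeholder path "my-tiny-lm" from the training sketch above:

```python
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("my-tiny-lm")
model = GPT2LMHeadModel.from_pretrained("my-tiny-lm")
model.eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog.",
                   return_tensors="pt")

with torch.no_grad():
    # Supplying labels makes the model return its average cross-entropy loss.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity: {math.exp(loss.item()):.1f}")
```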
Running the LLM
Now how do you run the beast?
Instead of building an LLM from scratch, you can use Hugging Face to access models already trained on crazy amounts of data. You can run these models either on their cloud service or download them to run locally on your machine.
Regardless of your choice, the key is to have a trained LLM model and the means to interact with it, whether through the internet or directly on your computer.
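To show how low the barrier actually is, here’s a minimal sketch of that route using the `transformers` pipeline ("gpt2" is just a small, freely downloadable example model):

```python
from transformers import pipeline

# Downloads the model from the Hugging Face Hub on first run,
# then everything executes locally on your machine.
generator = pipeline("text-generation", model="gpt2")

result = generator("Open source AI is", max_new_tokens=40)
print(result[0]["generated_text"])
```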
This is part one in a series of posts aimed at lowering the barrier to understanding and adopting open-source AI.
I write and produce podcasts over here-
Other links here https://linktr.ee/Unsupervisedlearning