Severian
posted an update 1 day ago
Create and Train Your Own Expert LLM: Generating Synthetic, Fact-Based Datasets with LMStudio/Ollama and then fine-tuning with MLX and Unsloth

Hey everyone!

I know there are tons of videos and tutorials out there already but I've noticed a lot of questions popping up in community posts about using synthetic datasets for creative projects and how to transform personal content into more factual material. In my own work doing enterprise-level SFT and crafting my open-source models, I've enhanced a Python framework originally shared by the creator of the Tess models. This improved stack utilizes local language models and also integrates the Wikipedia dataset to ensure that the content generated is as accurate and reliable as possible.

I've been thinking of putting together a comprehensive, step-by-step course/guide on creating your own Expert Language Model, covering everything from dataset preparation and training to deployment on Hugging Face, and even using something like AnythingLLM for user interaction. I'll walk you through each phase, clarifying complex concepts and troubleshooting common pitfalls.

Let me know if this interests you!

Most of the datasets and models I've made have used these scripts and my approach.

I'd be very interested. To me most of the usual tutorials are missing the "verify" steps. For example:

  • How does the user verify the integrity of the dataset without having to read thousands of question pairs (if possible at all)?
  • How does the user verify the LLM is even utilizing the dataset? And with what amount of preference is it being used over the base model?
  • How does the user verify that all the possible unwanted entries are gone?

Those kinds of questions are often missing.
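
As a rough sketch of how the first and third checks could be partly automated without reading every pair: the snippet below flags empty fields, very short answers, duplicate questions, and rows matching unwanted patterns. The field names `question` and `answer`, the function name `audit_pairs`, and the thresholds are assumptions for illustration, not part of the original framework. It does not address the second question (whether the fine-tuned model actually uses the dataset).

```python
import hashlib
import re


def audit_pairs(pairs, banned_patterns=(), min_answer_chars=20):
    """Return a dict mapping issue name -> list of row indices.

    `pairs` is assumed to be a list of dicts with "question" and
    "answer" keys (hypothetical schema; adapt to your dataset).
    """
    issues = {
        "empty_field": [],
        "short_answer": [],
        "duplicate_question": [],
        "banned_match": [],
    }
    seen_hashes = set()
    compiled = [re.compile(p, re.IGNORECASE) for p in banned_patterns]

    for i, row in enumerate(pairs):
        question = (row.get("question") or "").strip()
        answer = (row.get("answer") or "").strip()

        # Rows with a missing side cannot be checked further.
        if not question or not answer:
            issues["empty_field"].append(i)
            continue

        if len(answer) < min_answer_chars:
            issues["short_answer"].append(i)

        # Normalize case and whitespace so near-identical questions collide.
        normalized = " ".join(question.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            issues["duplicate_question"].append(i)
        else:
            seen_hashes.add(digest)

        # Flag any row where an unwanted pattern appears on either side.
        if any(p.search(question) or p.search(answer) for p in compiled):
            issues["banned_match"].append(i)

    return issues
```

The flagged indices only narrow down where to look; a small random sample of unflagged rows would still need a manual read.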

Yes please! I'm currently trying to create an industry/task-specific model. I'm working on a synthetic dataset but am mostly winging it. A comprehensive guide would be great, especially on the best formatting and processing for the final dataset.

I'm very interested. I also have a rudimentary GitHub project.
Can your project synthesize pre-training data?

What can be considered pre-training data?

As with any fresh model, it's the pipeline that should be taught first, since this is what the model has evolved from.
First, text generation:

LANGUAGE:
Here, if possible, use a large corpus of information, or short stories or short lessons (single posts), even children's books (it needs to understand language first).
Utterances are useful at this stage too, i.e. prompts without responses and responses without queries.

The AIM is to generate language while feeding in a correct corpus. For me, poetry, articles, etc. are the best way to pre-train a text-generation model.
As soon as we can generate good next-word predictions:

THOUGHT:
We can train for simple input/output sequences, i.e. question and answer.
So here we should begin with CONVERSATION: greetings (SO VITAL!), small talk, discussions on topics, movie scripts, character discussions in thematic personas, etc. Maybe not even knowledge-based QA (because it's pre-training only).
Then we can ADD THOUGHT:
i.e. the same QA and conversational data from the previous stage, with thoughts.

  • In this stage we can also begin to do some few-shot maths and problem-solving prompts (with chains of thought, with explanations and solutions as thoughts).

CONTEXT:
Then we can begin with context-based queries, i.e. provide a context and query for the answer. This task is the beginning of INSTRUCT.

INSTRUCT:
Using the context-style prompt, start adding the task-based queries for code, medical, etc.
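
One way to make the ordering above explicit is to tag each dataset with its stage and sort before training. This is only a sketch: the stage names are taken from the headings above, while `STAGES`, `order_by_stage`, and the `name`/`stage` keys are hypothetical.

```python
# Stage order as described above: language first, instruct last.
STAGES = ["language", "conversation", "thought", "context", "instruct"]


def order_by_stage(datasets):
    """Sort dataset descriptors into curriculum order.

    Each descriptor is assumed to be a dict with "name" and "stage" keys
    (hypothetical schema). Unknown stages raise instead of being
    silently placed somewhere arbitrary.
    """
    rank = {stage: i for i, stage in enumerate(STAGES)}
    unknown = [d["name"] for d in datasets if d["stage"] not in rank]
    if unknown:
        raise ValueError(f"datasets with unknown stage: {unknown}")
    # sorted() is stable, so datasets within one stage keep their order.
    return sorted(datasets, key=lambda d: rank[d["stage"]])
```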

TASKS:
When training for tasks, is it BEST TO OVERFIT? Yes: overfit the model and merge it into the base model, hence grabbing the fully embedded task while still retaining the original skills, even retraining on the same dataset with the merged model for a single epoch to align the new info into the merged model.
i.e. a specialized prompt: "what is a <>"
with a specialized output: "here is the definition >>>", which we may not wish to come up every time we say hello.
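
The comment does not say how the merge is done. As one simple possibility, assuming plain linear interpolation between base and task-tuned weights (my assumption, not necessarily what the commenter uses), a toy version over dicts of float lists looks like this. `merge_weights` and `alpha` are hypothetical names.

```python
def merge_weights(base, tuned, alpha=0.5):
    """Linearly interpolate two weight dicts of the same shape.

    alpha=0.0 returns the base weights, alpha=1.0 the tuned weights.
    Weights are plain lists of floats here to keep the sketch
    dependency-free; real checkpoints would use tensors.
    """
    if base.keys() != tuned.keys():
        raise ValueError("base and tuned models have different parameter names")

    merged = {}
    for name, base_values in base.items():
        tuned_values = tuned[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1 - alpha) * b + alpha * t
            for b, t in zip(base_values, tuned_values)
        ]
    return merged
```

A lower `alpha` keeps more of the base model's behavior, which is the trade-off the comment is pointing at: the task is embedded without the specialized output taking over every response.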

Later, by adding reasoning and other stuff, your model will begin to converge into an AI model (a chatbot with knowledge that can perform tasks), ready as a base model!!

MAYBE!

Hi,
I'm just learning about ML.
I'd like to build my module for a master's project with all kinds of photos and videos (NeRF), and use AI to create art/games.
Would this be helpful to me? I have a lot of data.

Awesome! I am already working on a better dataset generator. I'm thinking of making generation work step by step, like an agent. It's good but too slow 😭

Awesome!!