How to Build a Custom Training Dataset from Reddit and Niche Forums for AI Projects

By Victor Yakubu

Many fine-tuned large language models (LLMs) hallucinate not because their datasets are incorrect, but because those datasets are generic. Even as spending on AI training datasets grows at a compound annual rate of 20.5%, it is frustrating to find that your model underperforms simply because everyone is using data from the same sources.

Generic datasets present several issues, which the next section unpacks. To build effective models, you must go beyond standard datasets and ensure your training data is diverse, relevant, and representative of your users’ needs.

In this article, I will walk you through how to use a no-code Scraper API solution to build a custom dataset that you can use for training your AI models.

The problem with off-the-shelf datasets

Off-the-shelf datasets often fail to capture the full context of your users’ language. While they may loosely align with user intent, they are typically outdated and overused. As a result, they miss the subtle nuances in user queries.

There is also a risk of bias. When datasets are not evenly distributed across different perspectives, models may learn and reinforce that bias. Thus, personalizing your dataset helps reduce this risk.

The Solution: Creating custom datasets from niche forums

How can we address this problem? By sourcing data from niche forums such as Reddit, Stack Overflow, GitHub, etc. These niche platforms are rich with real-world, context-specific conversations. They offer targeted, user-generated content that is difficult to replicate in generic datasets.

This approach is especially relevant to anyone who works on classification models, fine-tuning LLMs, or customizing open models. However, for this article, we will be focusing on Reddit.

Why Reddit is a goldmine for AI training data

To optimize your dataset, you do not necessarily need more data. You need better data that is relatable and rich in context. Reddit provides:

Richer context

Niche forums focus on specific topics, so users tend to communicate with more detail and clarity. Reddit, for example, does not limit post or comment length, making it ideal for collecting in-depth, structured dialogue.

Higher credibility

Many niche forums attract professionals and experts, which increases the reliability of the information shared. Subreddits like r/AskScience are known for high-quality, expert-moderated answers on specialized subjects.

Authentic language

Niche forums are full of raw, unfiltered, user-generated content. This reflects how people naturally communicate, making these platforms valuable for training models on natural language, including slang, sentiment, and context.

But before you begin sourcing this data, you first need to know what you want to use it for, which brings us to the next section.

Define your use case: Chatbot, Classifier, or RAG?

Most training datasets for AI fall into three common categories, each supporting a core capability in modern applications:

  • Chatbots: trained on multi-turn dialogue or Q&A datasets.
  • Classifiers: taught to infer user goals from short queries, which is common in natural language understanding (NLU) systems.
  • Retrieval-augmented generation (RAG): combines datasets of questions, documents, and context to improve factual accuracy in generated responses.

Others include sentiment analysis, summarization, etc. Creating a custom dataset does not end with collecting data from forums. It also involves fine-tuning and evaluation (testing) to validate the dataset for your specific use case.

Your use case determines the required input format for your dataset. Below are the typical formats, with sample records after the list:

  • Chatbot: Question-and-answer (Q&A) pairs, formatted as prompt and completion
  • Classifier: Labeled text examples
  • RAG: High-quality text passages, split into clean, contextual chunks
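To make these formats concrete, here is one illustrative record for each. The field names are common conventions, not a fixed schema, so treat them as assumptions:

Chatbot (prompt-completion pair):

{"prompt": "What is CRISPR?", "completion": "CRISPR is a gene-editing technology..."}

Classifier (labeled text example):

{"text": "This new gene therapy looks promising!", "label": "positive"}

RAG (contextual chunk):

{"chunk": "CRISPR-Cas9 uses a guide RNA to direct the Cas9 enzyme to a target DNA sequence.", "source": "r/genetics"}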

For this article, we will focus on creating a custom training dataset that chatbots can use.

How to collect data from Reddit (or any niche forum)

First, you need to get the data. While web scraping is a common way to collect data, writing scraping scripts from scratch is often inefficient and time-consuming; it’s simply not practical if you’re scraping a large amount of data.

Tools like Bright Data’s AI Scraper can help you retrieve relevant, structured, easy-to-fine-tune datasets in seconds rather than hours.


If you’re using Bright Data, you can choose between these methods:

  • The Scraper API — If you want to integrate the scraper into your code
  • The No-Code Scraper — If you want to collect data through Bright Data without writing any code.

If your use case isn’t covered, you can build your web scraper using their JavaScript integrated development environment or request a custom scraper built specifically for you.
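For the code-first route, triggering a collection through the Scraper API looks roughly like the sketch below. Treat the endpoint, dataset ID, and input fields as assumptions: copy the exact values from your Bright Data dashboard and API documentation. The rest of this walkthrough uses the No-Code Scraper, so you can skip this if you prefer the dashboard.

import requests

API_TOKEN = "YOUR_BRIGHT_DATA_API_TOKEN"   # from your account settings
DATASET_ID = "YOUR_REDDIT_SCRAPER_ID"      # from the scraper's API page

# Hypothetical request shape: the dashboard shows the exact
# endpoint and accepted input fields for each scraper.
response = requests.post(
    "https://api.brightdata.com/datasets/v3/trigger",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    params={"dataset_id": DATASET_ID},
    json=[{"keyword": "genetics"}],
)
print(response.json())  # typically returns a snapshot ID you poll for results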

Steps to Collect Data from Reddit

1. Create a Bright Data account and access the “Web Scrapers Library”.

2. Search for your target domain, such as Reddit, and select it. Bright Data offers more than 120 popular domains.

3. A list of Reddit scrapers will appear. Select “Reddit — Posts — discover by keyword” for this use case.

4. Choose the “No-Code Scraper”.

5. Click “Add Input” and enter keywords related to the data you want to scrape, then click “Start Collecting.” Even without a Reddit login, you can customize parameters such as date range, number of posts, and sort order.

For this project, I used keywords related to “genetics.”

6. Once the scraper status shows “Ready”, click “Download”, choose “CSV” as the file format, and rename the file to reddit_dataset.csv for clarity.

What’s Inside the Dataset?

The dataset includes essential data fields that enable detailed analysis of information related to genetics.

Post Details

  • post_id, url, user_posted: The post ID, URL, and Reddit username of the author
  • title, description, date_posted: The title, description, and publication date of the post
  • num_comments, num_upvotes: Number of comments and upvotes per post
  • photos, videos, tags: Media elements and associated tags
  • related_posts, comments: Similar posts and associated comments

Community Details

  • community_name, community_url, community_description: Community details that identify the subreddit and its purpose
  • community_members_number, community_rank, post_karma: Signals of the community’s activity and influence
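Before cleaning anything, it helps to load the CSV and confirm these fields are present. A minimal sketch, assuming the column names listed above:

import pandas as pd

# Peek at the downloaded dataset to verify its structure
df = pd.read_csv("reddit_dataset.csv")
print(df.shape)  # (rows, columns)
print(df[["title", "num_comments", "num_upvotes"]].head())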

Build your dataset (step-by-step)

This section shows how to process and clean your raw dataset retrieved from Bright Data using Python.

For chatbot training, you’ll need question-and-answer (Q&A) pairs in a prompt-completion format. In this case, we’ll extract user queries (post titles and descriptions) and pair them with top-voted comments as answers.

Step 1: Set up the environment

1.1 Create the project directory

mkdir custom-training-dataset && cd custom-training-dataset

1.2 Set up a virtual environment

python -m venv venv

Activate the environment:

  • On Windows:
venv\Scripts\activate
  • On macOS/Linux:
source venv/bin/activate

1.3 Install dependencies

pip install pandas

1.4 Define the project structure

custom-training-dataset/
├── reddit_dataset.csv
└── clean.py

Step 2: Clean and annotate the Reddit dataset

2.1 Clean raw text and filter out posts

Use a clean_text() function to remove raw URLs, markdown, special characters, and line breaks.

import ast
import json
import re

import pandas as pd

# Load the Reddit CSV file
input_file = "reddit_dataset.csv"
df = pd.read_csv(input_file)


# Function to clean text (removes links, markdown, emojis, etc.)
def clean_text(text):
    if not isinstance(text, str):
        return ""
    text = re.sub(r'\[.*?\]\(.*?\)', '', text)  # Markdown links
    text = re.sub(r'http\S+', '', text)         # Raw URLs
    text = re.sub(r'[\*\_>`]', '', text)        # Markdown syntax
    text = re.sub(r'\n+', ' ', text)            # Line breaks
    text = re.sub(r'\s{2,}', ' ', text)         # Extra spaces
    return text.strip()


# Filter out posts with 0 upvotes or missing comments
filtered_df = df[(df['num_upvotes'] > 0) & (df['comments'].notnull())]

2.2 Convert to prompt-completion format

With low-signal posts already filtered out, format the cleaned dataset into prompt-completion pairs. Use the top-upvoted comment as the “completion” for the user’s “prompt.”

Export the data as a JSON Lines (JSONL) file. This format is ideal for training data, as each line contains a valid JSON object.

# Prepare the prompt-completion dataset
dataset = []


for _, row in filtered_df.iterrows():
    title = clean_text(row.get('title', ''))
    description = clean_text(row.get('description', '')) if pd.notna(row.get('description')) else ''

    # Combine title and description for the prompt
    prompt = f"{title}. {description}" if description else title

    try:
        # Parse the comments field (expected to be a list of dicts with 'comment' and 'upvotes')
        comments = ast.literal_eval(row['comments'])
        if isinstance(comments, list) and comments:
            # Sort comments by upvotes (descending)
            sorted_comments = sorted(
                comments,
                key=lambda x: x.get('upvotes', 0),
                reverse=True
            )
            best_comment = clean_text(sorted_comments[0].get('comment', ''))
            if best_comment:
                dataset.append({
                    "prompt": prompt,
                    "completion": best_comment
                })
                print(best_comment)
    except (ValueError, SyntaxError):
        # Skip rows with improperly formatted comment data
        continue


# Output the cleaned dataset as JSONL
output_file = "genetics_prompt_completion_dataset.jsonl"
with open(output_file, "w", encoding="utf-8") as f:
    for item in dataset:
        json.dump(item, f, ensure_ascii=False)
        f.write("\n")


print(f"✅ Done! Extracted {len(dataset)} prompt-completion pairs to '{output_file}'.")

Sample JSONL format:

{"prompt": "What is CRISPR?", "completion": "CRISPR is a gene-editing technology..."}
{"prompt": "Can gene mutations be reversed?", "completion": "In some cases, yes. Scientists..."}

Step 3: Evaluate the dataset

Use evaluation tools like TruLens to test the dataset for quality. These tools return metrics such as:

  • Accuracy score
  • Sentiment score
  • Precision score

These scores help assess how well your dataset will perform in training a chatbot.
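As a quick illustration of how the accuracy and precision numbers are derived, here is a minimal sketch using scikit-learn on hypothetical labels (the same shape of data appears in the sample classification dataset later in this section):

from sklearn.metrics import accuracy_score, precision_score

# Hypothetical model predictions vs. ground-truth labels
predictions = ["Positive", "Negative", "Positive", "Positive", "Neutral"]
ground_truth = ["Positive", "Negative", "Positive", "Neutral", "Neutral"]

print(accuracy_score(ground_truth, predictions))                    # 0.8
print(precision_score(ground_truth, predictions, average="macro"))  # ≈0.89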

Tip: Google Colab works well for fine-tuning and evaluating datasets.

3.1 Import the following dependencies in a new Python script (install any missing packages first, e.g., pip install trulens-eval langchain sentence-transformers scikit-learn transformers evaluate torch)

from trulens_eval import Feedback, Tru, TruLlama
from langchain.embeddings import HuggingFaceEmbeddings
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import json
import logging
import torch
import numpy as np
from datetime import datetime
from typing import List, Dict, Any, Tuple, Optional
import pandas as pd

3.2 Configure logging and initialize the evaluation metrics and models

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
# Suppress HF warnings
logging.getLogger("transformers").setLevel(logging.ERROR)


# Initialize evaluation metrics
bertscore = evaluate.load("bertscore")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")
bleurt = evaluate.load("bleurt", module_type="metric")  # More sensitive to small improvements


# Initialize models and tokenizers
tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
sentiment_model = pipeline(
    "text-classification",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
    tokenizer=tokenizer,
    return_all_scores=True
)


# Initialize embedding model for retrieval evaluation
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')  # Lightweight, good performance


# Constants
MAX_LENGTH = 512  # BERT models typically have 512 token limit

3.3 Create the ComprehensiveEvaluator class

Define a ComprehensiveEvaluator class that initializes TruLens and bundles helper methods for the individual metrics. The evaluation results are saved for further analysis.

class ComprehensiveEvaluator:
    """Evaluates model performance on retrieval, conversation, and classification tasks."""

    def __init__(self, dataset_path: Optional[str] = None):
        """Initialize the evaluator with a dataset path.

        Args:
            dataset_path: Path to the dataset JSONL file
        """
        self.dataset = self.load_dataset(dataset_path) if dataset_path else []
        self.results = []


        # Initialize TruLens
        self.tru = Tru()


        # Initialize OpenAI feedback provider if using OpenAI
        # self.openai_provider = OpenAIProvider()


    def load_dataset(self, path: str) -> List[Dict[str, Any]]:
        """Load dataset from a JSON file."""
        try:
            with open(path, "r") as f:
                return [json.loads(line) for line in f]
        except Exception as e:
            logger.error(f"Error loading dataset: {e}")
            return []


    def load_dataset_from_dict(self, data: List[Dict[str, Any]]) -> None:
        """Load dataset from a dictionary."""
        self.dataset = data


    def truncate_text(self, text: str) -> str:
        """Truncate text to fit within model's maximum token length."""
        encoded_text = tokenizer(text, truncation=True, max_length=MAX_LENGTH, return_tensors="pt")
        return tokenizer.decode(encoded_text["input_ids"][0])


    def calculate_sentiment_score(self, text: str) -> float:
        """Calculate sentiment score (0-1) where 1 is most positive."""
        try:
            scores = sentiment_model(text)[0]
            # Convert from 1-5 scale to 0-1
            total = sum(item['score'] * (int(item['label'][0]) - 1) for item in scores)
            return total / 4
        except Exception as e:
            logger.error(f"Sentiment error: {e}")
            return 0.5  # Neutral default


    def calculate_text_similarity(self, text1: str, text2: str) -> float:
        """Calculate semantic similarity between two texts using embeddings."""
        try:
            embedding1 = embedding_model.encode(text1, convert_to_tensor=True)
            embedding2 = embedding_model.encode(text2, convert_to_tensor=True)
            return float(util.pytorch_cos_sim(embedding1, embedding2).item())
        except Exception as e:
            logger.error(f"Similarity error: {e}")
            return 0.0


    def calculate_bertscore(self, prediction: str, reference: str) -> float:
        """Calculate BERTScore (F1) between prediction and reference."""
        try:
            results = bertscore.compute(
                predictions=[prediction],
                references=[reference],
                lang="en"
            )
            return results["f1"][0]
        except Exception as e:
            logger.error(f"BERTScore error: {e}")
            return 0.0

3.4 Add sample datasets for the different task types

# Example usage for the different task types
def create_sample_datasets():
    """Create sample datasets for each task type."""


    # Sample classification dataset
    classification_data = [
        {
            "task_type": "classification",
            "prompt": "Classify this sentiment: I love this product!",
            "completion": "Positive",
            "predictions": ["Positive", "Negative", "Positive", "Positive", "Neutral"],
            "ground_truth": ["Positive", "Negative", "Positive", "Neutral", "Neutral"]
        }
    ]


    # Sample retrieval dataset
    retrieval_data = [
        {
            "task_type": "retrieval",
            "query": "What causes climate change?",
            "retrieved_docs": [
                "Climate change is caused by greenhouse gas emissions.",
                "The primary factors in climate change are human activities.",
                "Deforestation contributes to climate change by reducing carbon sinks.",
                "Industrial processes release carbon dioxide that warms the planet."
            ],
            "relevant_docs": [
                "Climate change is caused by greenhouse gas emissions.",
                "Industrial processes release carbon dioxide that warms the planet.",
                "The primary factors in climate change are human activities."
            ]
        }
    ]


    # Sample conversation dataset
    conversation_data = [
        {
            "task_type": "conversation",
            "prompt": "Can you explain how DNA replication works?",
            "completion": "DNA replication is the process by which DNA makes a copy of itself before cell division. The double helix structure unwinds, and each strand serves as a template for the creation of a new complementary strand. This process is catalyzed by enzymes like DNA polymerase.",
            "reference": "DNA replication is the biological process of producing two identical replicas of DNA from one original DNA molecule. It occurs in all living organisms and is the basis for biological inheritance. The process starts when proteins recognize the origin of replication, where the DNA double helix is unwound and unzipped by helicase. Each strand then serves as a template for the new DNA molecule."
        }
    ]


    return classification_data + retrieval_data + conversation_data

3.5 Instantiate the evaluator and print summary statistics

if __name__ == "__main__":
    # Create an instance of the evaluator
    evaluator = ComprehensiveEvaluator()


    # Option 1: Load dataset from file
    evaluator.load_dataset("genetics_prompt_completion_dataset.json")


    # Option 2: Create sample dataset
    sample_data = create_sample_datasets()
    evaluator.load_dataset_from_dict(sample_data)


    # Run evaluation
    results = evaluator.evaluate_dataset()


    # Print summary statistics
    summary = evaluator.get_summary_statistics()
    print("\n=== Evaluation Summary ===")
    for task_type, metrics in summary.items():
        print(f"\n{task_type.upper()} METRICS:")
        for metric, value in metrics.items():
            print(f"  {metric}: {value:.4f}")


    # Save results
    output_path = evaluator.save_results()
    print(f"\nDetailed results saved to: {output_path}")


    # Generate visualizations
    viz_files = evaluator.visualize_results()
    print(f"Generated {len(viz_files)} visualization files.")

Sample evaluation results

The dataset achieved the following scores:

  • BERTScore: 97% — Measures semantic similarity between predicted and reference texts
  • Precision score: 80% — Measures the quality of positive predictions
  • Sentiment score: 57% — Close to the neutral midpoint of 50%, indicating minimal sentiment bias

These results suggest the dataset is clean, relevant, and ready for use in training.

Conclusion

AI is here to stay, and the most effective AI systems won’t be driven by the best models alone, but also by the best data.

And oftentimes, the best data comes from real communities where real users communicate. Off-the-shelf datasets often lack context, detail, and diversity. In contrast, niche forums like Reddit offer richer, more relevant data reflecting how users talk and think.

You can take advantage of tools like Bright Data’s AI Scraper to build domain-specific LLMs with high-quality data without the stress of writing a scraping script.