Tutorial: Text Classification¶
Build a complete text classifier from scratch using HoloVec’s hyperdimensional computing approach.
In this tutorial, you’ll create a sentiment analysis system that classifies movie reviews as positive or negative. We’ll use n-gram encoding to capture local word patterns and build prototypes for each class.
Time: 20-30 minutes
What you’ll learn:
How to encode text with n-grams for classification
Building class prototypes from training examples
Classifying new text with similarity matching
Evaluating and optimizing classifier performance
Best practices for text classification with HDC
Prerequisites¶
Basic Python programming
Understanding of text classification concepts
HoloVec installed (pip install holovec)
Overview¶
Text classification with HDC works differently from traditional approaches:
Traditional ML:
1. Extract features (TF-IDF, word embeddings)
2. Train a classifier (SVM, neural network)
3. Tune it through complex optimization with many parameters
HDC Approach:
1. Encode text as hypervectors (n-grams)
2. Bundle examples into class prototypes
3. Classify by similarity matching
The result is fast, simple, and interpretable.
Advantages:
No gradient descent or complex optimization
Few hyperparameters
Fast training (single pass)
Incremental learning (add examples anytime)
Interpretable (similarity scores)
Step 1: Setup and Imports¶
First, let’s import everything we need:
import numpy as np
from holovec import VSA
from holovec.encoders import NGramEncoder
from collections import Counter
# For reproducibility
np.random.seed(42)
print("HoloVec Text Classification Tutorial")
print("=" * 50)
Step 2: Choose Model and Parameters¶
Select a VSA model and configure encoding parameters:
# Create VSA model
# FHRR: Good for general-purpose classification
# 10000 dimensions: Balance between capacity and speed
model = VSA.create('FHRR', dim=10000, seed=42)
print(f"\nModel: {model.model_name}")
print(f"Dimension: {model.dimension}")
print(f"Capacity (rough heuristic): ~{model.dimension // 100} distinct items")
Why these choices?
FHRR model: Supports smooth similarity, good for text
10,000 dimensions: Enough capacity for vocabulary + n-grams
Seed: Ensures reproducible results
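The link between dimension and capacity can be illustrated without HoloVec at all. The sketch below is a stand-in that uses plain Python lists of +1/-1 values (not FHRR's actual representation) to show that two independent random vectors become closer to orthogonal as the dimension grows. The helper names random_bipolar and cosine are made up for this illustration.

```python
import math
import random

def random_bipolar(dim, rng):
    """Return a random vector of +1/-1 entries (a stand-in for a hypervector)."""
    return [rng.choice((-1, 1)) for _ in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

rng = random.Random(42)
for dim in (100, 1000, 10000):
    a = random_bipolar(dim, rng)
    b = random_bipolar(dim, rng)
    # The typical magnitude of the similarity shrinks like 1 / sqrt(dim)
    print(f"dim={dim:>6}: similarity={cosine(a, b):+.3f}, "
          f"1/sqrt(dim)={1 / math.sqrt(dim):.3f}")
```

The less random vectors interfere with each other, the more of them a bundle can hold before its members become hard to tell apart, which is why a larger dimension gives more headroom.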
Step 3: Prepare Training Data¶
Create a simple movie review dataset:
# Training examples: (text, label)
training_data = [
# Positive reviews
("This movie was excellent and entertaining", "positive"),
("I loved this film it was amazing", "positive"),
("Great acting and wonderful story", "positive"),
("Best movie I have seen this year", "positive"),
("Fantastic film highly recommended", "positive"),
("Brilliant performance truly outstanding", "positive"),
# Negative reviews
("This movie was terrible and boring", "negative"),
("I hated this film it was awful", "negative"),
("Poor acting and weak story", "negative"),
("Worst movie I have seen this year", "negative"),
("Horrible film do not recommend", "negative"),
("Terrible performance very disappointing", "negative"),
]
print(f"\nTraining examples: {len(training_data)}")
print(f" Positive: {sum(1 for _, label in training_data if label == 'positive')}")
print(f" Negative: {sum(1 for _, label in training_data if label == 'negative')}")
Real-world datasets:
For production use, consider:
IMDB movie reviews (50K reviews)
Amazon product reviews
Twitter sentiment datasets
Your own labeled data
Step 4: Text Preprocessing¶
Simple preprocessing to normalize text:
def preprocess(text):
    """Basic text preprocessing."""
    # Lowercase and split into words
    words = text.lower().split()
    # Remove punctuation (simple approach)
    words = [w.strip('.,!?;:()[]"\'') for w in words]
    # Remove empty strings
    words = [w for w in words if w]
    return words
# Test preprocessing
sample_text = "This movie was excellent and entertaining!"
sample_words = preprocess(sample_text)
print(f"\nPreprocessing example:")
print(f" Original: {sample_text}")
print(f" Processed: {sample_words}")
Extension ideas:
Remove stop words (‘the’, ‘a’, ‘is’)
Stemming/lemmatization
Handle special characters
N-gram at character level for misspellings
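Two of the extension ideas above can be sketched in a few lines of plain Python. The stop-word set and the helper names remove_stop_words and char_ngrams are illustrative assumptions, not part of HoloVec. For sentiment tasks, be careful which words you treat as stop words, since negations such as "not" carry meaning.

```python
# Illustrative stop-word list; a real one would be longer and task-specific.
STOP_WORDS = {"the", "a", "an", "is", "was", "it", "and"}

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Drop very common words that carry little class information."""
    return [w for w in words if w not in stop_words]

def char_ngrams(word, n=3):
    """Split a word into character n-grams, with < and > marking its boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

words = "this movie was excellent and entertaining".split()
print(remove_stop_words(words))

print(char_ngrams("movie"))

# A misspelling still shares most of its character trigrams with the original
shared = set(char_ngrams("excellent")) & set(char_ngrams("excelent"))
print(sorted(shared))
```

Character n-grams could then take the place of whole words as the items that receive hypervectors, so that a misspelled word still produces a similar encoding.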
Step 5: Build Vocabulary¶
Extract all unique words from training data:
# Extract vocabulary from training data
all_words = []
for text, _ in training_data:
    all_words.extend(preprocess(text))
# Get unique words and their frequencies
word_freq = Counter(all_words)
vocabulary = list(word_freq.keys())
print(f"\nVocabulary statistics:")
print(f" Unique words: {len(vocabulary)}")
print(f" Total words: {len(all_words)}")
print(f" Most common: {word_freq.most_common(5)}")
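# Create a hypervector for each word.
# Python's built-in hash() of a string is randomized per process (unless
# PYTHONHASHSEED is set), so derive the seed from a stable CRC32 checksum instead.
import zlib
word_hvs = {
    word: model.random(seed=zlib.crc32(word.encode("utf-8")) % 100000)
    for word in vocabulary
}
print(f" Word hypervectors created: {len(word_hvs)}")
Important notes:
Each word gets a unique random hypervector
Seeding from a stable checksum of the word (zlib.crc32) ensures same word → same HV across runs; the built-in hash(word) does not, because string hashing is randomized per process
For large vocabularies (>1000 words), consider filtering rare words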
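Filtering rare words, as suggested in the last note, might look like the sketch below. The function build_vocabulary and its min_count parameter are hypothetical names chosen for this example. Note that on a dataset as small as this tutorial's, a threshold of 2 would discard most sentiment words, so this only makes sense for larger corpora.

```python
from collections import Counter

def build_vocabulary(token_lists, min_count=2):
    """Keep only words that occur at least min_count times across all documents."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return sorted(w for w, c in counts.items() if c >= min_count)

docs = [
    ["this", "movie", "was", "excellent"],
    ["this", "movie", "was", "terrible"],
    ["great", "movie"],
]
vocab = build_vocabulary(docs, min_count=2)
print(vocab)  # ['movie', 'this', 'was']
```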
Step 6: Create N-gram Encoder¶
Set up an encoder to capture word sequences:
# Create bigram encoder (n=2)
# Captures pairs of consecutive words
encoder = NGramEncoder(
model,
item_to_hv=word_hvs,
n=2, # Bigrams: "this movie", "movie was", etc.
mode='bundle' # Bundle all bigrams together
)
print(f"\nN-gram Encoder:")
print(f" N-gram size: 2 (bigrams)")
print(f" Mode: bundle")
print(f" Example bigrams from '{sample_text}':")
# Show example bigrams
words = preprocess(sample_text)
for i in range(len(words) - 1):
    bigram = f"{words[i]} {words[i+1]}"
    print(f" {bigram}")
N-gram size selection:
n=1 (unigrams): Bag-of-words, no order information
n=2 (bigrams): Captures local word pairs (recommended)
n=3 (trigrams): More specific patterns, needs more data
Higher n → more specific but requires more training examples
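To see why n-grams with n > 1 carry order information, here is a minimal sketch of one common HDC scheme: shift each word vector by its position in the n-gram, then bind the shifted vectors by element-wise multiplication. This is an assumption for illustration only; it uses plain +1/-1 lists and may differ from what NGramEncoder does internally.

```python
import random

DIM = 2000
rng = random.Random(0)
# Random +1/-1 item vectors for a tiny vocabulary
item = {w: [rng.choice((-1, 1)) for _ in range(DIM)] for w in ("not", "good", "bad")}

def permute(v, k):
    """Cyclically shift v by k positions (marks a word's position in the n-gram)."""
    k %= len(v)
    return v[-k:] + v[:-k] if k else list(v)

def bind(a, b):
    """Element-wise product; the result is dissimilar to both inputs."""
    return [x * y for x, y in zip(a, b)]

def encode_ngram(words):
    """Bind position-shifted word vectors into a single n-gram vector."""
    out = [1] * DIM
    for pos, w in enumerate(words):
        out = bind(out, permute(item[w], len(words) - 1 - pos))
    return out

def similarity(a, b):
    """Normalized dot product; 1.0 for identical bipolar vectors."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

ng1 = encode_ngram(["not", "good"])
ng2 = encode_ngram(["good", "not"])
print(similarity(ng1, ng1))           # same bigram: 1.0
print(f"{similarity(ng1, ng2):.3f}")  # swapped order: close to 0
```

With n=1 no shifting or binding takes place, which is why unigrams reduce to a bag of words.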
Step 7: Encode Training Examples¶
Convert each text into a hypervector:
# Encode all training examples
encoded_examples = []
print("\nEncoding training examples...")
for text, label in training_data:
    words = preprocess(text)
    hv = encoder.encode(words)
    encoded_examples.append((hv, label))
print(f" Encoded: {len(encoded_examples)} examples")
# Check encoding
ex_text, ex_label = training_data[0]
ex_hv, _ = encoded_examples[0]
print(f"\nExample encoding:")
print(f" Text: '{ex_text}'")
print(f" Label: {ex_label}")
print(f" HV shape: {ex_hv.shape}")
print(f" HV type: {type(ex_hv)}")
Step 8: Build Class Prototypes¶
Create a prototype for each class by bundling examples:
# Group examples by class
class_hvs = {}
for label in ['positive', 'negative']:
    # Get all hypervectors for this class
    hvs = [hv for hv, lbl in encoded_examples if lbl == label]
    # Bundle them into a prototype
    class_hvs[label] = model.bundle(hvs)
    print(f"\n{label.capitalize()} prototype:")
    print(f" Examples bundled: {len(hvs)}")
    print(f" Prototype shape: {class_hvs[label].shape}")
What is bundling?
Bundling (superposition) combines multiple hypervectors into one that is similar to all of them. It’s like averaging but preserves the high-dimensional structure.
Input: N hypervectors representing positive reviews
Output: 1 prototype hypervector that captures “positive-ness”
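The claim that a bundle stays similar to all of its inputs can be checked with a small stand-in. The sketch below uses +1/-1 lists and an element-wise majority vote, which is one common way to bundle bipolar vectors. How model.bundle works for FHRR is not shown here, so treat this as an analogy rather than HoloVec's implementation.

```python
import random

DIM = 2000
rng = random.Random(1)

def random_hv():
    """Random +1/-1 vector standing in for an encoded review."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]

def bundle(hvs):
    """Element-wise majority vote; ties (possible for even counts) go to +1."""
    return [1 if sum(col) >= 0 else -1 for col in zip(*hvs)]

def similarity(a, b):
    """Normalized dot product in [-1, 1]."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

members = [random_hv() for _ in range(5)]
prototype = bundle(members)
outsider = random_hv()

for i, hv in enumerate(members):
    print(f"member {i}: {similarity(prototype, hv):.3f}")   # clearly above 0
print(f"outsider: {similarity(prototype, outsider):.3f}")   # close to 0
```

Each member's similarity to the prototype drops as more vectors are bundled, which is the capacity limit that the choice of dimension is meant to offset.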
Step 9: Classify New Text¶
Test the classifier on new examples:
def classify(text):
    """Classify a text string."""
    # Preprocess and encode
    words = preprocess(text)
    # Handle unknown words gracefully
    known_words = [w for w in words if w in word_hvs]
    if not known_words:
        return None, 0.0 # Cannot classify
    test_hv = encoder.encode(known_words)
    # Find most similar class
    best_label = None
    best_sim = float('-inf')
    for label, prototype in class_hvs.items():
        sim = float(model.similarity(test_hv, prototype))
        if sim > best_sim:
            best_sim = sim
            best_label = label
    return best_label, best_sim
# Test examples
test_reviews = [
"This film was amazing and wonderful",
"Terrible movie very disappointing",
"Great story and excellent acting",
"Awful film worst ever",
]
print("\n" + "=" * 50)
print("Classification Results")
print("=" * 50)
for text in test_reviews:
    label, sim = classify(text)
    print(f"\nReview: '{text}'")
    print(f" Predicted: {label}")
    print(f" Confidence: {sim:.3f}")
Step 10: Evaluate Performance¶
Test on held-out data and compute accuracy:
# Create test set
test_data = [
# Positive
("Excellent movie highly enjoyable", "positive"),
("Loved the story and acting", "positive"),
("Outstanding film wonderful experience", "positive"),
# Negative
("Poor film very boring", "negative"),
("Hated this movie terrible", "negative"),
("Disappointing and awful story", "negative"),
]
# Evaluate
correct = 0
total = len(test_data)
print("\n" + "=" * 50)
print("Evaluation on Test Set")
print("=" * 50)
for text, true_label in test_data:
    pred_label, confidence = classify(text)
    is_correct = (pred_label == true_label)
    correct += is_correct
    marker = "✓" if is_correct else "✗"
    print(f"\n{marker} '{text}'")
    print(f" True: {true_label}, Predicted: {pred_label} ({confidence:.3f})")
accuracy = correct / total
print(f"\n" + "=" * 50)
print(f"Accuracy: {correct}/{total} = {accuracy:.1%}")
print("=" * 50)
Typical results (rough guidance; actual accuracy depends on the data and preprocessing):
Small dataset (like ours): 70-90% accuracy
Medium dataset (hundreds of examples): 85-95% accuracy
Large dataset (thousands of examples): 90-98% accuracy
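Accuracy alone can hide one class being predicted far more often than the other. As a complement, the sketch below computes per-class precision and recall from (true, predicted) label pairs. The function name per_class_metrics and the sample results are made up for illustration and are not output from the classifier above.

```python
from collections import Counter

def per_class_metrics(pairs):
    """Compute precision and recall per class from (true, predicted) label pairs."""
    pairs = list(pairs)
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in pairs:
        if true == pred:
            tp[true] += 1
        else:
            fn[true] += 1   # the true class was missed
            fp[pred] += 1   # the predicted class was wrongly assigned
    metrics = {}
    for label in {t for t, _ in pairs}:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        metrics[label] = {
            "precision": tp[label] / p_den if p_den else 0.0,
            "recall": tp[label] / r_den if r_den else 0.0,
        }
    return metrics

# Hypothetical evaluation results: one positive review misclassified
results = [
    ("positive", "positive"), ("positive", "positive"), ("positive", "negative"),
    ("negative", "negative"), ("negative", "negative"), ("negative", "negative"),
]
for label, m in sorted(per_class_metrics(results).items()):
    print(f"{label}: precision={m['precision']:.2f}, recall={m['recall']:.2f}")
```

To use it with the evaluation loop, collect (true_label, pred_label) for each test example and pass the list to the function.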
Step 11: Analyze Class Similarities¶
Understand the learned representations:
# Similarity between class prototypes
pos_neg_sim = float(model.similarity(
class_hvs['positive'],
class_hvs['negative']
))
print(f"\nClass Analysis:")
print(f" Positive-Negative similarity: {pos_neg_sim:.3f}")
print(f" (Close to 0 = well-separated classes)")
# Most confident classifications
print(f"\nConfidence analysis:")
for text in test_reviews:
    label, sim = classify(text)
    print(f" {label:8s}: {sim:.3f} - '{text[:40]}...'")
Good separation indicators:
Class prototypes have low similarity (< 0.1)
Confident predictions have high similarity (> 0.5)
Wrong predictions often have low confidence
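Beyond the raw similarity, the gap between the best and second-best class is a useful signal: a small gap means the text sits about equally close to both prototypes. A minimal sketch follows; it assumes you have already collected the per-class similarities into a dict, and classify_with_margin is a hypothetical helper name.

```python
def classify_with_margin(similarities):
    """Return (best label, its similarity, margin over the runner-up).

    similarities: dict mapping class label -> similarity score.
    """
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_sim = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
    return best_label, best_sim, best_sim - runner_up

# Clear decision: large margin between the two classes
print(classify_with_margin({"positive": 0.42, "negative": 0.05}))
# Ambiguous decision: both prototypes are about equally similar
print(classify_with_margin({"positive": 0.21, "negative": 0.19}))
```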
Step 12: Extensions and Improvements¶
Ways to improve the classifier:
1. Add more training data:
# More examples → better prototypes
# Aim for 50-100+ examples per class
2. Tune n-gram size:
# Try trigrams for more context
encoder_3gram = NGramEncoder(
model,
item_to_hv=word_hvs,
n=3, # Trigrams
mode='bundle'
)
3. Combine multiple n-gram sizes:
def encode_multi_ngram(words):
    """Encode with multiple n-gram sizes."""
    hv_bigram = encoder.encode(words)  # bigram encoder from Step 6
    hv_trigram = encoder_3gram.encode(words)  # trigram encoder defined above
    # Bundle both representations
    return model.bundle([hv_bigram, hv_trigram])
4. Add confidence threshold:
def classify_with_threshold(text, threshold=0.3):
    """Classify with confidence threshold."""
    label, sim = classify(text)
    if sim < threshold:
        return "uncertain", sim
    return label, sim
5. Handle unknown words:
# Add <UNK> token for unknown words
word_hvs['<UNK>'] = model.random(seed=999)
def encode_with_unk(words):
    safe_words = [w if w in word_hvs else '<UNK>' for w in words]
    return encoder.encode(safe_words)
6. Use larger vocabulary:
# Pre-trained word lists
# Common English words, domain-specific terms, etc.
7. Incremental learning:
def add_training_example(text, label):
    """Add new example to existing prototype."""
    words = preprocess(text)
    new_hv = encoder.encode(words)
    # Update prototype by bundling with new example.
    # Caveat: if bundle() normalizes its output, the new example is weighted
    # as heavily as the whole existing prototype.
    class_hvs[label] = model.bundle([
        class_hvs[label],
        new_hv
    ])
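One way to keep every training example equally weighted during incremental updates is to store an unnormalized running sum per class and derive the prototype from it on demand. The sketch below shows the idea with +1/-1 lists. RunningPrototype is a hypothetical helper, and whether this maps directly onto HoloVec's FHRR vectors is an assumption you would need to verify.

```python
class RunningPrototype:
    """Keeps an element-wise running sum so every example has equal weight."""

    def __init__(self, dim):
        self.total = [0] * dim
        self.count = 0

    def add(self, hv):
        """Fold one more encoded example into the running sum."""
        self.total = [t + x for t, x in zip(self.total, hv)]
        self.count += 1

    def prototype(self):
        """Threshold the running sum into a bipolar vector (ties go to +1)."""
        return [1 if t >= 0 else -1 for t in self.total]

proto = RunningPrototype(dim=4)
for hv in ([1, 1, -1, -1], [1, -1, -1, 1], [1, 1, -1, 1]):
    proto.add(hv)

print(proto.total)        # [3, 1, -3, 1]
print(proto.prototype())  # [1, 1, -1, 1]
```

Because only the sum and a count are stored, adding an example costs one pass over the vector and never requires revisiting earlier training data.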
Complete Code¶
Here’s the full classifier in one place:
import numpy as np
from holovec import VSA
from holovec.encoders import NGramEncoder
from collections import Counter
# Setup
np.random.seed(42)
model = VSA.create('FHRR', dim=10000, seed=42)
# Training data
training_data = [
("This movie was excellent and entertaining", "positive"),
("I loved this film it was amazing", "positive"),
("Great acting and wonderful story", "positive"),
("Best movie I have seen this year", "positive"),
("Fantastic film highly recommended", "positive"),
("Brilliant performance truly outstanding", "positive"),
("This movie was terrible and boring", "negative"),
("I hated this film it was awful", "negative"),
("Poor acting and weak story", "negative"),
("Worst movie I have seen this year", "negative"),
("Horrible film do not recommend", "negative"),
("Terrible performance very disappointing", "negative"),
]
# Preprocessing
def preprocess(text):
    words = text.lower().split()
    words = [w.strip('.,!?;:()[]"\'') for w in words]
    return [w for w in words if w]
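# Build vocabulary
all_words = []
for text, _ in training_data:
    all_words.extend(preprocess(text))
vocabulary = list(set(all_words))
# Seed from a stable checksum: built-in hash() of strings varies between runs
import zlib
word_hvs = {word: model.random(seed=zlib.crc32(word.encode("utf-8")) % 100000)
            for word in vocabulary}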
# Create encoder
encoder = NGramEncoder(model, item_to_hv=word_hvs, n=2, mode='bundle')
# Encode training data
encoded_examples = []
for text, label in training_data:
    words = preprocess(text)
    hv = encoder.encode(words)
    encoded_examples.append((hv, label))
# Build class prototypes
class_hvs = {}
for label in ['positive', 'negative']:
    hvs = [hv for hv, lbl in encoded_examples if lbl == label]
    class_hvs[label] = model.bundle(hvs)
# Classifier
def classify(text):
    words = preprocess(text)
    known_words = [w for w in words if w in word_hvs]
    if not known_words:
        return None, 0.0
    test_hv = encoder.encode(known_words)
    best_label = None
    best_sim = float('-inf')
    for label, prototype in class_hvs.items():
        sim = float(model.similarity(test_hv, prototype))
        if sim > best_sim:
            best_sim = sim
            best_label = label
    return best_label, best_sim
# Test
test_text = "This film was amazing and wonderful"
label, confidence = classify(test_text)
print(f"Text: '{test_text}'")
print(f"Predicted: {label} (confidence: {confidence:.3f})")
Best Practices Summary¶
Model Selection:
Use FHRR or HRR for text classification
10,000 dimensions for medium vocabularies (<1000 words)
20,000+ dimensions for large vocabularies (>1000 words)
Encoding:
Start with bigrams (n=2)
Use trigrams (n=3) if you have enough data
Consider combining multiple n-gram sizes
Training:
Need 20-50 examples minimum per class
More examples = better prototypes
Balanced classes help (equal positive/negative)
Evaluation:
Always test on held-out data
Check confidence scores for uncertainty
Analyze failure cases to improve
Production Deployment:
Save prototypes (word_hvs, class_hvs)
Preprocess consistently
Handle unknown words gracefully
Set confidence thresholds
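Saving and restoring the two dictionaries could be done with Python's pickle module, as in the sketch below. The functions save_classifier and load_classifier are hypothetical, the demo dictionaries are toy stand-ins, and whether HoloVec hypervectors pickle cleanly depends on the backend, so test this with your own setup. Only unpickle files you trust, since loading a pickle can execute arbitrary code.

```python
import pickle
import tempfile
from pathlib import Path

def save_classifier(path, word_hvs, class_hvs):
    """Write word hypervectors and class prototypes to a single file."""
    with open(path, "wb") as f:
        pickle.dump({"word_hvs": word_hvs, "class_hvs": class_hvs}, f)

def load_classifier(path):
    """Read back the (word_hvs, class_hvs) pair written by save_classifier."""
    with open(path, "rb") as f:
        state = pickle.load(f)
    return state["word_hvs"], state["class_hvs"]

# Toy stand-ins for the real dictionaries built in the tutorial
word_hvs_demo = {"good": [1, -1, 1], "bad": [-1, 1, -1]}
class_hvs_demo = {"positive": [1, -1, 1], "negative": [-1, 1, -1]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "classifier.pkl"
    save_classifier(path, word_hvs_demo, class_hvs_demo)
    restored_words, restored_classes = load_classifier(path)

print(restored_words == word_hvs_demo and restored_classes == class_hvs_demo)  # True
```

Saving word_hvs alongside the prototypes matters because new text must be encoded with exactly the same word vectors that were used during training.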
Common Issues and Solutions¶
Problem: Low accuracy (< 60%)
Solutions:
Add more training examples
Check class balance
Try different n-gram sizes
Ensure good preprocessing
Problem: High confidence on wrong predictions
Solutions:
Classes may be too similar
Need more distinctive training examples
Try larger dimension
Problem: Unknown word errors
Solutions:
Add <UNK> token to vocabulary
Filter rare words before encoding
Use more training data to expand vocabulary
Problem: Slow classification
Solutions:
Use MAP or BSC model (faster)
Reduce vocabulary size
Use PyTorch backend with GPU
Next Steps¶
Explore more:
Document Classification with N-grams - Extended text classification example
Encoding Data - Deep dive on encoders
Choosing a VSA Model - Model selection guide
Try these datasets:
IMDB reviews (50K examples)
20 Newsgroups (18K documents)
AG News (120K articles)
Advanced topics:
Multi-class classification (>2 classes)
Hierarchical classification
Online learning (update prototypes dynamically)
Ensemble methods (combine multiple encoders)
Conclusion¶
You’ve built a complete text classifier using hyperdimensional computing!
Key takeaways:
HDC provides a simple, fast approach to text classification
No complex optimization or gradient descent needed
Prototypes capture class characteristics through bundling
Classification is just similarity matching
Easy to extend and adapt to new data
Advantages of HDC for text:
Fast training (single pass)
Incremental learning
Interpretable similarity scores
Few hyperparameters
Works well with limited data
The same principles apply to many other classification tasks - try applying this to your own text data!