Can Cadbury Make AI Mediocre Again? A Scientific Analysis
The new Cadbury 5Star ad claims that the company has set up a server farm that spits out nonsensical websites and content.
The aim is to mislead AI models by corrupting their training data. The idea rests on the fact that AI models are trained on huge amounts of data scraped from the internet.
Now, of course it is a marketing campaign, but let’s explore the feasibility of this tactic.
The Feasibility of Cadbury Making AI Mediocre With Server Farm(s)
On a very basic level, companies like Anthropic, OpenAI, etc. don’t train on raw internet data. The data undergoes extensive validation checks — often involving humans.
But there are also sophisticated, automated techniques for filtering out gibberish content.
One such technique is the perplexity check:
“Perplexity is a mathematical measure of how well a language model predicts a sample of text. In the context of data filtering, perplexity checks help identify nonsensical or machine-generated text by measuring how ‘surprised’ a reference model is by the content.”
# Example thresholds (values are illustrative)
THRESHOLDS = {
    'natural_text': 50,
    'technical_text': 100,
    'creative_writing': 75,
    'suspicious_threshold': 200
}

def filter_content(text, domain='natural_text'):
    # calculate_perplexity is assumed to score the text against a
    # reference language model; its implementation is omitted here
    perplexity_score = calculate_perplexity(text)
    if perplexity_score > THRESHOLDS[domain]:
        return 'filtered_out'
    return 'accepted'
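The calculate_perplexity helper above is left abstract. As a hedged illustration, here is a toy version that scores text against a smoothed unigram “reference model” built from a tiny made-up corpus. Real pipelines use large neural language models, but the principle is the same: perplexity is the exponential of the average negative log-probability per token.

```python
import math
from collections import Counter

def unigram_model(corpus_tokens):
    # Toy "reference model": smoothed unigram probabilities from a tiny corpus
    counts = Counter(corpus_tokens)
    total, vocab = sum(counts.values()), len(counts)
    # Laplace smoothing so unseen tokens still get a small probability
    return lambda tok: (counts[tok] + 1) / (total + vocab + 1)

def perplexity(text, model):
    tokens = text.lower().split()
    # Perplexity = exp(average negative log-probability per token)
    nll = -sum(math.log(model(t)) for t in tokens) / len(tokens)
    return math.exp(nll)

corpus = "the cat sat on the mat the dog sat on the rug".split()
model = unigram_model(corpus)

natural = perplexity("the cat sat on the mat", model)
gibberish = perplexity("zxqv blorp wugs frumious", model)
print(natural < gibberish)  # True: gibberish "surprises" the model far more
```

A filtering pipeline would then drop anything whose score exceeds the domain threshold, exactly as in the filter_content sketch.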
Thus, the Cadbury campaign’s strategy of generating nonsensical content would likely fail — perplexity checks would easily identify and filter out such content before it ever reached the training data. Modern AI training pipelines use perplexity as just one of many quality control measures, making it very difficult to deliberately introduce low-quality data into the training process.
AI companies also use data source filtering. Data is often weighted based on source reliability.
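As a sketch of what source weighting could look like (the tiers and weights below are invented for illustration), documents from low-reputation domains can simply be sampled far less often when assembling a training mix:

```python
import random

# Hypothetical reliability tiers; the weights are illustrative, not real
SOURCE_WEIGHTS = {
    'curated_reference': 5.0,   # vetted corpora, encyclopedias
    'established_site': 2.0,
    'unknown_domain': 0.2,      # newly seen, unverified sources
}

def sample_training_docs(docs, k, seed=0):
    # Weighted sampling: reliable sources are oversampled, while unknown
    # domains contribute proportionally little to the training mix
    rng = random.Random(seed)
    weights = [SOURCE_WEIGHTS[d['source']] for d in docs]
    return rng.choices(docs, weights=weights, k=k)

docs = [
    {'text': 'vetted article', 'source': 'curated_reference'},
    {'text': 'blog post', 'source': 'established_site'},
    {'text': 'farm page', 'source': 'unknown_domain'},
]
batch = sample_training_docs(docs, k=1000)
farm_share = sum(d['source'] == 'unknown_domain' for d in batch) / len(batch)
print(f"Share of unknown-domain docs in the batch: {farm_share:.1%}")
```

With these weights, an unknown domain ends up contributing only a few percent of the batch even before any content-quality checks run.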
From a scalability point of view, modern language models train on hundreds of terabytes or even petabytes of data. A single server farm would need enormous resources to produce enough text to make a meaningful dent, so the signal-to-noise ratio alone renders its random content statistically insignificant.
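A quick back-of-envelope calculation makes the point. All figures below are assumptions chosen for illustration, not real numbers from any company:

```python
# All figures below are illustrative assumptions, not real numbers
corpus_bytes = 500e12        # assume a 500 TB curated training corpus
pages_per_day = 100_000      # assumed output of the server farm
bytes_per_page = 5_000       # roughly 5 KB of text per page
days = 365

farm_bytes = pages_per_day * bytes_per_page * days
share = farm_bytes / corpus_bytes
print(f"Farm output after a year: {farm_bytes / 1e12:.2f} TB, "
      f"or {share:.4%} of the corpus")
```

Even a farm churning out 100,000 pages a day for a whole year would account for well under a tenth of a percent of such a corpus, before any filtering is applied at all.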
Then comes cross-validation. Training pipelines do not generally trust a single source for a specific piece of information: cross-validation across multiple data sources ensures consistency, and content clustering helps identify and exclude outlier data.
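A minimal sketch of such a consistency check, assuming a hypothetical pipeline that collects the values different sources report for the same claim:

```python
from collections import Counter

def cross_validate_claim(reported_values, min_sources=3, agreement=0.8):
    # Accept a claim only when enough independent sources report it
    # and a large majority of them agree on the same value
    if len(reported_values) < min_sources:
        return None  # too few sources to trust anything
    value, count = Counter(reported_values).most_common(1)[0]
    return value if count / len(reported_values) >= agreement else None

# Five sources agree on the moon-landing year; one farm page disagrees
consensus = cross_validate_claim(["1969", "1969", "1969", "1969", "1969", "2024"])
print(consensus)  # 1969

# A value pushed only by a couple of farm mirrors never reaches consensus
rejected = cross_validate_claim(["2024", "2024"])
print(rejected)  # None
```

To poison such a check, a farm would have to outnumber and out-diversify every legitimate source of a fact, which circles back to the scalability problem above.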
Server Isolation Checking and Content Pattern Identification
It’s relatively easy for big AI companies to detect how isolated a server is from the rest of the internet. If a server is largely unlinked, chances are it is part of a content farm. Similarly, content pattern checks can reach the same conclusion.
Look at this code:
import re
from collections import defaultdict
from datetime import datetime, timedelta
from difflib import SequenceMatcher
from typing import Dict, Tuple
from urllib.parse import urlparse

import networkx as nx
import numpy as np

class ContentFarmDetector:
    def __init__(self):
        self.web_graph = nx.Graph()
        self.content_patterns = defaultdict(list)
        self.domain_stats = defaultdict(dict)
        self.thresholds = {
            'min_backlinks': 5,
            'min_unique_domains': 3,
            'max_content_similarity': 0.8,
            'suspicious_growth_rate': 100,  # pages per day
            'min_domain_age_days': 30,
            'max_template_similarity': 0.7
        }

    def analyze_domain_network(self, domain: str) -> Dict[str, float]:
        """Analyze a domain's position and relationships in the web graph."""
        if domain not in self.web_graph:
            return {
                'centrality': 0.0,
                'clustering': 0.0,
                'backlink_diversity': 0.0
            }

        # Get all neighbors (linked domains)
        neighbors = set(self.web_graph.neighbors(domain))

        # Calculate network metrics
        centrality = nx.degree_centrality(self.web_graph)[domain]
        clustering = nx.clustering(self.web_graph, domain)

        # Analyze backlink diversity; fall back to the raw string when a
        # neighbor is a bare domain (urlparse only fills netloc for full URLs)
        backlink_domains = set()
        for neighbor in neighbors:
            parsed = urlparse(neighbor)
            backlink_domains.add(parsed.netloc or neighbor)
        backlink_diversity = len(backlink_domains) / max(len(neighbors), 1)

        return {
            'centrality': centrality,
            'clustering': clustering,
            'backlink_diversity': backlink_diversity
        }

    def analyze_content_patterns(self, domain: str, content: str,
                                 store: bool = True) -> Dict[str, float]:
        """Detect templated or machine-generated content patterns."""

        def extract_structure(text: str) -> str:
            # Remove actual words but keep punctuation and structure
            return re.sub(r'\w+', 'WORD', text)

        def calculate_similarity(text1: str, text2: str) -> float:
            # Compare structural skeletons with SequenceMatcher
            struct1 = extract_structure(text1)
            struct2 = extract_structure(text2)
            return SequenceMatcher(None, struct1, struct2).ratio()

        if store:
            # Store the new content pattern (skipped when re-analyzing
            # already-stored content, to avoid duplicate entries)
            self.content_patterns[domain].append({
                'content': content,
                'structure': extract_structure(content),
                'timestamp': datetime.now()
            })

        # Look at the last 50 pieces
        recent_patterns = self.content_patterns[domain][-50:]

        # Calculate average pairwise structural similarity
        similarities = []
        for i in range(len(recent_patterns)):
            for j in range(i + 1, len(recent_patterns)):
                similarities.append(calculate_similarity(
                    recent_patterns[i]['content'],
                    recent_patterns[j]['content']
                ))
        avg_similarity = float(np.mean(similarities)) if similarities else 0.0

        return {
            'template_similarity': avg_similarity,
            'pattern_count': len(set(p['structure'] for p in recent_patterns))
        }

    def analyze_growth_patterns(self, domain: str) -> Dict[str, float]:
        """Analyze suspicious growth patterns."""
        patterns = self.content_patterns[domain]
        if not patterns:
            return {'growth_rate': 0.0, 'consistency_score': 0.0}

        # Content creation rate in pages per day
        timestamps = [p['timestamp'] for p in patterns]
        time_range = max(timestamps) - min(timestamps)
        days = time_range.total_seconds() / (24 * 3600) or 1
        growth_rate = len(patterns) / days

        # Low variation between posting intervals suggests mechanical scheduling
        time_diffs = np.diff([ts.timestamp() for ts in sorted(timestamps)])
        consistency_score = float(np.std(time_diffs)) if len(time_diffs) > 0 else 0.0

        return {
            'growth_rate': growth_rate,
            'consistency_score': consistency_score
        }

    def is_content_farm(self, domain: str) -> Tuple[bool, Dict[str, float]]:
        """Make the final determination: is this domain likely a content farm?"""
        # Collect all metrics
        network_metrics = self.analyze_domain_network(domain)
        latest_content = (self.content_patterns[domain][-1]['content']
                          if self.content_patterns[domain] else "")
        content_metrics = self.analyze_content_patterns(domain, latest_content,
                                                        store=False)
        growth_metrics = self.analyze_growth_patterns(domain)

        # Combine all metrics
        scores = {**network_metrics, **content_metrics, **growth_metrics}

        # Decision logic: flag the domain if any strong signal fires
        is_farm = (
            network_metrics['backlink_diversity'] < 0.3 or
            network_metrics['centrality'] < 0.1 or
            content_metrics['template_similarity'] > self.thresholds['max_template_similarity'] or
            growth_metrics['growth_rate'] > self.thresholds['suspicious_growth_rate']
        )
        return is_farm, scores

# Example usage
def main():
    detector = ContentFarmDetector()

    # Add some example domains and connections
    real_domain = "legitimate-site.com"
    farm_domain = "content-farm-example.com"

    # Add legitimate domain with good connections
    detector.web_graph.add_edge(real_domain, "reference1.com")
    detector.web_graph.add_edge(real_domain, "reference2.com")
    detector.web_graph.add_edge(real_domain, "reference3.com")

    # Add suspected content farm with minimal connections
    detector.web_graph.add_edge(farm_domain, "farm-mirror1.com")

    # Simulate some content patterns
    legit_samples = [
        "Breaking news: markets rallied today.",
        "A detailed review of the latest research, with commentary!",
        "Why does this matter? An analysis, in three parts.",
        "Quick update."
    ]
    for i in range(10):
        # Legitimate content with structural variation, spread over a month
        sample = legit_samples[i % len(legit_samples)]
        detector.content_patterns[real_domain].append({
            'content': sample,
            'structure': re.sub(r'\w+', 'WORD', sample),
            'timestamp': datetime.now() - timedelta(days=int(np.random.randint(30)))
        })
        # Farm content with high similarity, generated within the last hour
        detector.content_patterns[farm_domain].append({
            'content': f"Generated content number {i}",
            'structure': "WORD WORD WORD WORD",
            'timestamp': datetime.now() - timedelta(minutes=int(np.random.randint(60)))
        })

    # Check both domains
    for domain in [real_domain, farm_domain]:
        is_farm, scores = detector.is_content_farm(domain)
        print(f"\nAnalysis for {domain}:")
        print(f"Content Farm Detection: {'YES' if is_farm else 'NO'}")
        print("Scores:")
        for metric, score in scores.items():
            print(f"{metric}: {score:.3f}")

if __name__ == "__main__":
    main()
This is simplified code that demonstrates how content farms can be identified through several key mechanisms:
Network Analysis
- Measures domain centrality and clustering in the web graph
- Analyzes backlink diversity (how many unique domains link to the content)
- Identifies isolated clusters of domains that mainly link to each other
Content Pattern Detection
- Extracts structural patterns from content while ignoring actual words
- Measures similarity between content pieces to detect templated generation
- Tracks pattern diversity over time
Growth Pattern Analysis
- Monitors content creation rate
- Checks for suspiciously consistent posting patterns
- Identifies rapid content generation that would be unlikely for human writers
Red Flags for Content Farms:
- Low backlink diversity (few unique domains linking to the content)
- Low network centrality (isolated from the main web)
- High template similarity in content
- Suspiciously rapid or consistent content generation
- Young domain age with massive content volume
The system uses a combination of these signals to make a final determination. For example, a legitimate site might have:
- Organic linking patterns from diverse sources
- Natural variations in content structure
- Reasonable growth rate
- Mixed posting patterns
While a content farm might show:
- Links mainly from related domains
- Very similar content structures
- Extremely rapid content generation
- Mechanically consistent posting patterns
So no, Cadbury 5Star does not have the capability to corrupt AI. But yes, the campaign is clever and thought-provoking.
Here’s the irony: The entire article is written by an AI with some intervention from my end!
Click here to connect with me on LinkedIn