Can Cadbury Make AI Mediocre Again? A Scientific Analysis
The new Cadbury 5Star ad claims that the company has set up a server farm that spits out nonsensical websites and content.
The aim is to mislead AI models by corrupting their training data. The idea rests on the fact that AI models are trained on huge amounts of data scraped from the internet.
Now, of course it is a marketing campaign, but let’s explore the feasibility of this tactic.
The Feasibility of Cadbury Making AI Mediocre With Server Farm(s)
On a very basic level, companies like Anthropic, OpenAI, etc. don’t train on raw internet data. The data undergoes extensive validation checks — often involving humans.
But there are also sophisticated, automated techniques for filtering out gibberish content.
One such technique is the perplexity check:
“Perplexity is a mathematical measure of how well a language model predicts a sample of text. In the context of data filtering, perplexity checks help identify nonsensical or machine-generated text by measuring how ‘surprised’ a reference model is by the content.”
# Example thresholds (values are illustrative)
THRESHOLDS = {
    'natural_text': 50,
    'technical_text': 100,
    'creative_writing': 75,
    'suspicious_threshold': 200
}

def filter_content(text, domain='natural_text'):
    # calculate_perplexity is assumed to score the text against a
    # reference language model; its implementation is omitted here
    perplexity_score = calculate_perplexity(text)
    if perplexity_score > THRESHOLDS[domain]:
        return 'filtered_out'
    return 'accepted'
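The calculate_perplexity helper above is left abstract. As a hedged illustration, here is a toy version that scores text against a smoothed unigram “reference model” built from a tiny made-up corpus. Real pipelines use large neural language models, but the principle is the same: perplexity is the exponential of the average negative log-probability per token.

```python
import math
from collections import Counter

def unigram_model(corpus_tokens):
    # Toy "reference model": smoothed unigram probabilities from a tiny corpus
    counts = Counter(corpus_tokens)
    total, vocab = sum(counts.values()), len(counts)
    # Laplace smoothing so unseen tokens still get a small probability
    return lambda tok: (counts[tok] + 1) / (total + vocab + 1)

def perplexity(text, model):
    tokens = text.lower().split()
    # Perplexity = exp(average negative log-probability per token)
    nll = -sum(math.log(model(t)) for t in tokens) / len(tokens)
    return math.exp(nll)

corpus = "the cat sat on the mat the dog sat on the rug".split()
model = unigram_model(corpus)

natural = perplexity("the cat sat on the mat", model)
gibberish = perplexity("zxqv blorp wugs frumious", model)
print(natural < gibberish)  # True: gibberish "surprises" the model far more
```

A filtering pipeline would then drop anything whose score exceeds the domain threshold, exactly as in the filter_content sketch.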
Thus, the Cadbury campaign’s strategy of generating nonsensical content would likely fail — perplexity checks would easily identify and filter out such content before it ever reached the training data. Modern AI training pipelines use perplexity as just one of many quality control measures, making it very difficult to deliberately introduce low-quality data into the training process.
AI companies also use data source filtering. Data is often weighted based on source reliability.
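As a sketch of what source weighting could look like (the tiers and weights below are invented for illustration), documents from low-reputation domains can simply be sampled far less often when assembling a training mix:

```python
import random

# Hypothetical reliability tiers; the weights are illustrative, not real
SOURCE_WEIGHTS = {
    'curated_reference': 5.0,   # vetted corpora, encyclopedias
    'established_site': 2.0,
    'unknown_domain': 0.2,      # newly seen, unverified sources
}

def sample_training_docs(docs, k, seed=0):
    # Weighted sampling: reliable sources are oversampled, while unknown
    # domains contribute proportionally little to the training mix
    rng = random.Random(seed)
    weights = [SOURCE_WEIGHTS[d['source']] for d in docs]
    return rng.choices(docs, weights=weights, k=k)

docs = [
    {'text': 'vetted article', 'source': 'curated_reference'},
    {'text': 'blog post', 'source': 'established_site'},
    {'text': 'farm page', 'source': 'unknown_domain'},
]
batch = sample_training_docs(docs, k=1000)
farm_share = sum(d['source'] == 'unknown_domain' for d in batch) / len(batch)
print(f"Share of unknown-domain docs in the batch: {farm_share:.1%}")
```

With these weights, an unknown domain ends up contributing only a few percent of the batch even before any content-quality checks run.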
From a scalability point of view, modern language models train on hundreds of terabytes or even petabytes of data. A single server farm would need enormous resources to produce enough text to make a meaningful dent, so the signal-to-noise ratio alone renders its random content statistically insignificant.
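A quick back-of-envelope calculation makes the point. All figures below are assumptions chosen for illustration, not real numbers from any company:

```python
# All figures below are illustrative assumptions, not real numbers
corpus_bytes = 500e12        # assume a 500 TB curated training corpus
pages_per_day = 100_000      # assumed output of the server farm
bytes_per_page = 5_000       # roughly 5 KB of text per page
days = 365

farm_bytes = pages_per_day * bytes_per_page * days
share = farm_bytes / corpus_bytes
print(f"Farm output after a year: {farm_bytes / 1e12:.2f} TB, "
      f"or {share:.4%} of the corpus")
```

Even a farm churning out 100,000 pages a day for a whole year would account for well under a tenth of a percent of such a corpus, before any filtering is applied at all.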
Then comes cross-validation. Training pipelines do not generally trust a single source for a specific piece of information: cross-validation across multiple data sources ensures consistency, and content clustering helps identify and exclude outlier data.
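A minimal sketch of such a consistency check, assuming a hypothetical pipeline that collects the values different sources report for the same claim:

```python
from collections import Counter

def cross_validate_claim(reported_values, min_sources=3, agreement=0.8):
    # Accept a claim only when enough independent sources report it
    # and a large majority of them agree on the same value
    if len(reported_values) < min_sources:
        return None  # too few sources to trust anything
    value, count = Counter(reported_values).most_common(1)[0]
    return value if count / len(reported_values) >= agreement else None

# Five sources agree on the moon-landing year; one farm page disagrees
consensus = cross_validate_claim(["1969", "1969", "1969", "1969", "1969", "2024"])
print(consensus)  # 1969

# A value pushed only by a couple of farm mirrors never reaches consensus
rejected = cross_validate_claim(["2024", "2024"])
print(rejected)  # None
```

To poison such a check, a farm would have to outnumber and out-diversify every legitimate source of a fact, which circles back to the scalability problem above.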
Server Isolation Checking and Content Pattern Identification
It’s relatively easy for big AI companies to detect how isolated a server is from the rest of the internet. If a server is largely unlinked, chances are it is part of a content farm. Similarly, content pattern checks can reach the same conclusion.
Look at this code:
import re
from collections import defaultdict
from datetime import datetime, timedelta
from difflib import SequenceMatcher
from typing import Dict, Tuple
from urllib.parse import urlparse

import networkx as nx
import numpy as np

class ContentFarmDetector:
    def __init__(self):
        self.web_graph = nx.Graph()
        self.content_patterns = defaultdict(list)
        self.domain_stats = defaultdict(dict)
        self.thresholds = {
            'min_backlinks': 5,
            'min_unique_domains': 3,
            'max_content_similarity': 0.8,
            'suspicious_growth_rate': 100,  # pages per day
            'min_domain_age_days': 30,
            'max_template_similarity': 0.7
        }

    def analyze_domain_network(self, domain: str) -> Dict[str, float]:
        """Analyze a domain's position and relationships in the web graph."""
        if domain not in self.web_graph:
            return {
                'centrality': 0.0,
                'clustering': 0.0,
                'backlink_diversity': 0.0
            }

        # Get all neighbors (linked domains)
        neighbors = set(self.web_graph.neighbors(domain))

        # Calculate network metrics
        centrality = nx.degree_centrality(self.web_graph)[domain]
        clustering = nx.clustering(self.web_graph, domain)

        # Analyze backlink diversity; fall back to the raw string when a
        # neighbor is a bare domain (urlparse only fills netloc for full URLs)
        backlink_domains = set()
        for neighbor in neighbors:
            parsed = urlparse(neighbor)
            backlink_domains.add(parsed.netloc or neighbor)
        backlink_diversity = len(backlink_domains) / max(len(neighbors), 1)

        return {
            'centrality': centrality,
            'clustering': clustering,
            'backlink_diversity': backlink_diversity
        }

    def analyze_content_patterns(self, domain: str, content: str,
                                 store: bool = True) -> Dict[str, float]:
        """Detect templated or machine-generated content patterns."""

        def extract_structure(text: str) -> str:
            # Remove actual words but keep punctuation and structure
            return re.sub(r'\w+', 'WORD', text)

        def calculate_similarity(text1: str, text2: str) -> float:
            # Compare structural skeletons with SequenceMatcher
            struct1 = extract_structure(text1)
            struct2 = extract_structure(text2)
            return SequenceMatcher(None, struct1, struct2).ratio()

        if store:
            # Store the new content pattern (skipped when re-analyzing
            # already-stored content, to avoid duplicate entries)
            self.content_patterns[domain].append({
                'content': content,
                'structure': extract_structure(content),
                'timestamp': datetime.now()
            })

        # Look at the last 50 pieces
        recent_patterns = self.content_patterns[domain][-50:]

        # Calculate average pairwise structural similarity
        similarities = []
        for i in range(len(recent_patterns)):
            for j in range(i + 1, len(recent_patterns)):
                similarities.append(calculate_similarity(
                    recent_patterns[i]['content'],
                    recent_patterns[j]['content']
                ))
        avg_similarity = float(np.mean(similarities)) if similarities else 0.0

        return {
            'template_similarity': avg_similarity,
            'pattern_count': len(set(p['structure'] for p in recent_patterns))
        }

    def analyze_growth_patterns(self, domain: str) -> Dict[str, float]:
        """Analyze suspicious growth patterns."""
        patterns = self.content_patterns[domain]
        if not patterns:
            return {'growth_rate': 0.0, 'consistency_score': 0.0}

        # Content creation rate in pages per day
        timestamps = [p['timestamp'] for p in patterns]
        time_range = max(timestamps) - min(timestamps)
        days = time_range.total_seconds() / (24 * 3600) or 1
        growth_rate = len(patterns) / days

        # Low variation between posting intervals suggests mechanical scheduling
        time_diffs = np.diff([ts.timestamp() for ts in sorted(timestamps)])
        consistency_score = float(np.std(time_diffs)) if len(time_diffs) > 0 else 0.0

        return {
            'growth_rate': growth_rate,
            'consistency_score': consistency_score
        }

    def is_content_farm(self, domain: str) -> Tuple[bool, Dict[str, float]]:
        """Make the final determination: is this domain likely a content farm?"""
        # Collect all metrics
        network_metrics = self.analyze_domain_network(domain)
        latest_content = (self.content_patterns[domain][-1]['content']
                          if self.content_patterns[domain] else "")
        content_metrics = self.analyze_content_patterns(domain, latest_content,
                                                        store=False)
        growth_metrics = self.analyze_growth_patterns(domain)

        # Combine all metrics
        scores = {**network_metrics, **content_metrics, **growth_metrics}

        # Decision logic: flag the domain if any strong signal fires
        is_farm = (
            network_metrics['backlink_diversity'] < 0.3 or
            network_metrics['centrality'] < 0.1 or
            content_metrics['template_similarity'] > self.thresholds['max_template_similarity'] or
            growth_metrics['growth_rate'] > self.thresholds['suspicious_growth_rate']
        )
        return is_farm, scores

# Example usage
def main():
    detector = ContentFarmDetector()

    # Add some example domains and connections
    real_domain = "legitimate-site.com"
    farm_domain = "content-farm-example.com"

    # Add legitimate domain with good connections
    detector.web_graph.add_edge(real_domain, "reference1.com")
    detector.web_graph.add_edge(real_domain, "reference2.com")
    detector.web_graph.add_edge(real_domain, "reference3.com")

    # Add suspected content farm with minimal connections
    detector.web_graph.add_edge(farm_domain, "farm-mirror1.com")

    # Simulate some content patterns
    legit_samples = [
        "Breaking news: markets rallied today.",
        "A detailed review of the latest research, with commentary!",
        "Why does this matter? An analysis, in three parts.",
        "Quick update."
    ]
    for i in range(10):
        # Legitimate content with structural variation, spread over a month
        sample = legit_samples[i % len(legit_samples)]
        detector.content_patterns[real_domain].append({
            'content': sample,
            'structure': re.sub(r'\w+', 'WORD', sample),
            'timestamp': datetime.now() - timedelta(days=int(np.random.randint(30)))
        })
        # Farm content with high similarity, generated within the last hour
        detector.content_patterns[farm_domain].append({
            'content': f"Generated content number {i}",
            'structure': "WORD WORD WORD WORD",
            'timestamp': datetime.now() - timedelta(minutes=int(np.random.randint(60)))
        })

    # Check both domains
    for domain in [real_domain, farm_domain]:
        is_farm, scores = detector.is_content_farm(domain)
        print(f"\nAnalysis for {domain}:")
        print(f"Content Farm Detection: {'YES' if is_farm else 'NO'}")
        print("Scores:")
        for metric, score in scores.items():
            print(f"{metric}: {score:.3f}")

if __name__ == "__main__":
    main()
This is simplified code that demonstrates how content farms can be identified through several key mechanisms:
Network Analysis
- Measures domain centrality and clustering in the web graph
- Analyzes backlink diversity (how many unique domains link to the content)
- Identifies isolated clusters of domains that mainly link to each other
Content Pattern Detection
- Extracts structural patterns from content while ignoring actual words
- Measures similarity between content pieces to detect templated generation
- Tracks pattern diversity over time
Growth Pattern Analysis
- Monitors content creation rate
- Checks for suspiciously consistent posting patterns
- Identifies rapid content generation that would be unlikely for human writers
Red Flags for Content Farms:
- Low backlink diversity (few unique domains linking to the content)
- Low network centrality (isolated from the main web)
- High template similarity in content
- Suspiciously rapid or consistent content generation
- Young domain age with massive content volume
The system uses a combination of these signals to make a final determination. For example, a legitimate site might have:
- Organic linking patterns from diverse sources
- Natural variations in content structure
- Reasonable growth rate
- Mixed posting patterns
While a content farm might show:
- Links mainly from related domains
- Very similar content structures
- Extremely rapid content generation
- Mechanically consistent posting patterns
So no, Cadbury 5Star does not have the capability to corrupt AI. But yes, the campaign is clever and thought-provoking.
Here’s the irony: The entire article is written by an AI with some intervention from my end!
Click here to connect with me on LinkedIn