content-similarity-checker

Compare document similarity using TF-IDF, cosine similarity, and Jaccard index. Use for plagiarism detection, duplicate finding, or content matching.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "content-similarity-checker" with this command: npx skills add dkyazzentwatwa/chatgpt-skills/dkyazzentwatwa-chatgpt-skills-content-similarity-checker

Content Similarity Checker

Compare documents and text for similarity using multiple algorithms.

Features

  • Cosine Similarity: TF-IDF based comparison
  • Jaccard Similarity: Set-based comparison
  • Levenshtein Distance: Edit distance for short texts
  • Batch Comparison: Compare multiple documents
  • Similarity Matrix: Pairwise comparison of all documents
  • Reports: Detailed similarity reports

Quick Start

from similarity_checker import SimilarityChecker

checker = SimilarityChecker()

# Compare two texts
score = checker.compare(
    "The quick brown fox jumps over the lazy dog",
    "A fast brown fox leaps over a sleepy dog"
)
print(f"Similarity: {score:.2%}")

# Compare documents
score = checker.compare_files("doc1.txt", "doc2.txt")

CLI Usage

# Compare two texts
python similarity_checker.py --text1 "Hello world" --text2 "Hello there world"

# Compare two files
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt

# Compare all files in folder
python similarity_checker.py --folder ./documents/ --output matrix.csv

# Use specific algorithm
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --method jaccard

# Find similar documents (threshold)
python similarity_checker.py --folder ./documents/ --threshold 0.7

# JSON output
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --json

API Reference

SimilarityChecker Class

class SimilarityChecker:
    def __init__(self, method: str = "cosine")

    # Text comparison
    def compare(self, text1: str, text2: str) -> float
    def compare_files(self, file1: str, file2: str) -> float

    # Multiple algorithms
    def compare_all_methods(self, text1: str, text2: str) -> dict

    # Batch comparison
    def compare_to_corpus(self, text: str, corpus: list) -> list
    def similarity_matrix(self, documents: list) -> pd.DataFrame
    def find_duplicates(self, documents: list, threshold: float = 0.8) -> list

    # Folder operations
    def compare_folder(self, folder: str, threshold: float = None) -> dict
    def find_most_similar(self, text: str, folder: str, top_n: int = 5) -> list

    # Report
    def generate_report(self, output: str) -> str

Similarity Methods

Cosine Similarity (Default)

Best for comparing documents of different lengths:

checker = SimilarityChecker(method="cosine")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0

Jaccard Similarity

Good for comparing sets of words/tokens:

checker = SimilarityChecker(method="jaccard")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0

Levenshtein (Edit Distance)

Best for short texts, typo detection:

checker = SimilarityChecker(method="levenshtein")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0 (normalized)

TF-IDF + Cosine

Advanced: considers term importance:

checker = SimilarityChecker(method="tfidf")
score = checker.compare(text1, text2)

Batch Comparison

Compare to Corpus

checker = SimilarityChecker()

target = "Machine learning is a subset of artificial intelligence."
corpus = [
    "AI includes machine learning and deep learning.",
    "Python is a programming language.",
    "Neural networks power deep learning systems."
]

results = checker.compare_to_corpus(target, corpus)

# Returns:
[
    {"index": 0, "similarity": 0.65, "text": "AI includes..."},
    {"index": 2, "similarity": 0.42, "text": "Neural networks..."},
    {"index": 1, "similarity": 0.12, "text": "Python is..."}
]

Similarity Matrix

documents = [
    "Document one content...",
    "Document two content...",
    "Document three content..."
]

matrix = checker.similarity_matrix(documents)

# Returns DataFrame:
#          doc_0    doc_1    doc_2
# doc_0    1.000    0.750    0.320
# doc_1    0.750    1.000    0.410
# doc_2    0.320    0.410    1.000

Find Duplicates

documents = [...]  # List of texts

duplicates = checker.find_duplicates(documents, threshold=0.85)

# Returns:
[
    {"doc1_index": 0, "doc2_index": 3, "similarity": 0.92},
    {"doc1_index": 2, "doc2_index": 7, "similarity": 0.88}
]

Compare All Methods

Get similarity scores from all algorithms:

checker = SimilarityChecker()
results = checker.compare_all_methods(text1, text2)

# Returns:
{
    "cosine": 0.82,
    "jaccard": 0.65,
    "levenshtein": 0.71,
    "tfidf": 0.78,
    "average": 0.74
}

Folder Operations

Compare All Files in Folder

checker = SimilarityChecker()
results = checker.compare_folder("./documents/")

# Returns:
{
    "files": ["doc1.txt", "doc2.txt", "doc3.txt"],
    "comparisons": 3,
    "similar_pairs": [
        {"file1": "doc1.txt", "file2": "doc3.txt", "similarity": 0.87}
    ],
    "matrix": <DataFrame>
}

Find Most Similar to Query

query = "Your search text here..."
results = checker.find_most_similar(query, "./documents/", top_n=5)

# Returns:
[
    {"file": "doc3.txt", "similarity": 0.89},
    {"file": "doc1.txt", "similarity": 0.72},
    ...
]

Output Format

Comparison Result

result = checker.compare_with_details(text1, text2)

# Returns:
{
    "similarity": 0.82,
    "method": "cosine",
    "text1_length": 150,
    "text2_length": 180,
    "common_words": 25,
    "unique_words_text1": 10,
    "unique_words_text2": 15,
    "interpretation": "High similarity - likely related content"
}

Example Workflows

Plagiarism Check

checker = SimilarityChecker()

submission = open("student_paper.txt").read()
results = checker.compare_folder("./source_materials/")

suspicious = [p for p in results["similar_pairs"] if p["similarity"] > 0.6]

if suspicious:
    print(f"Warning: Found {len(suspicious)} potentially similar sources")
    for p in suspicious:
        print(f"  {p['file1']} matches {p['file2']}: {p['similarity']:.0%}")

Document Deduplication

checker = SimilarityChecker()

# Load all documents
docs = {}
for file in Path("./articles/").glob("*.txt"):
    docs[file.name] = file.read_text()

# Find near-duplicates
duplicates = checker.find_duplicates(list(docs.values()), threshold=0.9)

print(f"Found {len(duplicates)} duplicate pairs")

Content Matching

checker = SimilarityChecker()

query = "Best practices for Python web development"
results = checker.find_most_similar(query, "./blog_posts/", top_n=10)

print("Most relevant articles:")
for r in results:
    print(f"  {r['file']}: {r['similarity']:.0%} match")

Dependencies

  • scikit-learn>=1.3.0
  • nltk>=3.8.0
  • numpy>=1.24.0
  • pandas>=2.0.0

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

ocr-document-processor

No summary provided by upstream source.

Repository SourceNeeds Review
General

text-summarizer

No summary provided by upstream source.

Repository SourceNeeds Review
General

document-converter-suite

No summary provided by upstream source.

Repository SourceNeeds Review
General

financial-calculator

No summary provided by upstream source.

Repository SourceNeeds Review