NotebookLM Knowledge Base Organizer
Prepares files for optimal use in NotebookLM by intelligently selecting and consolidating sources, converting formats, organizing structure, and ensuring compatibility. The primary constraint is NotebookLM's 50-source limit per notebook. When collections exceed this limit, systematic scoring, prioritization, and strategic merging reduce source count without losing valuable information.
When to Use This Skill
- You have 50+ files and need to optimize for NotebookLM's limit
- Preparing documents for a new NotebookLM notebook
- Converting a messy folder into NotebookLM-ready sources
- Files are in unsupported formats (PPTX, XLSX, complex PDFs)
- Documents exceed 500k words or 200MB per file
- Building a knowledge base for research, projects, or learning
- Large document collections (100-300 files) need intelligent prioritization
What This Skill Does
- Scores and Prioritizes Sources (when >50 detected) using Relevance, Recency, Uniqueness, and Information Density (0-40 scale)
- Strategic Merging via time-series (daily to monthly), topic-based (related papers to comprehensive guides), and format consolidation (slides + transcript to unified PDF)
- Converts to Supported Formats (PPTX to PDF, XLSX to CSV, scanned to OCR)
- Applies Flat Structure with descriptive snake_case naming
- Removes Duplicates across formats
- Splits Large Files exceeding 500k words into parts
- Optimizes for RAG with smaller, focused documents for better retrieval
NotebookLM Supported Formats
Supported:
- PDF (text-selectable, not scanned images)
- Google Docs, Sheets (<100k tokens), Slides (<100 slides)
- Microsoft Word (.docx)
- Text files (.txt, .md)
- Images (PNG, JPEG, TIFF, WEBP)
- Audio (MP3, WAV, AAC, OGG with clear speech)
- URLs (websites, YouTube, Google Drive links)
- Copy-pasted text
Convert These:
- PPTX to PDF
- XLSX to CSV or Google Sheets
- Scanned PDFs to OCR text-selectable PDF
- Large Sheets to CSV (<100k tokens)
File Limits
Per Source:
- 500,000 words max
- 200MB file size max
- No page limit (word limit matters)
Per Notebook (Free):
- 50 sources maximum -- HARD LIMIT
- 100 notebooks total
Prefer many smaller, focused documents over few large ones for better RAG retrieval. The 50-source limit is the primary optimization constraint.
IMPORTANT: Preserve original file timestamps during all operations. Timestamps
are essential for understanding latest additions, recent meeting minutes, and
key decisions. Use touch -r original converted after conversions. Include
dates in ISO order (year, month, day) in all filenames, written with
underscores (YYYY_MM_DD) to match the snake_case naming pattern.
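As a minimal sketch of putting the date into the filename, the helpers below derive it from the file's modification time. Both function names are hypothetical, and the sketch assumes GNU coreutils, where `date -r FILE` reads a file's mtime (BSD/macOS `date -r` expects seconds instead).

```shell
# Print a file's modification date as YYYY_MM_DD.
# Assumption: GNU coreutils `date`, where -r takes a file path.
mtime_stamp() {
    date -r "$1" +%Y_%m_%d
}

# Hypothetical helper: print the file's base name with the mtime date
# inserted before the extension (the file itself is not renamed).
with_date_suffix() {
    name=$(basename "$1")
    stamp=$(mtime_stamp "$1")
    case $name in
        *.*) printf '%s_%s.%s\n' "${name%.*}" "$stamp" "${name##*.}" ;;
        *)   printf '%s_%s\n' "$name" "$stamp" ;;
    esac
}
```

A rename would then look like mv "$f" "$(dirname "$f")/$(with_date_suffix "$f")"; mv keeps the modification time, so the date in the name and the timestamp stay in agreement.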
How to Use
Prepare these files for NotebookLM - convert formats and organize with descriptive names
Convert all PPTX and XLSX files to NotebookLM-compatible formats
Check if any files exceed NotebookLM's 500k word or 200MB limits
Organize this research folder for a NotebookLM knowledge base
Find duplicate content across different file formats
Split this large PDF into NotebookLM-compatible chunks
Instructions
When a user requests NotebookLM organization, follow these steps.
Step 1: Assess and Prioritize Sources
Count and evaluate before proceeding with any organization.
total_sources=$(find . -type f \( -name "*.pdf" -o -name "*.docx" -o -name "*.txt" -o -name "*.md" -o -name "*.csv" \) | wc -l)
echo "Total sources found: $total_sources"
If total exceeds 50:
- Score all sources using the 4-dimension rubric (Relevance, Recency, Uniqueness, Density, each 0-10). See references/scoring-system.md for the full rubric, assessment commands, and batch scoring script.
- Rank and select top candidates using the decision matrix. Target 35-40 auto-keep sources initially. See references/prioritization-strategy.md for the selection process and space-based adjustments.
- Identify merge candidates -- find time-series patterns, topic clusters, and multi-format duplicates:
# Time-series opportunities
find . -name "*_20[0-9][0-9]_[0-9][0-9]_*" | \
  sed 's/_20[0-9][0-9]_[0-9][0-9]_[0-9][0-9]//' | sort | uniq -c | sort -rn
# Topic clusters
find . -type f -name "*.pdf" | xargs -I {} basename {} .pdf | \
  sed 's/_part_[0-9]*//;s/_[0-9][0-9]*$//' | sort | uniq -c | sort -rn | awk '$1 > 2'
- Execute strategic merges using appropriate patterns. See references/merging-strategies.md for time-series, topic-based, and format consolidation scripts. Preserve timestamps on all merged outputs.
- Recount and validate the final total is at or below 50 (ideally 48 to reserve slots for future additions).
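The scoring and ranking steps above can be sketched as a sum over the four 0-10 dimensions followed by a sort. This is not the batch script from references/scoring-system.md; the tab-separated input layout and the function name rank_sources are assumptions made for illustration.

```shell
# rank_sources SCORES_TSV
# Assumed input columns (tab-separated, hypothetical layout):
#   filename  relevance  recency  uniqueness  density   (each 0-10)
# Output: "total<TAB>filename", highest total (0-40) first,
# ties broken alphabetically by filename.
rank_sources() {
    awk -F '\t' '{ printf "%d\t%s\n", $2 + $3 + $4 + $5, $1 }' "$1" |
        sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

Piping the result through head -n 40 gives a first cut of auto-keep candidates; the remaining rows are where merge candidates are most likely to be found.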
Step 2: Understand the Scope
Ask clarifying questions:
- What is the topic/purpose of this knowledge base?
- Which directory contains the source materials?
- Target: single notebook or multiple related notebooks?
- Any files that must stay in original format?
- Is this for research, learning, project documentation, or reference?
Step 3: Analyze Current State
Review files for NotebookLM compatibility:
find . -type f -exec file {} \;
find . -type f -exec du -h {} \; | sort -rh
find . -type f | sed 's/.*\.//' | sort | uniq -c | sort -rn
for f in *.pdf; do pdftotext "$f" - | wc -w; done
Categorize findings:
- Compatible as-is: PDF, DOCX, TXT, MD, images
- Needs conversion: PPTX, XLSX, XLS, PPT, scanned PDFs
- Too large: Files >500k words or >200MB
- Duplicates: Same content in different formats
- Merge candidates: Sources identified for consolidation in Step 1
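A first pass at the categories above can be made from file extensions alone. The sketch below assumes a hypothetical helper named classify_source; it cannot detect scanned PDFs, oversized files, or duplicates, which still need the checks shown earlier in this step.

```shell
# classify_source FILENAME
# Prints "compatible", "convert", or "review" based on the extension only.
classify_source() {
    # Lower-case the extension so .PDF and .pdf are treated alike.
    ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
    case $ext in
        pdf|docx|txt|md|csv)          echo "compatible" ;;
        png|jpg|jpeg|tiff|webp)       echo "compatible" ;;
        pptx|ppt|xlsx|xls)            echo "convert" ;;
        *)                            echo "review" ;;   # unknown: check by hand
    esac
}
```

For a quick summary of a folder, something like find . -type f | while IFS= read -r f; do classify_source "$f"; done | sort | uniq -c prints a count per category.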
Step 4: Convert Unsupported Formats
PowerPoint to PDF:
soffice --headless --convert-to pdf *.pptx
touch -r original.pptx converted.pdf # Preserve timestamp
Excel to CSV:
soffice --headless --convert-to csv:"Text - txt - csv (StarCalc)":44,34,UTF8 *.xlsx
touch -r original.xlsx converted.csv # Preserve timestamp
Scanned PDF to Searchable:
ocrmypdf input.pdf output_searchable.pdf
touch -r input.pdf output_searchable.pdf # Preserve timestamp
pdftotext output_searchable.pdf - | wc -w # Verify text extraction
WARNING: Always run touch -r original converted after every conversion to preserve the original file timestamp.
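Because the batch commands above convert many files at once, the touch -r step has to be repeated per file. One way to make that hard to forget is a small wrapper that runs any converter and then copies the timestamp; the name convert_keep_mtime is hypothetical and this is only a sketch.

```shell
# convert_keep_mtime ORIGINAL OUTPUT COMMAND [ARGS...]
# Runs COMMAND, checks that OUTPUT now exists, then copies ORIGINAL's
# modification time onto OUTPUT with touch -r.
convert_keep_mtime() {
    original=$1
    output=$2
    shift 2
    "$@" || return 1                 # stop if the conversion itself fails
    [ -f "$output" ] || return 1     # the converter must have produced OUTPUT
    touch -r "$original" "$output"
}

# Example (commented out so the block has no side effects when sourced):
# for f in *.pptx; do
#     convert_keep_mtime "$f" "${f%.pptx}.pdf" \
#         soffice --headless --convert-to pdf "$f"
# done
```

The wrapper returns non-zero when the converter fails or produces no output, so a loop can log the failure instead of silently stamping a missing file.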
Step 5: Apply Naming
Use this pattern: category_topic_descriptor_YYYY_MM_DD.ext
Examples:
- research_quantum_computing_basics_2025.pdf
- meeting_notes_project_kickoff_2026_01_15.txt
- client_proposal_acme_corp_final.docx
- reference_api_documentation_v2.md
- data_sales_figures_q4_2025.csv
See references/organization-scripts.md for the automated naming script. Preserve timestamps when renaming: use mv (preserves by default) and verify with stat.
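The snake_case part of the naming rule can be sketched as a small normalizer. This is not the script from references/organization-scripts.md; the function name to_snake_case is hypothetical, and it only cleans up an existing name (it does not choose a category or add a date).

```shell
# to_snake_case FILENAME
# Lower-cases the name, collapses every run of non-alphanumeric characters
# into one underscore, trims underscores at the ends, keeps the extension.
to_snake_case() {
    name=$1
    case $name in
        *.*) stem=${name%.*}; ext=.${name##*.} ;;
        *)   stem=$name;      ext= ;;
    esac
    stem=$(printf '%s' "$stem" | tr '[:upper:]' '[:lower:]' |
        LC_ALL=C sed -e 's/[^a-z0-9]\{1,\}/_/g' -e 's/^_//' -e 's/_$//')
    ext=$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')
    printf '%s%s\n' "$stem" "$ext"
}
```

For example, to_snake_case "Meeting-Notes 2026-01-15.txt" prints meeting_notes_2026_01_15.txt, which already matches the date layout used in the pattern above.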
Step 6: Split Large Documents
For files >500k words or >200MB:
pdftotext document.pdf - | wc -w # Check word count
pdftk large.pdf cat 1-500 output large_part_1.pdf
pdftk large.pdf cat 501-1000 output large_part_2.pdf
touch -r large.pdf large_part_1.pdf large_part_2.pdf # Preserve timestamps
Name parts by content, not arbitrary numbers:
- annual_report_2025_part_1_executive_summary.pdf
- annual_report_2025_part_2_financials.pdf
- annual_report_2025_part_3_appendices.pdf
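When a document has no clear topic boundaries and fixed-size parts are the only option, the page ranges passed to pdftk can be generated rather than typed by hand. The sketch below assumes a hypothetical helper named page_ranges; it only prints ranges and does not touch any PDF.

```shell
# page_ranges TOTAL_PAGES CHUNK_SIZE
# Prints one "start-end" range per line, covering pages 1..TOTAL_PAGES.
page_ranges() {
    total=$1
    chunk=$2
    start=1
    while [ "$start" -le "$total" ]; do
        end=$((start + chunk - 1))
        if [ "$end" -gt "$total" ]; then
            end=$total               # last part may be shorter
        fi
        echo "${start}-${end}"
        start=$((end + 1))
    done
}

# Example (commented out): feed each range to pdftk, then restore the mtime.
# n=1
# page_ranges 1200 500 | while read -r r; do
#     pdftk large.pdf cat "$r" output "large_part_$n.pdf"
#     touch -r large.pdf "large_part_$n.pdf"
#     n=$((n + 1))
# done
```

Parts produced this way still benefit from being renamed by content afterwards, as in the annual report example.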
Step 7: Consolidation Pass
Perform strategic merging to optimize source count. This step is critical when merge candidates were identified in Step 1 or the collection is near the 50-source limit.
Merging is a primary optimization strategy, not a last resort. Three patterns apply:
- Time-series: Combine chronological documents into period summaries (daily to monthly, weekly to quarterly)
- Topic-based: Combine related papers/docs into comprehensive guides with chapter markers
- Format consolidation: Combine slides + transcript + notes for the same event into a single PDF
See references/merging-strategies.md for full merge patterns, scripts (time-series merger, topic-based PDF merger), decision trees, and quality checks.
IMPORTANT: Preserve chronological timestamps in merged content. Add clear date headers within merged files so temporal context is not lost.
Log all merge decisions for inclusion in the organization plan.
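The time-series pattern can be sketched for plain-text notes as follows. This is a minimal illustration, not the merger from references/merging-strategies.md: the function name merge_monthly is hypothetical, and it assumes daily files named PREFIX_YYYY_MM_DD.txt and an output directory that starts empty (it appends, so a second run would duplicate content).

```shell
# merge_monthly SRC_DIR PREFIX OUT_DIR
# Appends each PREFIX_YYYY_MM_DD.txt to OUT_DIR/PREFIX_YYYY_MM.txt,
# preceded by a date header so the temporal context survives the merge.
merge_monthly() {
    src=$1; prefix=$2; out=$3
    mkdir -p "$out"
    # Glob results are sorted, so files within a month arrive in date order.
    for f in "$src/${prefix}"_[0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9].txt; do
        [ -f "$f" ] || continue
        stamp=${f%.txt}
        stamp=${stamp#"$src/${prefix}_"}     # 2026_01_05
        month=${stamp%_*}                    # 2026_01
        target="$out/${prefix}_${month}.txt"
        {
            printf '===== %s =====\n' "$(printf '%s' "$stamp" | tr '_' '-')"
            cat "$f"
            printf '\n'
        } >> "$target"
        # The merged file ends up with the mtime of the last daily file appended.
        touch -r "$f" "$target"
    done
}
```

Calling merge_monthly ./notes standup ./merged would turn a month of daily standup files into one source per month, each section introduced by its date.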
Step 8: Implement Flat Structure
NotebookLM works best with flat source lists, no nested folders.
Before:
docs/
project/
planning/
requirements.pdf
research/
background.pdf
reference/
api_docs.pdf
After:
notebooklm_sources/
project_requirements_2026.pdf
project_background_research.pdf
reference_api_documentation.pdf
See references/organization-scripts.md for the implementation script. Preserve timestamps when copying: use cp -p to maintain original dates.
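As a rough sketch of the flattening step, the helper below copies every file into one folder and folds the directory path into the filename. It is not the script from references/organization-scripts.md; the name flatten_sources is hypothetical, the destination is assumed to live outside the source tree, and the generated names are mechanical (they would still need the descriptive renaming shown in the After listing).

```shell
# flatten_sources SRC_DIR DEST_DIR
# Copies SRC_DIR/a/b/file.pdf to DEST_DIR/a_b_file.pdf, keeping timestamps.
flatten_sources() {
    src=${1%/}; dest=$2
    mkdir -p "$dest"
    find "$src" -type f | while IFS= read -r path; do
        rel=${path#"$src"/}                       # project/planning/requirements.pdf
        flat=$(printf '%s' "$rel" | tr '/' '_')   # project_planning_requirements.pdf
        if [ -e "$dest/$flat" ]; then
            echo "skip (name collision): $rel" >&2
            continue
        fi
        cp -p "$path" "$dest/$flat"               # -p keeps the original mtime
    done
}
```

Copying rather than moving leaves the original tree intact, so the result can be reviewed before anything is deleted.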
Step 9: Find and Remove Duplicates
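# Exact duplicates: print groups of files sharing a hash (md5sum is GNU; on macOS use md5 -r)
find . -type f -exec md5sum {} + | sort | awk '{n[$1]++; g[$1] = g[$1] $0 "\n"} END {for (h in n) if (n[h] > 1) printf "%s", g[h]}'
# Same base name in different formats (e.g. report.pdf and report.docx)
find . -type f -exec basename {} \; | sed 's/\.[^.]*$//' | sort | uniq -d
# Same extracted text in differently named PDFs: identical hashes sort next to each other
for pdf in *.pdf; do printf '%s  %s\n' "$(pdftotext "$pdf" - | md5sum | cut -d' ' -f1)" "$pdf"; done | sort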
Decision matrix:
- Same content, different formats: keep PDF (best for NotebookLM)
- Same content, different names: keep most descriptive name
- Slight variations: merge into single document if <500k words
- Truly duplicate: delete older version (check timestamps first)
Step 10: Optimize for RAG
NotebookLM uses RAG, which works best with focused documents:
- Split 100-page documents into 3-5 topic-focused files
- Separate chapters/sections into individual sources
- Keep each source focused on one topic/subtopic
- Prefer 20-50 pages per PDF over 200+ page megadocs
Instead of:
company_handbook_500_pages.pdf
Create:
handbook_code_of_conduct.pdf
handbook_benefits_overview.pdf
handbook_time_off_policy.pdf
handbook_remote_work_guidelines.pdf
handbook_career_development.pdf
Step 11: Propose Organization Plan
Present a plan to the user before making changes. The plan should cover current state, source selection strategy (if >50 sources), proposed structure, changes to make, and a compatibility check.
See references/organization-plan-template.md for the full template with sections for prioritization results, merge decisions, and final source count verification.
Step 12: Execute Organization
After user approval, execute all conversions, merges, renames, and structural changes. Log all operations.
See references/organization-scripts.md for the complete execution script with logging and limit verification. Run touch -r after every file operation to preserve original timestamps.
Step 13: Provide Upload Instructions
Provide the user with a summary of organized sources and upload instructions for NotebookLM (direct upload and Google Drive options).
See references/upload-guide.md for the full upload instructions template including maintenance guidance.
Examples
Example 1: Research Paper Collection
User: "Prepare my PhD research papers folder for NotebookLM"
Process:
- Finds 35 PDFs, 12 DOCX, 8 PPTX across nested folders
- Converts 8 PPTX to PDF (preserves timestamps)
- Identifies 2 papers >500k words, splits into parts
- Renames: smith_2024.pdf to research_quantum_entanglement_smith_2024.pdf
- Creates flat structure in phd_research_sources/
- Result: 48 sources ready for upload
Example 2: Company Knowledge Base
User: "Convert our company wiki exports to NotebookLM format"
Split single 145-page PDF by section into 7 focused sources:
- company_overview_history_mission.pdf (8 pages)
- company_policies_hr_guidelines.pdf (28 pages)
- company_product_documentation.pdf (45 pages)
- (4 more topic-focused files)
Result: 7 focused sources instead of 1 large doc. Better RAG retrieval.
Example 3: Excel Data
User: "I have 10 Excel files with research data"
Convert each sheet to separate CSV. Name descriptively: data_survey_responses_2025.csv. Create overview doc: data_overview_methodology.txt. Preserve timestamps on all conversions.
Result: 10 XLSX to 23 CSV files + 1 overview doc.
Example 4: Conference Materials
User: "Organize my conference materials for a knowledge base"
Input: 12 MP3 recordings, 8 PPTX decks, 15 JPG notes, 5 PDFs. Keep MP3 as-is (NotebookLM transcribes on upload). Convert PPTX to PDF. Keep JPGs (NotebookLM reads handwriting via OCR). Apply naming: conf_session_title_speaker_date.ext. Preserve all timestamps.
Result: 40 sources in flat folder.
Example 5: Large Collection (200+ Sources)
For a complete workflow handling 200+ sources (e.g., reducing 237 sources to 48 with strategic merging), see references/large-collection-workflow.md.
Common Patterns
Academic Research
research_[topic]_[author]_[year].pdf
notes_[course]_[topic]_[date].md
textbook_[subject]_chapter_[n]_[title].pdf
Business Projects
project_[name]_requirements.pdf
project_[name]_timeline.csv
meeting_[project]_[date]_notes.txt
client_[name]_proposal_final.docx
Learning/Courses
course_[name]_lecture_[n]_[topic].pdf
course_[name]_readings_week_[n].pdf
course_[name]_assignment_[n].docx
Personal Knowledge Base
article_[topic]_[author]_[date].pdf
book_notes_[title]_[author].md
tutorial_[skill]_[topic].pdf
reference_[tool]_documentation.pdf
Pro Tips
- Optimize for Search: Use descriptive names with search keywords. Good: tutorial_python_async_programming_advanced.pdf. Bad: tutorial_5.pdf.
- Topic-Based Splitting: Split large docs by topic, not arbitrary page count. Good: handbook_benefits.pdf, handbook_policies.pdf. Bad: handbook_part_1.pdf, handbook_part_2.pdf.
- Date Formatting: Use ISO date order (year, month, day) for sortability, written with underscores to match snake_case. Good: meeting_notes_2026_02_04.txt. Bad: meeting_notes_feb_4_2026.txt.
- Preserve Source Timestamps: Always maintain original file creation/modification dates. These enable accurate recency scoring and help NotebookLM's RAG weight recent meeting notes, decisions, and additions appropriately. Use touch -r original converted after every conversion.
- Extract Text from Scans: Scanned PDFs do not work in NotebookLM. Test with pdftotext test.pdf - | head. If blank, run ocrmypdf input.pdf output.pdf.
- Use Prefixes for Ordering: Add numeric prefixes for logical ordering: 01_project_overview.pdf, 02_project_requirements.pdf.
- Test Before Bulk Upload: Upload 2-3 files first to verify processing, summaries, and search accuracy. Then upload the rest.
Best Practices Summary
Source Selection and Optimization:
- Always assess total source count first before organizing
- Use scoring rubric for objective prioritization (>50 sources)
- Merge strategically as primary optimization, not last resort
- Prefer quality over quantity: 48 great sources over 50 mediocre ones
- Reserve 2-3 slots for future additions
- Do not merge high-value unique sources (score 35+)
- Do not combine unrelated topics just to hit limits
File Naming:
- Descriptive snake_case with searchable terms and ISO dates
- Keep under 100 characters, no spaces or special characters
- Use dates instead of version numbers
Format Selection:
- PDF for presentations and mixed content
- CSV for spreadsheet data
- DOCX/TXT/MD for text documents
- Always convert PPTX and XLSX before upload
Timestamp Preservation:
- Run touch -r original converted after every conversion
- Use cp -p when copying files to preserve modification dates
- Include ISO dates in filenames for explicit temporal context
- Timestamps drive recency scoring and RAG relevance weighting
Organization Structure:
- Flat structure (one folder, all files)
- Descriptive names include folder context
- Stay under 50 sources per notebook
Implementation Checklist
Phase 1: Assessment and Prioritization
- Identify target notebook topic/purpose
- Locate all source files and count total
- If >50: run scoring rubric for all sources
- If >50: identify and execute strategic merges
- If >50: select top sources using decision matrix (target 48)
- Check file formats, note conversions needed
- Estimate word counts for large files
Phase 2: Conversion and Organization
- Convert unsupported formats (preserve timestamps)
- Apply descriptive snake_case naming
- Split large documents by topic
- Remove duplicates
- Create flat output directory
- Verify all files <200MB and <500k words
- Verify final source count is at or below 50
- Verify timestamps preserved on all converted/moved files
Phase 3: Upload and Verification
- Document selection strategy in organization plan
- Test upload 2-3 files
- Upload remaining sources
- Verify NotebookLM processing and summaries
- Test search functionality
- Confirm all key topics covered despite any source reduction
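The count and size checks from Phase 2 can be sketched as a single verification function. The name verify_sources is hypothetical, 200MB is assumed here to mean 200 x 1024 x 1024 bytes, and the 500k-word check is left out because it depends on a text extractor such as pdftotext.

```shell
# verify_sources DIR [MAX_SOURCES]
# Reports the source count and any file over 200MB.
# Returns non-zero if either limit is exceeded.
verify_sources() {
    dir=$1
    max=${2:-50}
    status=0
    count=$(find "$dir" -type f | wc -l | tr -d ' ')
    echo "sources: $count (limit $max)"
    if [ "$count" -gt "$max" ]; then
        echo "FAIL: over the source limit by $((count - max))"
        status=1
    fi
    # Assumption: 200MB = 209715200 bytes; -size +Nc compares in bytes.
    oversize=$(find "$dir" -type f -size +209715200c)
    if [ -n "$oversize" ]; then
        echo "FAIL: files over 200MB:"
        echo "$oversize"
        status=1
    fi
    return "$status"
}
```

Running verify_sources notebooklm_sources 48 instead of the default applies the stricter target that keeps a couple of slots free for future additions.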