Project Sharing and Output Preparation

Expert guidance for preparing project outputs for sharing with collaborators, reviewers, or repositories. Creates organized packages at different sharing levels while preserving your working directory.

When to Use This Skill

Sharing analysis results with collaborators
Preparing supplementary materials for publications
Creating reproducible research packages
Archiving completed projects
Handoff to other researchers
Submitting to data repositories

Core Principles

Work on copies - Never modify the working directory
Choose appropriate level - Match sharing depth to audience needs
Document everything - Include clear guides and metadata
Clean before sharing - Remove debug code, clear outputs, anonymize if needed
Make it reproducible - Include dependencies and instructions
⚠️ CRITICAL: After creating sharing folder, all future work happens in the main project directory, NOT in the sharing folder - Sharing folders are read-only snapshots

Three Sharing Levels

Level 1: Summary Only

Purpose: Quick sharing for presentations, reports, or high-level review

What to include:

PDF export of final notebook(s)
Final data/results (CSV, Excel, figures) - optional
Brief README

Use when:

Sharing results with non-technical stakeholders
Presentations or talks
Quick review without reproduction needs
Space/time constraints

Structure:

shared-summary/ ├── README.md # Brief overview ├── analysis-YYYY-MM-DD.pdf # Notebook as PDF └── results/ ├── figures/ │ ├── fig1-main-result.png │ └── fig2-comparison.png └── tables/ └── summary-statistics.csv

Level 2: Reproducible

Purpose: Enable others to reproduce your analysis from processed data

What to include:

Analysis notebooks (.ipynb) - cleaned
Scripts for figure generation
Processed/analysis-ready data
Requirements file (requirements.txt or environment.yml)
Detailed README with instructions

Use when:

Sharing with collaborating researchers
Peer review / manuscript supplementary materials
Teaching or tutorials
Standard collaboration needs

Structure:

For standard project structures, see the folder-organization skill. Reproducible packages should include:

Processed data (in data/processed/ )
Cleaned notebooks (in notebooks/ ) with outputs cleared
Scripts (in scripts/ )
Environment specification (environment.yml or requirements.txt )
Documentation (README.md , MANIFEST.md )

shared-reproducible/ ├── README.md # Setup and reproduction instructions ├── MANIFEST.md # File descriptions ├── environment.yml # Conda environment OR requirements.txt ├── notebooks/ # Cleaned notebooks ├── scripts/ # Standalone scripts └── data/ └── processed/ # Analysis-ready data

Level 3: Full Traceability

Purpose: Complete transparency from raw data through all processing steps

What to include:

Starting/raw data
All processing scripts and notebooks
All intermediate files
Final results
Complete documentation
Full dependency specification

Use when:

Archiving for future reference
Regulatory compliance
High-stakes reproducibility (clinical, policy)
Data repository submission (Zenodo, Dryad, etc.)
Complete project handoff

Structure:

For standard project structures, see the folder-organization skill. Full traceability packages should include complete data hierarchy:

shared-complete/ ├── README.md # Complete project guide ├── MANIFEST.md # Comprehensive file listing ├── environment.yml ├── data/ │ ├── raw/ # Original, unmodified data │ ├── intermediate/ # Processing steps │ └── processed/ # Final analysis-ready ├── scripts/ # All processing scripts ├── notebooks/ # All notebooks (exploratory + final) ├── results/ # All outputs │ ├── figures/ │ ├── tables/ │ └── supplementary/ └── documentation/ # Complete documentation ├── methods.md ├── changelog.md └── data-dictionary.md

Preparation Workflow

Step 1: Ask User for Sharing Level

Questions to determine level:

Which sharing level do you need?

Summary Only - PDF + final results (quick sharing)
Reproducible - Notebooks + scripts + data (standard sharing)
Full Traceability - Everything from raw data (archival/compliance)

Additional questions:

Who is the audience? (colleagues, reviewers, public)
Are there size constraints?
Any sensitive data to handle?
Timeline for sharing?

Step 2: Identify Files to Include

For each level, identify:

Level 1 - Summary:

Main analysis notebook(s)
Key figures (publication-quality)
Summary tables/statistics
Optional: Final processed dataset

Level 2 - Reproducible:

All analysis notebooks (not exploratory)
Figure generation scripts
Processed/cleaned data
Environment specification
Any utility functions/modules

Level 3 - Full:

Raw data (or links if too large)
All processing scripts
All notebooks (including exploratory)
All intermediate files
Complete documentation

Step 3: Create Sharing Directory

Create dated directory

SHARE_DIR="shared-$(date +%Y%m%d)-[level]" mkdir -p "$SHARE_DIR"

Create subdirectories based on level

... appropriate structure from above

Step 4: Copy and Clean Files

For notebooks (.ipynb):

import nbformat from nbconvert.preprocessors import ClearOutputPreprocessor

def clean_notebook(input_path, output_path): """Clean notebook: clear outputs, remove debug cells."""

# Read notebook
with open(input_path, 'r') as f:
    nb = nbformat.read(f, as_version=4)

# Clear outputs
clear_output = ClearOutputPreprocessor()
nb, _ = clear_output.preprocess(nb, {})

# Remove cells tagged as 'debug' or 'remove'
nb.cells = [cell for cell in nb.cells
            if 'debug' not in cell.metadata.get('tags', [])
            and 'remove' not in cell.metadata.get('tags', [])]

# Write cleaned notebook
with open(output_path, 'w') as f:
    nbformat.write(nb, f)

For data files:

Copy as-is for small files
Consider compression for large files
Check for sensitive information

For scripts:

Remove debugging code
Add docstrings if missing
Ensure paths are relative

Step 4.5: CRITICAL - Verify and Fix File Paths

Problem: Notebooks and scripts with broken file paths will fail when shared. Common issues:

Absolute paths instead of relative paths
Files reorganized after paths were written
Data files moved to different subdirectories
Paths reference parent directories no longer in package

For complete path verification procedures, automated checking scripts, and correction patterns, see the folder-organization skill.

Quick Verification

Key things to check before sharing:

❌ Breaks when shared ✅ Works when shared

/Users/yourname/project/data.csv

data/data.csv

C:\Users\yourname\project\fig.png

figures/fig.png

/absolute/path/to/results/

results/

Image('/Users/you/notebook.png')

Image('figures/notebook.png')

Quick check commands:

Check for absolute paths in notebooks

grep -l "/Users/" *.ipynb grep -l "C:\\" *.ipynb

Search for file references that may need updating

grep -r ".json|.csv|.png" --include=".ipynb" --include=".py"

Testing Before Sharing

Copy to temp location and test

cp -r project /tmp/test-project cd /tmp/test-project

Try to open notebook and check for errors

jupyter nbconvert --to html notebook.ipynb 2>&1 | grep -i error

Benefits of Automatic Verification

Catches broken paths before distribution
Suggests correct paths automatically
Finds missing files in package
Prevents "file not found" errors for recipients
Professional, polished packages

When to Run

Always run as part of share-project command (Step 5.5)
After any file reorganization
Before final package distribution
After moving notebooks to sharing directory

Step 5: Generate Documentation

README.md Template

Project: [Project Name]

Date: YYYY-MM-DD Author: [Your Name] Sharing Level: [Summary/Reproducible/Full]

Overview

Brief description of the project and analysis.

See MANIFEST.md for detailed file descriptions.

Requirements

[For Reproducible/Full levels]

Python 3.X
See environment.yml for dependencies

Setup

```bash

Create environment

conda env create -f environment.yml conda activate project-name ```

Reproduction Steps

[For Reproducible/Full levels]

[Description of first step] ```bash jupyter notebook notebooks/01-analysis.ipynb ```
[Description of second step]

Data Sources

[For Full level]

Dataset A: [Source, download date, version]
Dataset B: [Source, download date, version]

Contact

[Your email or preferred contact]

License

[If applicable - e.g., CC BY 4.0, MIT]

MANIFEST.md Template

File Manifest

Generated: YYYY-MM-DD

Directory Structure

``` shared-YYYYMMDD/ ├── README.md - Project overview and setup ├── MANIFEST.md - This file [... complete tree ...] ```

File Descriptions

Notebooks

`notebooks/01-data-processing.ipynb` - Initial data loading and cleaning
`notebooks/02-analysis.ipynb` - Main statistical analysis
`notebooks/03-visualization.ipynb` - Figure generation for publication

Data

`data/processed/cleaned_data.csv` - Quality-controlled dataset (N=XXX samples)
- Columns: [list key columns]
- Missing values handled by [method]

Scripts

`scripts/generate_figures.py` - Automated figure generation
- Usage: `python generate_figures.py --input data/processed/cleaned_data.csv`

Results

`results/figures/fig1-main.png` - Main result showing [description]
`results/tables/summary_stats.csv` - Descriptive statistics

[Continue for all files...]

Step 6: Handle Sensitive Data

Check for sensitive information:

Personal identifiable information (PII)
Access credentials (API keys, passwords)
Proprietary data
Institutional data with sharing restrictions
Patient/subject identifiers

Strategies:

Anonymize - Remove or hash identifiers
Exclude - Don't include sensitive files
Aggregate - Share summary statistics only
Document restrictions - Note what's excluded and why

Example anonymization:

import hashlib

def anonymize_ids(df, id_column='subject_id'): """Replace IDs with hashed values.""" df[id_column] = df[id_column].apply( lambda x: hashlib.sha256(str(x).encode()).hexdigest()[:8] ) return df

Step 7: Package and Compress

For smaller packages (<100MB):

Create zip archive

zip -r shared-YYYYMMDD.zip shared-YYYYMMDD/

For larger packages:

Create tar.gz (better compression)

tar -czf shared-YYYYMMDD.tar.gz shared-YYYYMMDD/

Or split into parts if very large

tar -czf - shared-YYYYMMDD/ | split -b 1G - shared-YYYYMMDD.tar.gz.part

Document package contents:

Total size
Number of files
Compression method
How to extract

Step 8: Return to Working Directory

⚠️ IMPORTANT: After creating the sharing package, always work in the main project directory.

The sharing folder is a snapshot for distribution only. Any future development, analysis, or modifications should happen in your original working directory, not in the shared-*/ folder.

Claude should:

Change directory back to main project: cd .. (if needed)
Confirm working directory: pwd
Continue all work in the original project location
Treat sharing folders as read-only archives

Example:

After creating sharing package

cd /path/to/main/project # Return to working directory pwd # Verify location

Continue work here, NOT in shared-YYYYMMDD/

Streamlining Notebooks for Sharing

When preparing Jupyter notebooks for sharing packages, remove verbose content while preserving essential analysis:

Content to Remove

Historical information:

"Comparison to previous analysis" sections
"Earlier version" notes
Revision history and change logs
Update summaries

Excessive methodological detail:

Extended reconciliation discussions
Overly detailed interpretations (>2000 chars)
Redundant technical notes
Multiple paragraphs explaining the same concept

Verbose analysis sections:

Repetitive explanations
Extended discussions of minor points
Implementation details better suited for code comments

Content to Keep

Essential analysis:

Figure displays and captions
Statistical results
Key findings summaries

Core documentation:

Methods sections
Conclusions
Data source descriptions
Quality metrics

Reproduction information:

Setup instructions
Script execution order
Environment requirements

Automated Streamlining Script

import nbformat import re

def should_remove_cell(cell): """Identify verbose cells to remove.""" if cell.cell_type != 'markdown': return False

source = cell.source.lower()

# Patterns indicating verbose content
verbose_patterns = [
    'comparison to.*previous',
    'reconciliation with',
    'why focus on',
    'methodological.*note.*filtering',
    'interpretation of the significant difference',
]

for pattern in verbose_patterns:
    if re.search(pattern, source):
        return True

# Remove overly long cells without figures
if len(source) > 2000 and 'figure' not in source.lower():
    return True

return False

def streamline_notebook(input_path, output_path): """Remove verbose content from notebook.""" with open(input_path, 'r') as f: nb = nbformat.read(f, as_version=4)

# Filter out verbose cells
nb.cells = [cell for cell in nb.cells if not should_remove_cell(cell)]

with open(output_path, 'w') as f:
    nbformat.write(nb, f)

Expected Results

Cell reduction: 7-15% of cells typically removed
Size reduction: 15-20% smaller HTML files
Quality improvement: More professional, focused presentation
No loss: All essential content preserved

Verification Checklist

After streamlining:

All figures still referenced
Statistical results present
Methods sections complete
Conclusions included
HTML conversion successful
No broken cells or formatting

Abridge Option: Automated Streamlining with Full Version Preservation

The /share-project command now includes an abridge option that automates notebook streamlining while preserving full versions for reference.

When to Use Abridge

Use abridge when:

Notebooks contain historical/comparison sections
You want professional, concise presentation
Recipients need clean notebooks but you want full transparency
Analysis notebooks have become verbose over iterations

Skip abridge when:

Notebooks are already concise
Historical context is essential to understanding
Notebooks are for detailed review or audit

How Abridge Works

When you select the abridge option during /share-project :

Creates structure:

shared-package/ ├── Notebook.ipynb ← Abridged (professional) ├── Notebook.html ← From abridged version └── unabridged/ ├── Notebook.ipynb ← Full version preserved └── Notebook.html ← From full version

Automatically removes:

Historical notes ("Comparison to previous analysis")
Revision history and change logs
"Why focus on X" explanatory sections
Methodological reconciliation discussions
Overly detailed interpretations (>2000 chars without results)
Update summaries

Always preserves:

All code cells (full reproducibility)
Figure displays and captions
Statistical results and tables
Methods sections
Conclusions and findings
Data source references

Usage

During /share-project workflow:

📝 Verbosity Level

Would you like to create abridged versions of notebooks/documentation?

No (default): Include full content as-is
Yes (abridge): Remove verbose content for cleaner presentation

Creates two versions: ├── [file].ipynb ← Abridged (professional, concise) └── unabridged/ └── [file].ipynb ← Full version (all content preserved)

Create abridged versions? (y/n):

Select 'y' to enable abridging.

Benefits

Professional presentation: Recipients see clean, focused notebooks
Full transparency: Complete versions preserved in unabridged/
No information loss: Everything available for reference
Time saving: Automated instead of manual cell removal
Reproducible: All code and essential content preserved
HTML for both: Both versions get HTML exports

Example Results

Typical improvements:

7-15% fewer cells
15-20% smaller HTML files
More focused, professional presentation
Better readability for non-technical reviewers

Example notebook:

Original: 120 cells, 5.2 MB HTML
Abridged: 105 cells (12.5% reduction), 4.3 MB HTML (17% reduction)
Content: All 43 figures preserved, all statistical results intact
Removed: 15 historical/comparison cells

Verification

After abridging, automatically verify:

All figures present in abridged version
Statistical results visible
Methods sections complete
unabridged/ folder exists with full versions
Both versions have HTML exports

Integration with Other Features

Works seamlessly with:

Path verification (Step 5.5): Paths fixed in both versions
Documentation filtering: Applies same principles to markdown files
Root-level notebooks: Abridged at root, unabridged in subfolder
HTML export: Both versions get HTML for easy viewing

Example workflow:

Select files for root level
Choose reproducible package
Enable abridge option (y)
Path verification runs
Abridging creates two versions
HTML generated for both
Result: Professional package with full transparency

When Not to Use

Skip abridging if:

Notebooks already concise (<50 cells)
Historical context is essential
Audience needs complete development history
Notebooks are for regulatory/compliance review (use full only)
Iterative explanation is pedagogical value

Manual Override

If you need custom abridging patterns, you can modify the patterns in the command:

verbose_patterns = [ r'comparison to.*previous', # Historical comparisons r'why focus on', # Explanatory sections r'reconciliation with', # Methodological discussions # Add your custom patterns here ]

Quality Assurance for Sharing Packages

After creating a sharing package, verify completeness:

Structure Verification

Check directory structure

find package/ -type d | sort

Expected structure:

package/

├── notebooks/

├── scripts/

├── data/

├── figures/

└── documentation/

File Count Verification

Count files by type

echo "Notebooks: $(find package/ -name '.ipynb' | wc -l)" echo "Scripts: $(find package/ -name '.py' | wc -l)" echo "Data files: $(find package/ -name '.csv' | wc -l)" echo "Figures: $(find package/ -name '.png' | wc -l)" echo "Documentation: $(find package/ -name '*.md' | wc -l)"

Path Verification

Test that notebook paths work:

cd package/notebooks/

Try to convert notebooks (tests paths)

for nb in *.ipynb; do jupyter nbconvert --to html "$nb" --output /tmp/test.html 2>&1 |
grep -E "(Error|FileNotFound)" && echo "ERROR in $nb" || echo "OK: $nb" done

Documentation Checklist

Verify documentation is complete:

README.md with setup instructions
MANIFEST.md listing all files
Environment specification (environment.txt or .yml)
License file (if applicable)
Citation information
Data source documentation

Create Verification Notes

Document package status:

VERIFICATION_NOTES.md

Package Contents

Notebooks: X files
Scripts: Y files
Data: Z MB
Figures: N/M present

Known Limitations

Missing figures can be generated using scripts X, Y, Z
Platform-specific environment file

Testing

Notebooks load correctly
Paths work from notebooks/
HTML conversion successful
All essential files present

Size Check

Check package size

du -sh package/

Ideal sizes:

Level 1 (Summary): <5 MB

Level 2 (Reproducible): 5-50 MB

Level 3 (Full): varies

Best Practices

Notebook Cleaning

Before sharing notebooks:

Clear all outputs

jupyter nbconvert --clear-output --inplace notebook.ipynb

Remove debug cells

Tag cells for removal: Cell → Cell Tags → add "remove"
Filter during copy

Add markdown explanations

Ensure each code cell has context
Add section headers
Document assumptions

Check cell execution order

Run "Restart & Run All" to verify
Fix any out-of-order dependencies

Remove absolute paths

❌ Bad

data = pd.read_csv('/Users/yourname/project/data.csv')

✅ Good

data = pd.read_csv('../data/data.csv')

or

from pathlib import Path data_dir = Path(file).parent / 'data'

File Organization

Naming conventions for shared files:

Use descriptive names: telomere_analysis_results.csv not results.csv
Include dates for time-sensitive data: data_2024-01-15.csv
Version if applicable: analysis_v2.ipynb
No spaces: use - or _

Size considerations:

Document large files in README
Consider hosting large data separately (institutional storage, Zenodo)
Provide download links instead of including in package
Use .gitattributes for large file tracking if using Git

Documentation Requirements

Minimum documentation for each level:

Level 1 - Summary:

What the results show
Key findings
Date and author

Level 2 - Reproducible:

Setup instructions
How to run the analysis
Software dependencies
Expected runtime
Data source information

Level 3 - Full:

Complete methodology
All data sources with versions
Processing decisions and rationale
Known issues or limitations
Contact information

Dependency Management

Create requirements file:

For pip:

From active environment

pip freeze > requirements.txt

Or manually curated (better)

cat > requirements.txt << EOF pandas>=1.5.0 numpy>=1.23.0 matplotlib>=3.6.0 scipy>=1.9.0 EOF

For conda:

Export current environment

conda env export > environment.yml

Or minimal (recommended)

conda env export --from-history > environment.yml

Then edit to remove build-specific details

Common Scenarios

Scenario 1: Sharing with Lab Collaborators

Level: Reproducible

Include:

Cleaned analysis notebooks
Processed data
Figure generation scripts
environment.yml
README with reproduction steps

Don't include:

Exploratory notebooks
Failed analysis attempts
Debug outputs
Personal notes

Scenario 2: Manuscript Supplementary Material

Level: Reproducible or Full (depending on journal)

Include:

All notebooks used for figures in paper
Scripts for each figure panel
Processed data (or instructions to obtain)
Complete environment specification
Detailed methods document

Best practices:

Number notebooks to match paper sections
Export key figures in publication formats (PDF, high-res PNG)
Include data dictionary for all variables
Test reproduction on clean environment

Scenario 3: Project Archival

Level: Full Traceability

Include:

Complete data pipeline from raw to processed
All versions of analysis
Meeting notes or decision logs
External tool versions
System information

Organization tips:

Use dates in directory names
Keep chronological changelog
Document all external dependencies
Include contact info for questions

Scenario 4: Data Repository Submission (Zenodo, Figshare)

Level: Full Traceability

Additional considerations:

Add LICENSE file (CC BY 4.0, MIT, etc.)
Include CITATION.cff or CITATION.txt
Comprehensive metadata
README with DOI/reference instructions
Consider maximum file sizes
Review repository-specific guidelines

Quality Checklist

Before finalizing the sharing package:

File Quality

All notebooks run without errors
Notebook outputs cleared
No absolute paths in code
No hardcoded credentials or API keys
File sizes documented
Large files compressed or linked

Documentation

README explains setup and usage
MANIFEST describes all files
Data sources documented
Dependencies specified
Contact information included
License specified (if applicable)

Reproducibility

Requirements file tested in clean environment
All data accessible (included or linked)
Scripts run in documented order
Expected outputs match actual outputs
Processing time documented

Privacy & Sensitivity

No sensitive data included
Identifiers anonymized if needed
Institutional policies checked
Collaborator permissions obtained

Organization

Clear directory structure
Consistent naming conventions
Files logically grouped
No duplicate files
No unnecessary files (cache, .DS_Store, etc.)

Verify Data File References After Consolidation

After consolidating or moving data files, verify all code references:

Find all data loading statements:

import re

Search for read_csv patterns

csv_reads = re.findall(r'read_csv('"['"]', content)

Check against valid files

VALID_FILES = { 'data/vgp_assemblies_unified_corrected.csv', 'data/vgp_assemblies_3categories.csv', }

for csv_file in csv_reads: if 'vgp_assemblies' in csv_file and csv_file not in VALID_FILES: print(f"⚠️ DEPRECATED: {csv_file}")

Check notebooks and scripts:

Find all notebooks

find . -name "*.ipynb" -type f

Find all Python scripts

find scripts -name "*.py" -type f

Search for deprecated file patterns

grep -r "vgp_assemblies_unified.csv" .ipynb scripts/.py

Create verification report:

List each notebook/script
Show which file(s) it loads
Mark as ✓ CORRECT or ⚠️ DEPRECATED
Document expected files vs. actual

Update deprecated references:

Data processing scripts: Update to check deprecated/ folder
Analysis notebooks: Update to use consolidated files
Add comments noting file deprecation

Integration with Other Skills

Works well with:

folder-organization - Ensures source project is well-organized before sharing
jupyter-notebook-analysis - Creates notebooks that are share-ready
managing-environments - Documents dependencies properly

Before using this skill:

Organize working directory (folder-organization)
Finalize analysis (jupyter-notebook-analysis)
Document environment (managing-environments)

After using this skill:

Test package in clean environment
Share via appropriate channel (email, repository, cloud storage)
Keep archived copy for reference

Example Scripts

Create Sharing Package Script

#!/usr/bin/env python3 """Create sharing package for project."""

import shutil from pathlib import Path from datetime import date import nbformat from nbconvert.preprocessors import ClearOutputPreprocessor

def create_sharing_package(level='reproducible', output_dir=None): """ Create sharing package.

Args:
    level: 'summary', 'reproducible', or 'full'
    output_dir: Output directory name (auto-generated if None)
"""

# Create output directory
if output_dir is None:
    output_dir = f"shared-{date.today():%Y%m%d}-{level}"

share_path = Path(output_dir)
share_path.mkdir(exist_ok=True)

print(f"Creating {level} sharing package in {share_path}")

# Create structure based on level
if level == 'summary':
    create_summary_package(share_path)
elif level == 'reproducible':
    create_reproducible_package(share_path)
elif level == 'full':
    create_full_package(share_path)

print(f"✓ Package created: {share_path}")
print(f"  Review and compress: tar -czf {share_path}.tar.gz {share_path}")

def clean_notebook(input_path, output_path): """Clean notebook outputs and debug cells.""" with open(input_path) as f: nb = nbformat.read(f, as_version=4)

# Clear outputs
clear = ClearOutputPreprocessor()
nb, _ = clear.preprocess(nb, {})

# Remove debug cells
nb.cells = [c for c in nb.cells
            if 'debug' not in c.metadata.get('tags', [])]

with open(output_path, 'w') as f:
    nbformat.write(nb, f)

... implement level-specific functions ...

if name == 'main': import sys level = sys.argv[1] if len(sys.argv) > 1 else 'reproducible' create_sharing_package(level)

Summary

Key principles for project sharing:

🎯 Choose the right level - Match sharing depth to audience needs
📋 Copy, don't move - Preserve your working directory
🧹 Clean thoroughly - Remove debug code, clear outputs
📝 Document everything - README + MANIFEST minimum
🔒 Check sensitivity - Anonymize or exclude as needed
✅ Test before sharing - Run in clean environment
📦 Package properly - Compress and document contents
⚠️ Work in main directory - After creating sharing package, ALL future work happens in the original project directory, NOT in the sharing folder

Remember: Good sharing practices benefit both collaborators and your future self!

Correcting Cleanup Mistakes

Identifying Missing Files After Cleanup

When users report missing figures, scripts, or resources after a cleanup:

Check notebook/code references systematically:

Find all image references in notebooks

grep -n ".png|Image(|display(" notebook.ipynb

Find script imports

grep -n "import|from.*import|.py" notebook.ipynb

Find data file references

grep -n ".csv|.tsv|.json" notebook.ipynb

Search deprecated folders:

Find specific files

find deprecated -name "pattern" -type f

Search by file type

find deprecated -name ".png" -o -name ".py"

Check original location:

Sometimes files are in unexpected locations

find . -name "phylogenetic_tree*"

Restoration Workflow

Verify the file is truly needed:

Check if it's referenced in active notebooks
Confirm it's not redundant with other files
Check if other notebooks also use it

Restore systematically:

Restore to original location

cp deprecated/path/to/file.png figures/target/

Update sharing packages if they exist

cp deprecated/path/to/file.png sharing-package/figures/target/

Document the restoration:

Create a restoration summary document
List what was restored and why
Update MINIMAL_ESSENTIAL_FILES.md or equivalent

Example: Figure Restoration

1. Identify missing figures from notebook

grep -o "'[^']*.png'" Curation_Impact_Analysis.ipynb

2. Find them in deprecated

find deprecated -name "05_terminal_telomeres.png"

3. Restore systematically

for fig in 05_terminal_telomeres.png 07_chromosome_assignment_comparison.png; do cp "deprecated/figures/curation_impact/$fig" figures/curation_impact/ cp "deprecated/figures/curation_impact/$fig" shared-package/figures/curation_impact/ done

4. Verify restoration

ls -lh figures/curation_impact/*.png | wc -l # Should match expected count

Prevention: Better Cleanup Verification

Before finalizing cleanup:

Grep all notebooks for references:

Find all figure references across notebooks

grep -h ".png" *.ipynb | sort | uniq

Cross-reference with existing files:

Compare referenced vs existing

comm -13 <(ls figures//.png | sort) <(grep -oh "[^/]*.png" *.ipynb | sort | uniq)

Test notebook execution (if feasible):

Open each notebook
Check that all figures load
Verify no broken image references

Lesson: Always verify notebook dependencies before moving files to deprecated. Use grep to find all references before cleanup operations.

Deprecating Redundant Notebooks

Identifying Redundancy

A notebook may be redundant if:

Content overlaps with another notebook
No unique figures: All visualizations come from shared directories
No unique scripts: All code is shared with other analyses
No unique data: Uses the same datasets as other notebooks

Dependency Analysis Workflow

Before deprecating a notebook, check what it uses:

1. Check figure references

grep -o "Image(filename.*.png" notebook.ipynb | sort | uniq

2. Check if figures are unique or shared

for fig in $(grep -oh "[^/]*.png" notebook.ipynb); do grep -l "$fig" *.ipynb | grep -v notebook.ipynb done

3. Check data file usage

grep -o "read_csv.*.csv" notebook.ipynb

4. Check script imports (if any)

grep -o "^import|^from.*import" notebook.ipynb

Safe Deprecation Process

Document the notebook's purpose:

What analysis does it perform?
Why was it created?
What makes it redundant now?

Verify no unique dependencies:

Check figures

figures_used=$(grep -oh "[^'"]*.png" notebook_to_deprecate.ipynb | sort | uniq)

Check if any are unique to this notebook

for fig in $figures_used; do other_notebooks=$(grep -l "$fig" *.ipynb | grep -v notebook_to_deprecate | wc -l) if [ "$other_notebooks" -eq 0 ]; then echo "WARNING: $fig is only used by this notebook" fi done

Move notebook only (if no unique dependencies):

Move notebook to deprecated

mv Redundant_Notebook.ipynb deprecated/

Remove from sharing packages

rm -f sharing-package/notebooks/Redundant_Notebook.ipynb

Document the deprecation: Create a DEPRECATION_SUMMARY.md :

Notebook Deprecation: [Name]

Date: YYYY-MM-DD Reason: [Brief explanation]

Analysis

Figures: [all shared/unique ones moved]
Scripts: [all shared/unique ones moved]
Data: [all shared]

Files Moved

Notebook: [path] → deprecated/
Figures: [none/list]
Scripts: [none/list]

Active Notebooks

[List remaining active notebooks]

Restoration

cp deprecated/Notebook.ipynb .

Example: Both_Haplotypes Deprecation

# 1. Check dependencies
grep "\.png" Curation_Impact_Analysis_Both_Haplotypes.ipynb
# Result: All figures from figures/curation_impact/

grep "\.png" Curation_Impact_Analysis.ipynb
# Result: Same figures - confirmed shared

# 2. Check data files
grep "\.csv" Curation_Impact_Analysis_Both_Haplotypes.ipynb
# Result: vgp_assemblies_unified.csv (shared with main notebook)

# 3. Safely deprecate (notebook only, no other files)
mv Curation_Impact_Analysis_Both_Haplotypes.ipynb deprecated/
rm shared-package/notebooks/Curation_Impact_Analysis_Both_Haplotypes.ipynb

# 4. Document
echo "All figures, scripts, and data remain active (shared)" > documentation/DEPRECATION_SUMMARY.md

Key Principle: Only move the notebook if ALL dependencies are shared with active notebooks. Never move shared resources to deprecated.

⚠️ Critical Reminder for Claude

After creating any sharing package:

- Always return to the main project directory

- Never work in shared-*/
 directories - These are read-only snapshots

- All future edits, analysis, and development happen in the original working directory

- Sharing folders are for distribution only, not active development

If the user asks to modify files, always check the current directory and ensure you're working in the main project location, not in a sharing package.

project-sharing

Safety Notice

Copy this and send it to your AI assistant to learn

Create dated directory

Create subdirectories based on level

... appropriate structure from above

Check for absolute paths in notebooks

Search for file references that may need updating

Copy to temp location and test

Try to open notebook and check for errors

Project: [Project Name]

Overview

Contents

Requirements

Setup

Create environment

Reproduction Steps

Data Sources

Contact

License

File Manifest

Directory Structure

File Descriptions

Notebooks

Data

Scripts

Results

Create zip archive

Create tar.gz (better compression)

Or split into parts if very large

After creating sharing package

Continue work here, NOT in shared-YYYYMMDD/

Check directory structure

Expected structure:

package/

├── notebooks/

├── scripts/

├── data/

├── figures/

└── documentation/

Count files by type

Try to convert notebooks (tests paths)

VERIFICATION_NOTES.md

Package Contents

Known Limitations

Testing

Check package size

Ideal sizes:

Level 1 (Summary): <5 MB

Level 2 (Reproducible): 5-50 MB

Level 3 (Full): varies

❌ Bad

✅ Good

or

From active environment

Or manually curated (better)

Export current environment

Or minimal (recommended)

Then edit to remove build-specific details

Search for read_csv patterns

Check against valid files

Find all notebooks

Find all Python scripts

Search for deprecated file patterns

... implement level-specific functions ...

Find all image references in notebooks

Find script imports

Find data file references

Find specific files

Search by file type

Sometimes files are in unexpected locations

Restore to original location

Update sharing packages if they exist

1. Identify missing figures from notebook

2. Find them in deprecated

3. Restore systematically

4. Verify restoration

Find all figure references across notebooks

Compare referenced vs existing

1. Check figure references