data-science

Data analysis, SQL, and insights generation.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install the "data-science" skill by sending this command to your AI assistant:

npx skills add htlin222/dotfiles/htlin222-dotfiles-data-science

Data Science

Data analysis, SQL, and insights generation.

When to Use

  • Writing SQL queries

  • Data analysis and exploration

  • Creating visualizations

  • Statistical analysis

  • ETL and data pipelines
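
The ETL bullet above can be sketched as a tiny extract-transform-load pass in pandas; this is a minimal illustration, and the column names and values are made-up assumptions, not a fixed schema.

```python
import pandas as pd

# Extract (hypothetical): in practice this would be pd.read_csv or a SQL query.
raw = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": ["10.5", "20.0", None, "5.25"],
    "category": ["a", "b", "a", None],
})

# Transform: coerce types, drop rows missing required fields, fill defaults.
clean = raw.copy()
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")
clean = clean.dropna(subset=["amount"])
clean["category"] = clean["category"].fillna("Unknown")

# Load: an aggregate ready to write to a warehouse table or file.
summary = clean.groupby("category")["amount"].sum()
```

The same shape scales up: swap the in-memory frame for a real source and the final aggregate for a `to_sql` or `to_parquet` call.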

SQL Patterns

Common Queries

-- Aggregation with window functions
SELECT
  user_id,
  order_date,
  amount,
  SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date) AS running_total,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date DESC) AS recency_rank
FROM orders;

-- CTEs for readability
WITH monthly_stats AS (
  SELECT
    DATE_TRUNC('month', created_at) AS month,
    COUNT(*) AS total_orders,
    SUM(amount) AS revenue
  FROM orders
  GROUP BY 1
),
growth AS (
  SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
    (revenue - LAG(revenue) OVER (ORDER BY month))
      / NULLIF(LAG(revenue) OVER (ORDER BY month), 0) AS growth_rate
  FROM monthly_stats
)
SELECT * FROM growth;

BigQuery Specifics

-- Partitioned table query
SELECT *
FROM project.dataset.events
WHERE DATE(_PARTITIONTIME) BETWEEN '2024-01-01' AND '2024-01-31';

-- UNNEST for arrays
SELECT user_id, item
FROM project.dataset.orders,
  UNNEST(items) AS item;

-- Approximate counts for large data
SELECT APPROX_COUNT_DISTINCT(user_id) AS unique_users
FROM project.dataset.events;

Python Analysis

import pandas as pd
import numpy as np

Load and explore

df = pd.read_csv('data.csv')
df.info()
df.describe()

Clean and transform

df['date'] = pd.to_datetime(df['date'])
df = df.dropna(subset=['required_field'])
df['category'] = df['category'].fillna('Unknown')

Aggregate

summary = df.groupby('category').agg({
    'value': ['mean', 'sum', 'count'],
    'date': ['min', 'max'],
}).round(2)

Visualize

import matplotlib.pyplot as plt

df.groupby('date')['value'].sum().plot(figsize=(12, 6))
plt.title('Daily Values')
plt.savefig('chart.png', dpi=150, bbox_inches='tight')

Statistical Analysis

from scipy import stats

Hypothesis testing

t_stat, p_value = stats.ttest_ind(group_a, group_b)

Correlation

correlation = df['x'].corr(df['y'])

Regression

from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X, y)
print(f"R² = {model.score(X, y):.3f}")

Output Format

Analysis Summary

Question: [What we're trying to answer]
Data Source: [Tables/files used]
Date Range: [Time period]

Key Findings

  1. [Finding with supporting metric]
  2. [Finding with supporting metric]

Visualization

[Chart description or embedded image]

Recommendations

  • [Actionable insight]

Examples

Input: "Analyze user retention"
Action: Query cohort data, calculate retention rates, visualize trends
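
The retention step could be sketched in pandas as below; the activity log, its column names, and the month offsets are illustrative assumptions, not a fixed schema.

```python
import pandas as pd

# Hypothetical activity log: one row per user per active month,
# with "month" measured as months since that user's signup.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "month":   [0, 1, 2, 0, 2, 0],
})

# Retention curve: share of all users still active N months after signup.
n_users = events["user_id"].nunique()
retention = events.groupby("month")["user_id"].nunique() / n_users
```

In a real analysis the month offsets would come from joining an orders or events table against each user's signup date, typically in SQL before pulling into pandas.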

Input: "Find top customers"
Action: Write SQL for RFM analysis, segment users, summarize findings
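
A minimal RFM (recency, frequency, monetary) sketch in pandas, assuming a hypothetical order history; the dates, amounts, and reference date are made up for illustration.

```python
import pandas as pd

# Hypothetical order history.
orders = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-03-01", "2024-02-10",
         "2024-01-02", "2024-02-15", "2024-03-20"]),
    "amount": [50.0, 30.0, 200.0, 10.0, 15.0, 20.0],
})

# Reference date for recency (assumed "today").
now = pd.Timestamp("2024-04-01")

# One row per user: days since last order, order count, total spend.
rfm = orders.groupby("user_id").agg(
    recency=("order_date", lambda d: (now - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)

# "Top customers" here means highest total spend; other segmentations apply.
top = rfm.sort_values("monetary", ascending=False)
```

From here, segmenting typically means binning each of the three columns into quantile scores (e.g. `pd.qcut`) and grouping users by score combination.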

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
