distribution-analyzer

Distribution Analyzer Expert

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install the skill "distribution-analyzer" with this command: npx skills add dengineproblem/agents-monorepo/dengineproblem-agents-monorepo-distribution-analyzer

An expert skill for analyzing statistical distributions.

Core Principles

Exploratory Data Analysis

  • Start with visualization (histograms, Q-Q plots)

  • Descriptive statistics (mean, median, std, skewness, kurtosis)

  • Identify outliers and anomalies

Multiple Testing

  • Kolmogorov-Smirnov test

  • Anderson-Darling test

  • Shapiro-Wilk test

  • Chi-square goodness-of-fit
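As a minimal sketch, several of these tests can be run on the same sample with scipy.stats (the sample here is synthetic, for illustration only):

```python
import numpy as np
from scipy import stats

# Illustrative sample; a real analysis would use your own data.
rng = np.random.default_rng(1)
sample = rng.normal(loc=0.0, scale=1.0, size=300)

# Three complementary normality tests on the same data.
_, p_shapiro = stats.shapiro(sample)        # well suited to small n
_, p_dagostino = stats.normaltest(sample)   # based on skewness and kurtosis
_, p_ks = stats.kstest(sample, 'norm',
                       args=(sample.mean(), sample.std()))
```

Agreement between the tests strengthens the conclusion; disagreement usually points at sample-size sensitivity or tail behavior.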

Model Selection

  • AIC/BIC criteria

  • Cross-validation

  • Residual analysis

Distribution Families

Continuous Distributions

Normal (Gaussian): Use cases: Natural phenomena, measurement errors. Parameters: μ (mean), σ (std). When: Symmetric, additive processes. Test: Shapiro-Wilk.

Log-Normal: Use cases: Income, stock prices, file sizes. Parameters: μ, σ (of log). When: Multiplicative processes, positive skew. Test: Transform and test normality.

Exponential: Use cases: Time between events, failure times. Parameters: λ (rate). When: Memoryless processes. Test: K-S test.

Gamma: Use cases: Wait times, insurance claims. Parameters: α (shape), β (rate). When: Sum of exponential events.

Weibull: Use cases: Reliability, survival analysis. Parameters: λ (scale), k (shape). When: Failure time modeling.

Beta: Use cases: Proportions, probabilities. Parameters: α, β. When: Bounded [0, 1] data.

Pareto: Use cases: Wealth, city sizes. Parameters: α (shape), xm (scale). When: Power law, heavy tail.

Student's t: Use cases: Small samples, heavy tails. Parameters: ν (degrees of freedom). When: Normal-like but heavier tails.
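A minimal sketch of choosing between two of these families by AIC (the distribution names match scipy.stats; the data here is synthetic):

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample from a multiplicative process.
rng = np.random.default_rng(7)
data = rng.lognormal(mean=1.0, sigma=0.6, size=1000)

# Fit a symmetric and a right-skewed candidate, then compare by AIC.
aic = {}
for name, dist in {'norm': stats.norm, 'lognorm': stats.lognorm}.items():
    params = dist.fit(data)
    log_lik = np.sum(dist.logpdf(data, *params))
    aic[name] = 2 * len(params) - 2 * log_lik  # lower is better

best = min(aic, key=aic.get)
```

For this multiplicative sample the log-normal wins, matching the guidance above.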

Discrete Distributions

Poisson: Use cases: Event counts, rare events. Parameters: λ (rate). When: Events in a fixed interval.

Binomial: Use cases: Success/failure trials. Parameters: n (trials), p (probability). When: Fixed number of independent trials.

Negative Binomial: Use cases: Overdispersed counts. Parameters: r, p. When: Variance > mean.

Geometric: Use cases: Number of trials until success. Parameters: p (probability). When: Waiting for the first success.
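The Poisson-vs-negative-binomial choice can be screened with a variance-to-mean ratio. This sketch uses a hypothetical `dispersion_index` helper and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(42)

def dispersion_index(counts):
    """Variance-to-mean ratio: roughly 1 suggests Poisson, >1 overdispersion."""
    counts = np.asarray(counts, dtype=float)
    return counts.var() / counts.mean()

poisson_like = rng.poisson(lam=4.0, size=2000)            # mean close to variance
overdispersed = rng.negative_binomial(3, 0.3, size=2000)  # variance exceeds mean
```

A ratio well above 1 is the usual cue to move from Poisson to negative binomial.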

Distribution Analyzer Class

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
from typing import Dict, List, Tuple, Optional


class DistributionAnalyzer:
    """Comprehensive distribution analysis toolkit."""

    DISTRIBUTIONS = {
        'norm': stats.norm,
        'lognorm': stats.lognorm,
        'expon': stats.expon,
        'gamma': stats.gamma,
        'weibull_min': stats.weibull_min,
        'beta': stats.beta,
        'pareto': stats.pareto,
        't': stats.t
    }

    def __init__(self, data: np.ndarray):
        self.data = np.array(data)
        self.n = len(self.data)
        self.results = {}

    def descriptive_stats(self) -> Dict:
        """Calculate descriptive statistics."""
        return {
            'n': self.n,
            'mean': np.mean(self.data),
            'median': np.median(self.data),
            'std': np.std(self.data),
            'var': np.var(self.data),
            'min': np.min(self.data),
            'max': np.max(self.data),
            'range': np.ptp(self.data),
            'skewness': stats.skew(self.data),
            'kurtosis': stats.kurtosis(self.data),
            'q25': np.percentile(self.data, 25),
            'q50': np.percentile(self.data, 50),
            'q75': np.percentile(self.data, 75),
            'iqr': stats.iqr(self.data)
        }

    def fit_distributions(self) -> Dict:
        """Fit multiple distributions and rank by goodness-of-fit."""
        results = {}

        for name, dist in self.DISTRIBUTIONS.items():
            try:
                # Fit distribution
                params = dist.fit(self.data)

                # Calculate log-likelihood
                log_likelihood = np.sum(dist.logpdf(self.data, *params))

                # Calculate AIC and BIC
                k = len(params)
                aic = 2 * k - 2 * log_likelihood
                bic = k * np.log(self.n) - 2 * log_likelihood

                # K-S test against the fitted distribution
                ks_stat, ks_pvalue = stats.kstest(self.data, name, args=params)

                results[name] = {
                    'params': params,
                    'log_likelihood': log_likelihood,
                    'aic': aic,
                    'bic': bic,
                    'ks_statistic': ks_stat,
                    'ks_pvalue': ks_pvalue
                }
            except Exception as e:
                results[name] = {'error': str(e)}

        # Rank by AIC (lower is better)
        valid_results = {k: v for k, v in results.items() if 'aic' in v}
        ranked = sorted(valid_results.items(), key=lambda x: x[1]['aic'])

        self.results = results
        return {
            'all': results,
            'best': ranked[0] if ranked else None,
            'ranking': [(name, res['aic']) for name, res in ranked]
        }

    def test_normality(self) -> Dict:
        """Comprehensive normality testing."""
        results = {}

        # Shapiro-Wilk (best for n < 5000)
        if self.n < 5000:
            stat, pvalue = stats.shapiro(self.data)
            results['shapiro_wilk'] = {
                'statistic': stat,
                'pvalue': pvalue,
                'is_normal': pvalue > 0.05
            }

        # D'Agostino-Pearson
        if self.n >= 20:
            stat, pvalue = stats.normaltest(self.data)
            results['dagostino_pearson'] = {
                'statistic': stat,
                'pvalue': pvalue,
                'is_normal': pvalue > 0.05
            }

        # Anderson-Darling
        result = stats.anderson(self.data, dist='norm')
        results['anderson_darling'] = {
            'statistic': result.statistic,
            'critical_values': dict(zip(
                [f'{cv}%' for cv in result.significance_level],
                result.critical_values
            )),
            'is_normal': result.statistic < result.critical_values[2]  # 5% level
        }

        # Kolmogorov-Smirnov
        stat, pvalue = stats.kstest(
            self.data, 'norm',
            args=(np.mean(self.data), np.std(self.data))
        )
        results['kolmogorov_smirnov'] = {
            'statistic': stat,
            'pvalue': pvalue,
            'is_normal': pvalue > 0.05
        }

        return results

    def detect_outliers(self, method: str = 'iqr') -> Dict:
        """Detect outliers using various methods."""
        results = {'method': method}

        if method == 'iqr':
            q1, q3 = np.percentile(self.data, [25, 75])
            iqr = q3 - q1
            lower = q1 - 1.5 * iqr
            upper = q3 + 1.5 * iqr
            outliers = self.data[(self.data < lower) | (self.data > upper)]
            results.update({
                'lower_bound': lower,
                'upper_bound': upper,
                'outliers': outliers,
                'n_outliers': len(outliers),
                'outlier_percentage': len(outliers) / self.n * 100
            })

        elif method == 'zscore':
            z_scores = np.abs(stats.zscore(self.data))
            threshold = 3
            outliers = self.data[z_scores > threshold]
            results.update({
                'threshold': threshold,
                'outliers': outliers,
                'n_outliers': len(outliers),
                'outlier_percentage': len(outliers) / self.n * 100
            })

        elif method == 'mad':
            median = np.median(self.data)
            mad = np.median(np.abs(self.data - median))
            threshold = 3.5
            modified_z = 0.6745 * (self.data - median) / mad
            outliers = self.data[np.abs(modified_z) > threshold]
            results.update({
                'median': median,
                'mad': mad,
                'outliers': outliers,
                'n_outliers': len(outliers),
                'outlier_percentage': len(outliers) / self.n * 100
            })

        return results

    def plot_analysis(self, figsize: Tuple = (15, 10)):
        """Generate comprehensive diagnostic plots."""
        fig, axes = plt.subplots(2, 3, figsize=figsize)

        # Histogram with KDE
        axes[0, 0].hist(self.data, bins='auto', density=True, alpha=0.7)
        kde_x = np.linspace(self.data.min(), self.data.max(), 100)
        kde = stats.gaussian_kde(self.data)
        axes[0, 0].plot(kde_x, kde(kde_x), 'r-', lw=2)
        axes[0, 0].set_title('Histogram with KDE')

        # Q-Q Plot (Normal)
        stats.probplot(self.data, dist="norm", plot=axes[0, 1])
        axes[0, 1].set_title('Q-Q Plot (Normal)')

        # Box Plot
        axes[0, 2].boxplot(self.data, vert=True)
        axes[0, 2].set_title('Box Plot')

        # ECDF
        sorted_data = np.sort(self.data)
        ecdf = np.arange(1, self.n + 1) / self.n
        axes[1, 0].step(sorted_data, ecdf, where='post')
        axes[1, 0].set_title('Empirical CDF')

        # P-P Plot
        theoretical = stats.norm.cdf(sorted_data,
                                     loc=np.mean(self.data),
                                     scale=np.std(self.data))
        axes[1, 1].scatter(theoretical, ecdf, alpha=0.5)
        axes[1, 1].plot([0, 1], [0, 1], 'r--')
        axes[1, 1].set_title('P-P Plot (Normal)')

        # Violin Plot
        axes[1, 2].violinplot(self.data)
        axes[1, 2].set_title('Violin Plot')

        plt.tight_layout()
        return fig

    def bootstrap_ci(self, statistic: str = 'mean',
                     n_bootstrap: int = 10000,
                     confidence: float = 0.95) -> Dict:
        """Calculate bootstrap confidence intervals."""
        stat_func = {
            'mean': np.mean,
            'median': np.median,
            'std': np.std
        }[statistic]

        bootstrap_stats = []
        for _ in range(n_bootstrap):
            sample = np.random.choice(self.data, size=self.n, replace=True)
            bootstrap_stats.append(stat_func(sample))

        alpha = 1 - confidence
        lower = np.percentile(bootstrap_stats, alpha / 2 * 100)
        upper = np.percentile(bootstrap_stats, (1 - alpha / 2) * 100)

        return {
            'statistic': statistic,
            'point_estimate': stat_func(self.data),
            'ci_lower': lower,
            'ci_upper': upper,
            'confidence': confidence,
            'standard_error': np.std(bootstrap_stats)
        }

Usage Examples

# Example usage
import numpy as np
import matplotlib.pyplot as plt

# Generate sample data
np.random.seed(42)
data = np.random.lognormal(mean=2, sigma=0.5, size=1000)

# Create analyzer
analyzer = DistributionAnalyzer(data)

# Descriptive statistics
print("Descriptive Stats:")
print(analyzer.descriptive_stats())

# Fit distributions
print("\nDistribution Fitting:")
fit_results = analyzer.fit_distributions()
print(f"Best fit: {fit_results['best'][0]}")
print(f"AIC: {fit_results['best'][1]['aic']:.2f}")

# Test normality (Anderson-Darling reports a statistic, not a p-value)
print("\nNormality Tests:")
norm_results = analyzer.test_normality()
for test, result in norm_results.items():
    if 'pvalue' in result:
        print(f"{test}: p-value = {result['pvalue']:.4f}")
    else:
        print(f"{test}: statistic = {result['statistic']:.4f}")

# Detect outliers
print("\nOutlier Detection:")
outliers = analyzer.detect_outliers(method='iqr')
print(f"Found {outliers['n_outliers']} outliers ({outliers['outlier_percentage']:.1f}%)")

# Bootstrap CI
print("\nBootstrap Confidence Interval:")
ci = analyzer.bootstrap_ci('mean', confidence=0.95)
print(f"Mean: {ci['point_estimate']:.2f} [{ci['ci_lower']:.2f}, {ci['ci_upper']:.2f}]")

# Generate plots
fig = analyzer.plot_analysis()
plt.savefig('distribution_analysis.png')

Distribution Selection Guide

Symmetric data:

  • Start with Normal
  • If heavy tails: Student's t
  • If lighter tails: Consider uniform

Right-skewed data:

  • Log-normal for multiplicative
  • Gamma for waiting times
  • Weibull for reliability
  • Exponential if memoryless

Left-skewed data:

  • Reflect and fit right-skewed
  • Beta with appropriate parameters

Bounded data [0, 1]:

  • Beta distribution
  • Truncated normal if near-normal

Count data:

  • Poisson if mean ≈ variance
  • Negative binomial if overdispersed
  • Binomial for fixed trials

Heavy tails:

  • Pareto for power law
  • Student's t
  • Cauchy (extreme)
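The guide above can be sketched as a rough triage function. `suggest_families` is a hypothetical helper, and the skewness threshold of 0.5 is illustrative, not canonical:

```python
import numpy as np
from scipy import stats

def suggest_families(data):
    """Rough shortlist of candidate families based on bounds and skewness."""
    data = np.asarray(data, dtype=float)
    skew = stats.skew(data)

    if data.min() >= 0 and data.max() <= 1:
        return ['beta', 'truncated normal']
    if abs(skew) < 0.5:                       # roughly symmetric
        return ['norm', 't']
    if skew > 0:                              # right-skewed
        return ['lognorm', 'gamma', 'weibull_min', 'expon']
    return ['reflect data, then fit a right-skewed family']

# Synthetic examples of the two most common cases.
symmetric = np.random.default_rng(0).normal(size=1000)
skewed = np.random.default_rng(0).lognormal(mean=0.0, sigma=0.8, size=1000)
```

A shortlist like this only narrows the search; the candidates still need to be fitted and ranked by AIC/BIC as described earlier.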

Common Pitfalls

Visual inspection alone: Problem: Histograms can be misleading. Solution: Use multiple tests and Q-Q plots.

Ignoring sample size: Problem: Tests behave differently as n changes. Solution: Use Shapiro-Wilk for small samples, K-S for large ones.

Single test reliance: Problem: Each test has its own assumptions. Solution: Use multiple complementary tests.

Overfitting: Problem: A complex distribution fits noise. Solution: AIC/BIC penalize extra parameters.

Parameter uncertainty: Problem: Point estimates hide uncertainty. Solution: Use bootstrap confidence intervals.

Best Practices

  • Visualize first: always start with plots

  • Multiple tests: do not rely on a single test

  • Sample size awareness: choose tests to match the sample size

  • Domain knowledge: account for the physical meaning of the data

  • Report uncertainty: provide confidence intervals

  • Validate: use cross-validation for model selection
