distribution-search

Distribution Search

Overview

This skill provides structured guidance for finding probability distributions that satisfy specific statistical constraints. These problems typically involve constructing a discrete probability distribution over a large vocabulary that achieves target values for metrics like KL divergence, entropy, or other information-theoretic quantities.

Mathematical Analysis First

Before writing any code, perform thorough mathematical analysis to constrain the solution space:

Derive analytical relationships between the target metrics and distribution properties

For KL divergence from uniform: KL(P||U) = -H(P) + log(V) where H(P) is entropy and V is vocabulary size
For KL divergence to uniform: KL(U||P) = log(V) - (1/V) * Σ log(P(i))

Calculate fixed quantities from the problem specification

Vocabulary size V determines log(V)
Target values combined with log(V) constrain feasible entropy ranges

Identify implied constraints from simultaneous requirements

Multiple target metrics often severely restrict the feasible solution space
Determine if the problem is over-constrained before attempting optimization

Estimate the solution structure analytically

Determine the approximate entropy the distribution must have
Estimate how concentrated or spread the probability mass should be

Parameterization Strategy

High-dimensional optimization over all probability values is impractical. Use reduced parameterizations:

Recommended Parameterizations

Power law with uniform tail: k high-probability elements following p_i ∝ i^(-α), remaining elements share probability equally

Parameters: k (number of special elements), α (power law exponent), p_rest (probability for tail)

Two-group distribution: k elements with probability p_high, remaining elements with probability p_low

Parameters: k, p_high (p_low is determined by normalization)

Multi-tier distribution: Several groups of elements with distinct probability levels

Parameters: group sizes and probability levels

Parameter Selection Guidance

Start with few parameters (2-3) and add flexibility only if needed
Derive parameter bounds from mathematical constraints rather than arbitrary choices
Calculate reasonable initial values from the analytical solution estimates

Implementation Approach

Modular Code Structure

Organize code into separate, tested functions:

kl_forward(P, V): Compute KL(P||Uniform)
kl_backward(P, V): Compute KL(Uniform||P)
create_distribution(params, V): Generate distribution from parameterization
objective(params): Optimization objective combining constraints
verify_solution(P, V, targets): Independent verification

Computational Efficiency

Avoid full array operations when possible - use analytical formulas for symmetric distributions
For two-group distributions: compute entropy/KL directly from group sizes and probabilities
Estimate computational cost before running - set appropriate timeouts

Optimization Strategy

Coarse grid search over reduced parameter space to find promising regions
Local optimization from multiple starting points in promising regions
Refinement with tighter tolerances near solutions

Verification

Independent Verification

Always verify solutions with code independent from the optimization:

Check probability constraints: All values non-negative, sum to 1.0
Compute metrics directly: Use explicit formulas, not the optimization objective
Test against known cases: Verify computation on uniform distribution or other known solutions

Common Verification Bugs

Incorrect handling of log(0) - use appropriate thresholds (e.g., max(p, 1e-30))
Array indexing errors when elements are counted vs indexed
Forgetting to normalize after parameter adjustments
Multiplication errors when computing sums over groups

Common Pitfalls

Computational

Starting with full-dimensional optimization: Immediately recognize the need for reduced parameterization
Insufficient timeout estimation: Estimate iterations needed before running
Rewriting entire scripts: Use modular code to enable targeted fixes

Mathematical

Incomplete constraint analysis: Fully leverage mathematical relationships before coding
Arbitrary parameter bounds: Derive bounds from problem constraints
Poor initial values: Use analytical estimates for starting points

Verification

Trusting optimization output directly: Always verify with independent computation
Not testing verification code: Verify the verifier using known solutions first
Numerical precision issues: Handle small probabilities carefully

Problem-Solving Workflow

Analyze: Derive analytical relationships and estimate solution structure
Parameterize: Choose reduced parameterization matching expected structure
Implement: Write modular, tested code for each component
Search: Grid search followed by local optimization
Verify: Independent verification of candidate solutions
Refine: Adjust parameterization if tolerances not achieved

When Solutions Are Not Found

If optimization fails to find valid solutions:

Check if the problem is mathematically feasible given constraints
Verify the parameterization can represent valid solutions
Expand the parameterization to add flexibility
Check for bugs in objective function or constraint handling
Try different optimization algorithms or starting points

distribution-search

Safety Notice

Copy this and send it to your AI assistant to learn

Source Transparency

Related Skills

extracting-pdf-text

video-processing

google-workspace

portfolio-optimization