Charting Intelligence & Data-to-Viz Pipeline Engineering
Enable any agent to (1) instantly identify the exact chart needed from raw data, (2) generate the precise path of queries/transforms to materialize that chart, and (3) evaluate and choose the optimal charting library/stack based on performance, scale, and interactivity requirements.
This is not "just call a library" — it is full-stack visualization strategy.
1. Core Decision Framework — Choosing the Chart That Fits the Data AND the Story
Before any code runs, answer these questions in order:
What is the goal of the viewer?
| Goal | Chart Type |
|---|---|
| Compare values | Bar/Column (grouped or stacked) |
| Show trend over time | Line or Area |
| Show distribution / spread | Histogram, Box Plot, Violin |
| Show relationship / correlation | Scatter, Bubble, Heatmap |
| Show composition / parts-of-whole | Stacked Bar or Area (never pie if >5 slices) |
| Show hierarchy / flow | Treemap, Sunburst, Sankey |
| Show geographic pattern | Choropleth or Symbol Map |
How many variables and what types?
| Variables | Chart |
|---|---|
| 1 numeric, unordered | Histogram / Density |
| 1 numeric + time | Line |
| 1 categorical + 1 numeric | Bar |
| 2 numeric | Scatter |
| 1 categorical + time series | Grouped or Stacked Line/Area |
| Many-to-many relationships | Heatmap or Parallel Coordinates |
Audience & Context Check
| Audience | Approach |
|---|---|
| Executive dashboard | Big numbers + simple bars/lines, zero clutter |
| Analyst/explorer | Interactive tooltips, zoom, hover details, multiple linked views |
| Mobile | Horizontal bars, large text, minimal colors |
| Accessibility | High contrast, patterns instead of color-only, alt-text descriptions |
Rule of Thumb Table
| Data Situation | Best Chart (first choice) | Avoid |
|---|---|---|
| >5 categories | Bar (horizontal) | Pie |
| Time series >20 points | Line | Column |
| Correlation between 2 measures | Scatter | Line (unless ordered) |
| Parts of whole >5 slices | Stacked Bar or Treemap | Pie/Donut |
| Outliers or distribution shape | Box + Violin | Bar |
| Flow between stages | Sankey | Anything else |
2. The Data Pipeline Engine
Most databases do NOT have the exact aggregation ready. Auto-generate the full pipeline:
Step A — Inventory
- Scan schema or sample 100 rows — detect column types, null rates, cardinality
- Flag missing aggregations (e.g., "no daily_sales_by_region view exists")
Step B — Required Transformations
Auto-generate SQL or pandas code for:
- Joins needed?
- GROUP BY + SUM/AVG/COUNT?
- Window functions for running totals or YoY?
- Binning (e.g., age into decades)?
- Pivot/unpivot?
- Outlier flagging or imputation?
Step C — Materialization Strategy
| Scale | Strategy |
|---|---|
| One-off (<10k rows) | Run query on-the-fly |
| Medium | Create materialized view or cached table |
| Large/Real-time | Pre-aggregate in Spark/DuckDB, incremental refresh |
| Extreme | Stream + windowed aggregates (Flink/Kafka) |
Step D — Validation
- Run a tiny sample query first — confirm the shape matches the chosen chart type
- If not, loop back and adjust aggregation
Example
User says "show monthly revenue by product category":
"I need: LEFT JOIN orders -> products -> categories; GROUP BY month, category; SUM(revenue). No view exists -> I will create temp table or run inline. Chart type: Stacked Area. Library recommendation below."
3. Library Selection Matrix
Always output the performance trade-off and recommended stack.
| Scale / Requirement | Recommended Library | Why | Fallback |
|---|---|---|---|
| <10k points, simple web dashboard | Chart.js or Recharts | <10 ms render, ~60 KB bundle | N/A |
| 10k-500k points, interactive | Apache ECharts or Plotly.js | Canvas + WebGL, 60 fps on 100k points | D3 (slower) |
| 500k-10M+ points, real-time | LightningChart or Highcharts Stock + WebGL | GPU accelerated, <50 ms at 5M points | Anything SVG-based fails |
| Python backend + web | Plotly Dash or Bokeh | Server-side render + client streaming | Matplotlib (static only) |
| Python notebook exploration | Seaborn + Plotly | Instant, beautiful defaults | -- |
| Extremely large / streaming | DuckDB + Observable Plot or Perspective | In-memory columnar, sub-second on billions | -- |
| No JavaScript (PDF reports) | Matplotlib + WeasyPrint or ReportLab | Pure Python, vector output | -- |
Optimization Rules (apply automatically)
- Downsample for overview, show full detail on zoom (ECharts built-in)
- Use Canvas instead of SVG above ~5k elements
- Pre-aggregate at DB level whenever possible (biggest single win)
- Lazy load charts below the fold
- Bundle size: tree-shake everything except the one chart type you need
- GPU vs CPU: if >100k points and user needs pan/zoom, force WebGL path
4. Full Workflow
- Parse intent — identify required chart type from user request
- Schema scan — detect column types, cardinality, row estimates
- Decision framework — output chart recommendation + rationale
- Generate transforms — exact SQL/pandas/transform code needed
- Choose library — select by performance tier based on row estimate
- Emit deliverables:
- Chart spec (JSON for the library or React component)
- SQL/transform script
- Performance warning or confirmation
- Accessibility note + alt-text template
5. Advanced Capabilities
- "Show me what I should be charting but aren't" — auto-correlation scan + suggested visuals
- "Optimize this dashboard for 10x speed" — rewrite query + switch library
- "Make this mobile-first" — auto-switch to horizontal bars + simplify
- Color-blind & accessibility mode — toggle patterns, high contrast
- Export — SVG/PNG/PDF with embedded data table