# Dataset Management Patterns

Reference patterns for creating and managing Dataiku datasets via the Python API.
## Dataset Types

| Type | Use When | Creation Method |
|------|----------|-----------------|
| Managed | Output of recipes, stored in a connection (SQL, HDFS, etc.) | `project.new_managed_dataset(name)` |
| Uploaded | Importing local files (CSV, Excel, etc.) | `project.create_upload_dataset(name)` or `project.create_dataset(name, "UploadedFiles", ...)` |
| SQL Table | Pointing to an existing database table | `project.create_dataset(name, "Snowflake", ...)` |
## Create a Managed Dataset

```python
builder = project.new_managed_dataset("MY_OUTPUT")
builder.with_store_into("connection_name")
ds = builder.create()
```
### Configure table location (SQL databases)

```python
settings = ds.get_settings()
raw = settings.get_raw()
raw["params"]["schema"] = "MY_SCHEMA"
raw["params"]["table"] = "MY_OUTPUT"
settings.save()
```
## Upload a File

```python
ds = project.create_dataset(
    "my_dataset",
    "UploadedFiles",
    params={"uploadConnection": "filesystem_managed"},
)

with open("path/to/data.csv", "rb") as f:
    ds.uploaded_add_file(f, "data.csv")
```
### Auto-detect schema from file contents

```python
settings = ds.autodetect_settings(infer_storage_types=True)
settings.save()
```
**Simpler alternative:** Use `create_upload_dataset` to skip the manual params configuration:

```python
ds = project.create_upload_dataset("my_dataset")

with open("path/to/data.csv", "rb") as f:
    ds.uploaded_add_file(f, "data.csv")
```
## Common Column Types

| Dataiku Type | Description |
|--------------|-------------|
| `string` | Text |
| `int` / `bigint` | Integer / Large integer |
| `double` / `float` | Decimal numbers |
| `boolean` | True/False |
| `date` | Date only |
See `references/column-types.md` for the full type table.
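As an illustration of how these storage types line up with plain Python values, the sketch below guesses a Dataiku type from a sample value. Both the helper name `dataiku_type_for` and the exact mapping choices (`int` → `bigint`, `float` → `double`) are assumptions for illustration, not part of the Dataiku API:

```python
import datetime

# Illustrative mapping from Python value types to Dataiku storage types.
# The choices (int -> "bigint", float -> "double") are assumptions based
# on the table above, not an official Dataiku API.
def dataiku_type_for(value):
    """Guess a Dataiku storage type for a sample Python value."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, datetime.date):
        return "date"
    return "string"

print(dataiku_type_for(3.14))  # double
print(dataiku_type_for(True))  # boolean
```

Checking `bool` before `int` matters because `True` would otherwise match the integer branch.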
## Core Schema Operations

### Get Schema

```python
ds = project.get_dataset("my_dataset")
schema = ds.get_settings().get_schema()
for col in schema["columns"]:
    print(f"{col['name']}: {col['type']}")
```
### Set Schema

```python
settings = ds.get_settings()
settings.set_schema({"columns": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"},
]})
settings.save()
```
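Building the `{"columns": [...]}` payload by hand gets repetitive for wide schemas. A small convenience that constructs it from a name → type mapping; `build_schema` is a hypothetical helper, not part of the Dataiku API:

```python
# Hypothetical helper (not Dataiku API): build the payload expected by
# set_schema() from an ordered name -> type mapping.
def build_schema(columns):
    """columns: dict of column name -> Dataiku type string."""
    return {"columns": [{"name": n, "type": t} for n, t in columns.items()]}

schema = build_schema({"id": "string", "amount": "double"})
# schema == {"columns": [{"name": "id", "type": "string"},
#                        {"name": "amount", "type": "double"}]}
```

Since Python 3.7 dicts preserve insertion order, so column order in the payload matches the order of the mapping.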
### Auto-detect Schema

```python
settings = dataset.autodetect_settings()
settings.save()
```

**Note:** `autodetect_settings()` is a method on `DSSDataset`, not on `DSSDatasetSettings`. It returns a new settings object with the detected schema applied.
See `references/schema-operations.md` for join compatibility checks, helper functions, and advanced operations.
## SQL Schema Rule

Output datasets for SQL-based recipes MUST have schemas set before building. Without a schema, Dataiku generates `CREATE TABLE () ...`, which fails.

For SQL databases (Snowflake, BigQuery), use UPPERCASE column names. Lowercase names get quoted, causing "invalid identifier" errors.
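Both rules can be verified before triggering a build. The function below is an illustrative pre-flight check over the dict returned by `get_settings().get_schema()`; the function name and error messages are assumptions, not Dataiku API:

```python
# Illustrative pre-flight check for the two rules above; not part of the
# Dataiku API. Pass it the dict returned by get_settings().get_schema().
def check_sql_schema(schema):
    """Raise ValueError if the schema is empty or has non-uppercase names."""
    columns = (schema or {}).get("columns", [])
    if not columns:
        raise ValueError("Schema is empty: set it before building, or the "
                         "generated CREATE TABLE statement will fail.")
    bad = [c["name"] for c in columns if c["name"] != c["name"].upper()]
    if bad:
        raise ValueError(f"Non-uppercase column names (will be quoted): {bad}")

check_sql_schema({"columns": [{"name": "ID", "type": "string"}]})  # passes silently
```

Run a check like this right before `build()` on any SQL-backed flow step; failing fast with a clear message beats debugging a database error after the fact.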
### Normalize column names to uppercase for SQL

```python
raw = settings.get_raw()
for col in raw.get("schema", {}).get("columns", []):
    col["name"] = col["name"].upper()
settings.save()
```
## List Datasets in Project

```python
datasets = project.list_datasets()
for ds in datasets:
    print(f"- {ds['name']} ({ds.get('type', 'unknown')})")
```
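Each entry in the listing behaves like a dict with at least `name` and `type` keys, as the loop above assumes. A sketch that groups such entries by dataset type, handy for auditing a large project; `group_by_type` is a hypothetical helper operating on plain dicts:

```python
from collections import defaultdict

# Hypothetical helper (not Dataiku API): group list_datasets()-style
# entries by their dataset type.
def group_by_type(datasets):
    groups = defaultdict(list)
    for ds in datasets:
        groups[ds.get("type", "unknown")].append(ds["name"])
    return dict(groups)

sample = [{"name": "raw", "type": "UploadedFiles"},
          {"name": "clean", "type": "Snowflake"},
          {"name": "scored", "type": "Snowflake"}]
print(group_by_type(sample))
# {'UploadedFiles': ['raw'], 'Snowflake': ['clean', 'scored']}
```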
## Common Issues

| Issue | Cause | Solution |
|-------|-------|----------|
| Schema mismatch | Recipe output doesn't match the declared schema | Run `autodetect_settings()` |
| Join fails | Key type mismatch | Check types, cast if needed |
| Missing columns | Schema not updated | Rebuild dataset, update schema |
| Parse errors | Wrong type detection | Manually set schema |
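For the "Join fails" row, one way to pre-check key types is to compare the two columns' storage types, treating each numeric family as interchangeable. The compatibility grouping below (`int`/`bigint` together, `double`/`float` together) is an assumption for illustration, not an official Dataiku rule:

```python
# Illustrative join-key type check; the compatibility groups are an
# assumption, not an official Dataiku rule.
NUMERIC_INT = {"int", "bigint"}
NUMERIC_FLOAT = {"double", "float"}

def join_keys_compatible(left_type, right_type):
    """True if two Dataiku storage types are safe to join on directly."""
    if left_type == right_type:
        return True
    if left_type in NUMERIC_INT and right_type in NUMERIC_INT:
        return True
    if left_type in NUMERIC_FLOAT and right_type in NUMERIC_FLOAT:
        return True
    return False

print(join_keys_compatible("int", "bigint"))   # True
print(join_keys_compatible("string", "int"))   # False
```

When the check fails, cast one side in a prepare or SQL recipe before the join rather than relying on implicit coercion.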
## Detailed References

- `references/column-types.md` — Full column type table with Python equivalents
- `references/schema-operations.md` — All schema operations, join compatibility checks, helper functions