Java Development in Apache Beam
Project Structure
Key Directories
-
sdks/java/core
-
Core Java SDK (PCollection, PTransform, Pipeline)
-
sdks/java/harness
-
SDK harness (container entrypoint)
-
sdks/java/io/
-
I/O connectors (51+ connectors including BigQuery, Kafka, JDBC, etc.)
-
sdks/java/extensions/
-
Extensions (SQL, ML, protobuf, etc.)
-
runners/
-
Runner implementations:
-
runners/direct-java
-
Direct Runner (local execution)
-
runners/flink/
-
Flink Runner
-
runners/spark/
-
Spark Runner
-
runners/google-cloud-dataflow-java/
-
Dataflow Runner
-
examples/java/
-
Java examples including WordCount
Build System
Apache Beam uses Gradle with a custom BeamModulePlugin . Every Java project's build.gradle starts with:
apply plugin: 'org.apache.beam.module' applyJavaNature( ... )
Common Commands
Build Commands
Compile a specific project
./gradlew -p sdks/java/core compileJava
Build a project (compile + tests)
./gradlew :sdks:java:harness:build
Run WordCount example
./gradlew :examples:java:wordCount
Running Unit Tests
Run all tests in a project
./gradlew :sdks:java:harness:test
Run a specific test class
./gradlew :sdks:java:harness:test --tests org.apache.beam.fn.harness.CachesTest
Run tests matching a pattern
./gradlew :sdks:java:harness:test --tests *CachesTest
Run a specific test method
./gradlew :sdks:java:harness:test --tests *CachesTest.testClearableCache
Running Integration Tests
Integration tests have filenames ending in IT.java and use TestPipeline .
Run I/O integration tests on Direct Runner
./gradlew :sdks:java:io:google-cloud-platform:integrationTest
Run with custom GCP project
./gradlew :sdks:java:io:google-cloud-platform:integrationTest
-PgcpProject=<project> -PgcpTempRoot=gs://<bucket>/path
Run on Dataflow Runner
./gradlew :runners:google-cloud-dataflow-java:examplesJavaRunnerV2IntegrationTest
-PdisableSpotlessCheck=true -PdisableCheckStyle=true -PskipCheckerFramework
-PgcpProject=<project> -PgcpRegion=us-central1 -PgcsTempRoot=gs://<bucket>/tmp
Code Formatting
Format Java code
./gradlew spotlessApply
Writing Integration Tests
@Rule public TestPipeline pipeline = TestPipeline.create();
@Test public void testSomething() { pipeline.apply(...); pipeline.run().waitUntilFinish(); }
Set pipeline options via -DbeamTestPipelineOptions='[...]' :
-DbeamTestPipelineOptions='["--runner=TestDataflowRunner","--project=myproject","--region=us-central1","--stagingLocation=gs://bucket/path"]'
Using Modified Beam Code
Publish to Maven Local
Publish a specific module
./gradlew -Ppublishing -p sdks/java/io/kafka publishToMavenLocal
Publish all modules
./gradlew -Ppublishing publishToMavenLocal
Building SDK Container
Build Java SDK container (for Runner v2)
./gradlew :sdks:java:container:java11:docker
Tag and push
docker tag apache/beam_java11_sdk:2.XX.0.dev
"us-docker.pkg.dev/your-project/beam/beam_java11_sdk:custom"
docker push "us-docker.pkg.dev/your-project/beam/beam_java11_sdk:custom"
Building Dataflow Worker Jar
./gradlew :runners:google-cloud-dataflow-java:worker:shadowJar
Test Naming Conventions
-
Unit tests: *Test.java
-
Integration tests: *IT.java
JUnit Report Location
After running tests, find HTML reports at: <project>/build/reports/tests/test/index.html
IDE Setup (IntelliJ)
-
Open /beam (the repository root, NOT sdks/java )
-
Wait for indexing to complete
-
Find examples/java/build.gradle and click Run next to wordCount task to verify setup