Real-Time Oracle Database to Google BigQuery: A GoldenGate 23ai How-To Guide
For data engineers and architects, enabling real-time data flow from transactional Oracle databases into analytical platforms like Google BigQuery is a crucial task. This guide outlines the technical configuration steps for achieving this with Oracle GoldenGate 23ai, focusing on the core components and their specific settings for robust, high-performance replication.
Understanding the Three-Tier Replication Architecture
Oracle GoldenGate’s real-time data replication relies on a three-tier architecture, with each tier playing a distinct role in ensuring transactional consistency and efficient data movement:
- Extract: Captures transactional changes directly from the Oracle database’s redo logs.
- Distribution Service: Securely and efficiently moves the captured data (trail files) across the network.
- Replicat: Applies the changes to the target database, in this case, Google BigQuery.
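End to end, the flow looks like this (source and target are typically separate GoldenGate deployments):

Oracle redo logs → Extract → source trail (WW) → Distribution Service (ws://) → target trail (WW) → Replicat → BigQuery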
Let’s delve into the specific configurations for each component.
Step 1: Configuring the Oracle Extract Process
The Extract process is responsible for capturing Data Manipulation Language (DML) and Data Definition Language (DDL) changes from your source Oracle database. For optimal performance and real-time capture, an Integrated Extract is recommended.
Here’s a sample configuration for an Extract named `EXTTSCSF` capturing from `FREEPDB1.CTMS_PSO.TEST_1`:
EXTRACT EXTTSCSF
USERIDALIAS SOURCE_TSC DOMAIN OracleGoldenGate
EXTTRAIL WW
REPORTCOUNT EVERY 2 MINUTES, RATE
WARNLONGTRANS 30MIN CHECKINTERVAL 10MIN
TRANLOGOPTIONS INTEGRATEDPARAMS (MAX_SGA_SIZE 512, PARALLELISM 2)
NOCOMPRESSUPDATES
TABLE FREEPDB1.CTMS_PSO.TEST_1;
Key Parameter Explanations:
- USERIDALIAS SOURCE_TSC DOMAIN OracleGoldenGate: Specifies the credential alias used to connect securely to the source Oracle database. This alias should be pre-configured in the GoldenGate credential store.
- EXTTRAIL WW: Defines the two-character trail file identifier (`WW` in this case) for the Extract. These trail files contain the captured change data.
- REPORTCOUNT EVERY 2 MINUTES, RATE: Configures the Extract to report its processing statistics (e.g., records processed, throughput rate) every two minutes. This is vital for monitoring performance.
- WARNLONGTRANS 30MIN CHECKINTERVAL 10MIN: Sets a warning threshold for long-running transactions. If a transaction is open for more than 30 minutes, GoldenGate will log a warning, with checks performed every 10 minutes. This helps identify potential issues that could delay data delivery.
- TRANLOGOPTIONS INTEGRATEDPARAMS (MAX_SGA_SIZE 512, PARALLELISM 2): Passes tuning parameters to the Integrated Extract.
- MAX_SGA_SIZE 512: Allocates 512MB of SGA memory for the capture process, optimizing performance.
- PARALLELISM 2: Enables parallel capture threads for improved throughput.
- NOCOMPRESSUPDATES: Important for BigQuery integration. By default, Extract compresses update records, writing only the primary key and the changed columns to the trail; NOCOMPRESSUPDATES writes all columns of an updated row. Because the Replicat applies updates to BigQuery’s append-optimized storage via a `delete-insert` strategy, it needs the complete row image to reconstruct the new version of the record.
- TABLE FREEPDB1.CTMS_PSO.TEST_1;: Specifies the schema and table for which changes should be captured. You can list multiple tables or use wildcards (`*`) for entire schemas.
The output of the Extract process is the `WW` trail file, which contains the captured transaction data.
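The `SOURCE_TSC` credential alias referenced in the parameter file must exist before the Extract starts. A sketch of creating it from AdminClient against the source deployment follows; the database user and connect string are placeholders for your environment, and AdminClient prompts for the password when it is omitted:

```
ALTER CREDENTIALSTORE ADD USER c##ggadmin@//source-db-host:1521/FREEPDB1 ALIAS SOURCE_TSC DOMAIN OracleGoldenGate
INFO CREDENTIALSTORE
```

The `INFO CREDENTIALSTORE` command confirms the alias was stored under the expected domain.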
Step 2: Configuring the Distribution Service
The Distribution Service is responsible for securely and reliably transmitting the generated trail files from the source GoldenGate instance to the target GoldenGate instance, where the Replicat process will consume them. GoldenGate 23ai leverages a WebSocket (`ws://`) protocol for streaming, replacing older file-transfer methods.
While the exact Distribution Service configuration is performed within the GoldenGate Service Manager, the key elements are defining the source and target URIs for the data stream.
Distribution Service Configuration (Conceptual):
The Distribution Service needs to know where to find the source trail and where to send it. This is typically configured via source and target URIs:
- Source URI: `trail://<source_gg_host>:<source_manager_port>/services/v2/sources?trail=WW`
- This points to the specific trail file (`WW`) managed by the source GoldenGate instance’s Service Manager.
- Target URI: `ws://<target_gg_host>:<target_manager_port>/services/v2/targets?trail=WW`
- This establishes a WebSocket connection to the target GoldenGate instance’s Service Manager, directing the data to a specific trail on the target for the Replicat to read.
The Distribution Service handles network optimizations, buffering, and ensures data integrity during transit.
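A distribution path tying these two URIs together can also be created from AdminClient. The sketch below uses placeholder host names and ports; secured deployments would use `wss://` with TLS rather than plain `ws://`:

```
ADD DISTPATH dp_ww SOURCE trail://source_gg_host:9002/services/v2/sources?trail=WW TARGET ws://target_gg_host:9003/services/v2/targets?trail=WW
START DISTPATH dp_ww
```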
Step 3: Configuring the BigQuery Replicat Process
The Replicat process applies the changes from the trail files to the target BigQuery dataset. This involves reading the GoldenGate trail, transforming the data into a format suitable for BigQuery, and using the BigQuery Streaming Insert API to load the data.
Replicat Parameter File Configuration (`REPBQ`):
REPLICAT REPBQ
REPERROR(DEFAULT, ABEND)
REPORTCOUNT EVERY 1 MINUTES, RATE
GROUPTRANSOPS 10000
MAXTRANSOPS 20000
SOURCECATALOG FREEPDB1;
MAPEXCLUDE GGATE.HEARTBEAT;
MAPEXCLUDE CTMS_PSO.SLASH_TESTING;
MAP CTMS_PSO.TEST_1, TARGET CTMS_PSO.TEST_1;
Key Parameter Explanations:
- REPERROR(DEFAULT, ABEND): Configures the Replicat to `ABEND` (abnormally end) if it encounters a data error that cannot be resolved automatically. This “fail-fast” approach is critical for data integrity, preventing corrupted data from being applied to BigQuery.
- REPORTCOUNT EVERY 1 MINUTES, RATE: Provides granular reporting on Replicat’s progress and throughput every minute, allowing for real-time monitoring of data application.
- GROUPTRANSOPS 10000: Batches up to 10,000 operations into a single transaction before applying them to the target. This optimizes BigQuery streaming inserts by reducing API call overhead.
- MAXTRANSOPS 20000: Sets the maximum number of operations that can be processed within a single transaction group, providing a safeguard against excessively large transactions.
- SOURCECATALOG FREEPDB1;: Specifies the source pluggable database (PDB) name when using a Container Database (CDB) as the source.
- MAPEXCLUDE GGATE.HEARTBEAT;: Excludes specific tables (e.g., GoldenGate heartbeat tables or internal test tables) from being replicated. This ensures only relevant business data is sent to BigQuery.
- MAP CTMS_PSO.TEST_1, TARGET CTMS_PSO.TEST_1;: Defines the mapping between the source Oracle table (`CTMS_PSO.TEST_1`) and the target, which the BigQuery handler resolves to dataset `CTMS_PSO`, table `TEST_1`. The target name can differ from the source if required.
BigQuery Handler Configuration:
The BigQuery handler is configured within the Replicat’s properties file (often `dirprm/REPBQ.properties` or similar, depending on your GoldenGate installation). This configures how GoldenGate connects to and interacts with BigQuery.
gg.handlerlist = bigquery
gg.handler.bigquery.type = bigquery
gg.handler.bigquery.projectId = {project_id}
gg.handler.bigquery.credentialsFile = /path/to/your/service_account_key.json
gg.handler.bigquery.auditLogMode = true
gg.handler.bigquery.pkUpdateHandling = delete-insert
gg.handler.bigquery.metaColumnsTemplate = ${optype}, ${position}
gg.classpath = /opt/app/oracle/23.4.0.24.06/ogghome_1/opt/DependencyDownloader/dependencies/bigquerystreaming_3.9.2/*
Key Handler Parameter Explanations:
- gg.handlerlist = bigquery: Activates the BigQuery handler.
- gg.handler.bigquery.type = bigquery: Specifies the type of handler.
- gg.handler.bigquery.projectId = {project_id}: Your Google Cloud Project ID where the BigQuery dataset resides.
- gg.handler.bigquery.credentialsFile = /path/to/your/service_account_key.json: Path to the Google Cloud service account key JSON file with appropriate permissions to write to BigQuery.
- gg.handler.bigquery.auditLogMode = true: Enables detailed logging of BigQuery operations, which is useful for auditing and debugging.
- gg.handler.bigquery.pkUpdateHandling = delete-insert: **Critical for handling updates in BigQuery.** Since BigQuery is an append-only store, updates are typically handled by inserting a new row and marking the old one as invalid, or by performing a `DELETE` followed by an `INSERT`. This setting configures the handler to perform a `DELETE` then `INSERT` for update operations based on the primary key.
- gg.handler.bigquery.metaColumnsTemplate = ${optype}, ${position}: Adds metadata columns to your BigQuery table.
- ${optype}: The operation type (e.g., `I` for Insert, `U` for Update, `D` for Delete).
- ${position}: The GoldenGate read position, useful for data lineage and ensuring order.
- gg.classpath = /opt/app/oracle/23.4.0.24.06/ogghome_1/opt/DependencyDownloader/dependencies/bigquerystreaming_3.9.2/*: Specifies the classpath for the BigQuery streaming client JAR files and their dependencies. Ensure these libraries are downloaded (for example, via the DependencyDownloader utility shown in the path) and present at the specified location.
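To make `pkUpdateHandling = delete-insert` concrete, here is a minimal, hypothetical Python sketch of the transformation the handler performs for an update. This is illustrative only, not GoldenGate code; the dictionary shapes are invented for the example:

```python
# Illustrative sketch only -- NOT GoldenGate source code.
# Shows how pkUpdateHandling = delete-insert turns one captured UPDATE
# into a DELETE followed by an INSERT, keyed by the primary key.

def pk_update_to_delete_insert(update_op):
    """Map a captured UPDATE into the delete+insert pair applied to BigQuery.

    update_op: hypothetical dict carrying 'pk' (primary-key values) and the
    'before'/'after' row images captured by Extract.
    """
    return [
        {"optype": "D", "pk": update_op["pk"], "row": update_op["before"]},
        {"optype": "I", "pk": update_op["pk"], "row": update_op["after"]},
    ]

# An UPDATE that changes STATUS for the row with primary key ID = 42:
ops = pk_update_to_delete_insert({
    "pk": {"ID": 42},
    "before": {"ID": 42, "STATUS": "OPEN"},
    "after": {"ID": 42, "STATUS": "CLOSED"},
})
for op in ops:
    print(op["optype"], op["row"])
```

The ordering matters: the delete of the old image must be applied before the insert of the new image, so the target never holds two live versions of the same key.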
Technical Considerations and Best Practices
- Schema Consistency: Ensure that your BigQuery table schema accurately reflects the source Oracle table schema. Pay close attention to data types and precision.
- Primary Keys: Define primary keys on your source Oracle tables. GoldenGate relies on primary keys (or unique keys) for efficient `pkUpdateHandling` in BigQuery.
- Networking: Ensure robust, low-latency network connectivity between your Oracle database server (where Extract runs) and the GoldenGate server where Replicat runs, which in turn connects to Google Cloud.
- Resource Allocation: Monitor and adjust `MAX_SGA_SIZE`, `PARALLELISM`, `GROUPTRANSOPS`, and `MAXTRANSOPS` based on your transaction volume and network bandwidth to optimize throughput and resource utilization.
- Error Handling: Implement robust error handling strategies. The `REPERROR(DEFAULT, ABEND)` is a good start, but also configure alerts and monitoring for GoldenGate processes to quickly identify and resolve issues.
- Security: Always use service accounts with the least privilege necessary for BigQuery access, and store credentials securely.
- Monitoring: Leverage GoldenGate’s `INFO` and `STATS` commands, `REPORTCOUNT` settings, and BigQuery monitoring tools to keep a close eye on replication latency and performance.
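As a starting point for monitoring, the standard AdminClient commands below (using the Extract and Replicat names from this guide) cover the basic health checks for both ends of the pipeline:

```
INFO EXTRACT EXTTSCSF, DETAIL
STATS EXTRACT EXTTSCSF, TOTAL
INFO REPLICAT REPBQ, DETAIL
STATS REPLICAT REPBQ, TOTAL
LAG REPLICAT REPBQ
```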
By following these technical steps, you can establish a robust, real-time data pipeline from your Oracle database to Google BigQuery using Oracle GoldenGate 23ai, providing a fresh data foundation for your analytical workloads.
Bobby Curtis
I’m Bobby Curtis, and I’m just your normal, average guy who has been working in the technology field for a while (I started when I was 18 with the US Army). The goal of this blog has changed a bit over the years. Initially, it was a general blog where I wrote my thoughts down. Then it shifted to focus on the Oracle Database, Oracle Enterprise Manager, and eventually Oracle GoldenGate.
If you want to follow me in a more timely manner, you can find me on Twitter at @dbasolved or on LinkedIn under “Bobby Curtis MBA”.