Google BigQuery

You can use the Google BigQuery resource to collect metadata from the assets in Google BigQuery.

Objects Extracted

The Google BigQuery resource extracts metadata from the following assets in a Google BigQuery data source:

Permissions to Configure the Resource

Make sure that you perform the following steps before you configure the Google BigQuery resource:

Connect to a Google BigQuery Data Source Enabled for SSL

By default, all cloud resources are SSL-enabled. To connect to a Google BigQuery data source enabled for SSL, perform the following steps:
  1. Download the Google BigQuery SSL certificates using a web browser.
     Note: Make sure that you import the Google Trust Services certificate in the certification path.
  2. Copy the certificates to the <INFA_HOME>/services/shared/security/ directory.
  3. Go to the <INFA_HOME>/java/jre/bin directory and run the following keytool command to import each copied certificate as a trusted certificate in the Informatica domain keystore:
     keytool -import -file <INFA_HOME>/services/shared/security/<certificate>.cer -alias <alias name> -keystore <INFA_HOME>/services/shared/security/infa_truststore.jks -storepass <Informatica domain keystore password>
Note: If the proxy server used to connect to the data source is SSL enabled, you must download the proxy server certificates on the Informatica domain machine.
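As an alternative to downloading the certificates with a web browser, you can retrieve the certificate presented by the server from the command line. The following is a sketch using openssl; the bigquery.googleapis.com endpoint and the output file name are assumptions, and you should verify the full certification path before importing:

```shell
# Retrieve the certificate presented by the BigQuery endpoint (endpoint is an example).
# -showcerts prints the full chain; the openssl x509 step keeps only the first (leaf)
# certificate, so save intermediate and root certificates separately if you need them.
openssl s_client -connect bigquery.googleapis.com:443 \
  -servername bigquery.googleapis.com -showcerts </dev/null \
  | openssl x509 -outform PEM > bigquery_googleapis.cer
```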
You can also import certificates from the default Java truststore (cacerts) into the Informatica domain truststore. If you are unable to connect to the data source, run the following command:
$INFAINSTALL/java/jre/bin/keytool -importkeystore -srckeystore $INFAINSTALL/java/jre/lib/security/cacerts -destkeystore $INFAINSTALL/services/shared/security/infa_truststore.jks -srcstorepass changeit -deststorepass pass2038@infaSSL
For custom certificates, you can use the com.infa.ldm.certificates.importer utility packaged at <INFA_HOME>/services/CatalogService/ScannerBinaries/certificates-importer/certificates-importer.jar.
Pass the following arguments to the utility jar file:
  1. Directory containing the certificates
  2. Path to the truststore
  3. Truststore password
  4. -f. Optional. Provide this argument if you want to overwrite certificates that already exist in the truststore.
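For example, a run of the importer utility might look like the following sketch; the certificate directory and truststore password are placeholders, and the argument order follows the list above:

```shell
# Import all certificates from a directory into the Informatica domain truststore,
# overwriting entries that already exist (-f). Paths and password are placeholders.
java -jar <INFA_HOME>/services/CatalogService/ScannerBinaries/certificates-importer/certificates-importer.jar \
  /tmp/custom_certs \
  <INFA_HOME>/services/shared/security/infa_truststore.jks \
  <truststore password> \
  -f
```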

Basic Information

The General tab includes the following basic information about the resource:
  • Name. The name of the resource.
  • Description. The description of the resource.
  • Resource type. The type of the resource.
  • Execute On. You can choose to execute on the default catalog server or offline.

Resource Connection Properties

The General tab includes the following properties:
  • Project ID. Name of the Google Cloud Platform project that you want to access.
  • Private Key. The private key associated with the service account.
  • Client Email. The client email address associated with the service account.
  • Connect through a proxy server. Specifies whether to connect to the data source through a proxy server. Default is Disabled.
  • Proxy Host. Host name or IP address of the proxy server.
  • Proxy Port. Port number of the proxy server.
  • Proxy User Name. Authenticated user name to connect to the proxy server. Required for an authenticated proxy.
  • Proxy Password. Password for the authenticated user name. Required for an authenticated proxy.
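The Project ID, Private Key, and Client Email values correspond to fields in the JSON key file that Google Cloud generates for a service account. A truncated sketch, with placeholder names:

```json
{
  "type": "service_account",
  "project_id": "my-project",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "edc-scanner@my-project.iam.gserviceaccount.com"
}
```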
The Metadata Load Settings tab includes the following properties:
  • Enable Source Metadata. Extracts metadata from the data source.
  • Scan Hidden Datasets. Extracts metadata from hidden and anonymous datasets.
  • Dataset. Select the datasets that you want to use to import metadata from Google BigQuery tables in the project. Default is all datasets.
  • Source Metadata Filter. Include or exclude tables and views from the resource run. Use semicolons (;) to separate the table names and view names. For more information about the filter field, see Source Metadata and Data Profile Filter.
  • Case Sensitive. Specifies whether the resource is configured for case sensitivity. Select one of the following values:
    - True. Select this check box to configure the resource as case sensitive.
    - False. Clear this check box to configure the resource as case insensitive.
    Default is True.
  • Memory. Specifies the memory required to run the scanner job. Select Low, Medium, or High based on the data set size that you plan to import into the catalog.
    Note: For more information about the memory values, see the Tuning Enterprise Data Catalog Performance article on the How-To Library Articles tab in the Informatica Documentation Portal.
  • Custom Options. JVM parameters that you can set to configure the scanner container. Use the following arguments to configure the parameters:
    - -Dscannerloglevel=<DEBUG/INFO/ERROR>. Changes the scanner log level to DEBUG, ERROR, or INFO. Default is INFO.
    - -Dscanner.container.core=<number of cores>. Increases the number of cores for the scanner container. The value must be a number.
    - -Dscanner.yarn.app.environment=<key=value>. Key-value pairs that you need to set in the YARN environment. Use a comma to separate the key-value pairs.
    - -Dscanner.pmem.enabled.container.memory.jvm.memory.ratio=<1.0/2.0>. Increases the scanner container memory when pmem is enabled. Default is 1.
  • Track Data Source Changes. View metadata source change notifications in Enterprise Data Catalog.
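For example, the Custom Options field takes a space-separated list of JVM arguments; the following values are illustrative only:

```
-Dscannerloglevel=DEBUG -Dscanner.container.core=4 -Dscanner.yarn.app.environment=JAVA_HOME=/usr/java/default,SPARK_HOME=/opt/spark
```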

Identity and Access Management (IAM) Authentication for Google BigQuery

Assign users the following roles to run the profile in hybrid connection mode:
Because Google Cloud Storage is used for staging Google BigQuery tables in hybrid connection mode, you must also assign users the following storage roles to run the profile:

Sampling Options

You can perform sampling on the following types of tables:
Use the hybrid connection mode for profiling and metadata extraction from the Google BigQuery data source. The hybrid connection mode uses standard SQL to query Google BigQuery; Google recommends standard SQL over legacy SQL. The simple connection mode uses legacy SQL and should be avoided. Views that are standard SQL compliant are also supported in the hybrid connection mode.
The following table describes the sampling options that determine the number of rows on which to run the profile job for the Google BigQuery resource:
  • First N rows. Runs the profile on the first N rows in the resource. In the Number of First N Sampling Rows field that appears, enter the number of rows to run the profile on.
    Note: For optimal performance, consider using this sampling option.
    Pushdown to Google BigQuery: Yes. Sample pushdown query: LIMIT <N>
  • Random Percentage. Runs the profile on a percentage of data blocks in the data object. In the Random Percentage field that appears, enter the percentage of data blocks to run the profile on.
    Google BigQuery tables are organized into data blocks. The TABLESAMPLE clause randomly selects a percentage of data blocks from the table and reads all of the rows in the selected blocks. The sampling granularity is limited by the number of data blocks.
    Google BigQuery charges you for reading the data that is sampled. Google BigQuery does not cache the results of a query that includes a TABLESAMPLE clause, so each execution incurs the cost of reading the data from storage.
    For more information on table sampling in Google BigQuery, see Table Sampling in Google BigQuery.
    Pushdown to Google BigQuery: Yes. Sample pushdown query: TABLESAMPLE SYSTEM (<N> PERCENT)
  • Limit N rows. Runs the profile based on the number of rows in the data object. In the Number of Rows to Limit field that appears, enter the number of rows to run the profile on. The Data Integration Service reads all rows and then runs a sampling algorithm.
    Pushdown to Google BigQuery: No
  • Auto Random rows. Runs the profile on a random sample of rows. Enterprise Data Catalog computes the number of random rows based on the number of source rows. The Data Integration Service reads all rows and then runs a sampling algorithm.
    Pushdown to Google BigQuery: No
  • Random N rows. Runs the profile on the configured number of random rows. In the Random Sampling Rows field that appears, enter the number of rows that you want to run the profile on. The Data Integration Service reads all rows and then runs a sampling algorithm.
    Pushdown to Google BigQuery: No
  • All rows. Runs the profile on all the rows in the data source.
    Pushdown to Google BigQuery: No
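The two pushdown queries above can be tried directly against BigQuery, for example with the bq command-line tool; the project, dataset, and table names below are placeholders:

```shell
# First N rows: pushdown uses a LIMIT clause.
bq query --use_legacy_sql=false \
  'SELECT * FROM `my_project.my_dataset.my_table` LIMIT 1000'

# Random Percentage: pushdown uses TABLESAMPLE, which here samples about 10 percent
# of the table's data blocks rather than 10 percent of its rows.
bq query --use_legacy_sql=false \
  'SELECT * FROM `my_project.my_dataset.my_table` TABLESAMPLE SYSTEM (10 PERCENT)'
```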
Important: You can run a profile on a Google BigQuery resource that contains reserved keywords in the Google BigQuery table.
Pushdown to the Google BigQuery table does not occur when you run similarity profiling on the Google BigQuery resource with Random Percentage as the sampling option.
Note: The Google BigQuery resource does not support the following features: