| Property | Meaning | Usage |
|---|---|---|
| `table` | **(Deprecated)** The BigQuery table in the format `[[project:]dataset.]table`. It is recommended to use the path parameter of `load()`/`save()` instead; this option will be removed in a future version. | Read/Write |
| `dataset` | The dataset containing the table. This option should be used with standard tables and views, but not when loading query results. (Optional unless omitted in `table`) | Read/Write |
| `project` | The Google Cloud project ID of the table. This option should be used with standard tables and views, but not when loading query results. (Optional. Defaults to the project of the Service Account being used) | Read/Write |
| `billingProject` | The Google Cloud project ID to use for billing (API calls, query execution). (Optional. Defaults to the project of the Service Account being used) | Read/Write |
| `parentProject` | **(Deprecated)** Alias for `billingProject`. (Optional. Defaults to the project of the Service Account being used) | Read/Write |
| `location` | The BigQuery location where the data resides (e.g. `US`, `EU`, `asia-northeast1`). (Optional. Defaults to the BigQuery default) | Read/Write |
| `maxParallelism` | The maximal number of partitions to split the data into. The actual number may be less if BigQuery deems the data small enough; if there are not enough executors to schedule a reader per partition, some partitions may be empty. **Important:** the old parameter (`parallelism`) is still supported but deprecated, and will be removed in version 1.0 of the connector. (Optional. Defaults to the larger of `preferredMinParallelism` and 20,000) | Read |
| `preferredMinParallelism` | The preferred minimal number of partitions to split the data into. The actual number may be less if BigQuery deems the data small enough; if there are not enough executors to schedule a reader per partition, some partitions may be empty. (Optional. Defaults to the smaller of 3 times the application's default parallelism and `maxParallelism`) | Read |
| `viewsEnabled` | Enables the connector to read from views and not only tables. Please read the relevant section before activating this option. (Optional. Defaults to `false`) | Read |
| `readDataFormat` | Data format for reading from BigQuery. Options: `ARROW`, `AVRO`. (Optional. Defaults to `ARROW`) | Read |
| `optimizedEmptyProjection` | The connector uses an optimized empty-projection (select without any columns) logic for `count()` execution. This logic takes the data directly from the table metadata, or performs a much more efficient `SELECT COUNT(*) WHERE...` in case there is a filter. You can disable this logic by setting this option to `false`. (Optional. Defaults to `true`) | Read |
| `pushAllFilters` | **(Deprecated)** If set to `true`, the connector pushes all the filters Spark can delegate to the BigQuery Storage API. This reduces the amount of data that needs to be sent from BigQuery Storage API servers to Spark clients. This option will be removed in a future version. (Optional. Defaults to `true`) | Read |
| `bigQueryJobLabel` | Can be used to add labels to the connector-initiated query and load BigQuery jobs. Multiple labels can be set. (Optional) | Read |
| `bigQueryTableLabel` | Can be used to add labels to the table while writing to a table. Multiple labels can be set. (Optional) | Write |
| `traceApplicationName` | Application name used to trace BigQuery Storage read and write sessions. Setting the application name is required to set the trace ID on the sessions. (Optional) | Read |
| `traceJobId` | Job ID used to trace BigQuery Storage read and write sessions. (Optional. Defaults to the Dataproc job ID if it exists, otherwise the Spark application ID) | Read |
| `createDisposition` | Specifies whether the job is allowed to create new tables. The permitted values are `CREATE_IF_NEEDED` and `CREATE_NEVER`. (Optional. Defaults to `CREATE_IF_NEEDED`) | Write |
| `writeMethod` | Controls the method by which the data is written to BigQuery. Available values are `direct`, which uses the BigQuery Storage Write API, and `indirect`, which writes the data first to GCS and then triggers a BigQuery load operation. (Optional. Defaults to `indirect`) | Write |
| `writeAtLeastOnce` | Guarantees that data is written to BigQuery at least once; this is a lesser guarantee than exactly once. Suitable for streaming scenarios in which data is continuously written in small batches. Supported only by the `DIRECT` write method and when the mode is NOT `Overwrite`. (Optional. Defaults to `false`) | Write |
| `temporaryGcsBucket` | The GCS bucket that temporarily holds the data before it is loaded into BigQuery. Required unless set in the Spark configuration (`spark.conf.set(...)`). Starting with version 0.42.0, defaults to `fs.gs.system.bucket` if it exists, for example on Google Cloud Dataproc clusters. Supported only by the `INDIRECT` write method. | Write |
| `persistentGcsBucket` | The GCS bucket that holds the data before it is loaded into BigQuery. If provided, the data won't be deleted after it is written into BigQuery. Supported only by the `INDIRECT` write method. | Write |
| `persistentGcsPath` | The GCS path that holds the data before it is loaded into BigQuery. Used only with `persistentGcsBucket`. Not supported by the `DIRECT` write method. | Write |
| `intermediateFormat` | The format of the data before it is loaded into BigQuery; values can be `"parquet"`, `"orc"` or `"avro"`. In order to use the Avro format, the spark-avro package must be added at runtime. (Optional. Defaults to `parquet`) On write only. Supported only by the `INDIRECT` write method. | Write |
| `useAvroLogicalTypes` | When loading from Avro (`.option("intermediateFormat", "avro")`), BigQuery uses the underlying Avro types instead of the logical types [by default](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#logical_types). Supplying this option converts Avro logical types to their corresponding BigQuery data types. (Optional. Defaults to `false`) On write only. | Write |
| `datePartition` | The date partition the data is going to be written to, given as a date string in the format `YYYYMMDD`. Can be used to overwrite the data of a single partition. Can also be used with other partition types: HOUR (`YYYYMMDDHH`), MONTH (`YYYYMM`), YEAR (`YYYY`). (Optional) On write only. Not supported by the `DIRECT` write method. | Write |
| `partitionField` | If this field is specified, the table is partitioned by this field. For time partitioning, specify together with the option `partitionType`. For integer-range partitioning, specify together with the 3 options `partitionRangeStart`, `partitionRangeEnd` and `partitionRangeInterval`. The field must be a top-level `TIMESTAMP` or `DATE` field for time partitioning, or `INT64` for integer-range partitioning. Its mode must be `NULLABLE` or `REQUIRED`. If the option is not set for a time-partitioned table, the table is partitioned by a pseudo column, referenced via either `'_PARTITIONTIME'` as `TIMESTAMP` type or `'_PARTITIONDATE'` as `DATE` type. (Optional) Not supported by the `DIRECT` write method. | Write |
| `partitionExpirationMs` | Number of milliseconds for which to keep the storage for partitions in the table. The storage in a partition will have an expiration time of its partition time plus this value. (Optional) Not supported by the `DIRECT` write method. | Write |
| `partitionType` | Used to specify time partitioning. Supported types are `HOUR`, `DAY`, `MONTH`, `YEAR`. This option is mandatory for a target table to be time partitioned. (Optional. Defaults to `DAY` if `partitionField` is specified) Not supported by the `DIRECT` write method. | Write |
| `partitionRangeStart`, `partitionRangeEnd`, `partitionRangeInterval` | Used to specify integer-range partitioning. These options are mandatory for a target table to be integer-range partitioned; all 3 options must be specified. Not supported by the `DIRECT` write method. | Write |
| `clusteredFields` | A comma-separated string of non-repeated, top-level columns. (Optional) | Write |
| `allowFieldAddition` | Adds the `ALLOW_FIELD_ADDITION` SchemaUpdateOption to the BigQuery LoadJob. Allowed values are `true` and `false`. (Optional. Defaults to `false`) Supported only by the `INDIRECT` write method. | Write |
| `allowFieldRelaxation` | Adds the `ALLOW_FIELD_RELAXATION` SchemaUpdateOption to the BigQuery LoadJob. Allowed values are `true` and `false`. (Optional. Defaults to `false`) Supported only by the `INDIRECT` write method. | Write |
| `proxyAddress` | Address of the proxy server. The proxy must be an HTTP proxy and the address should be in the `host:port` format. Can alternatively be set in the Spark configuration (`spark.conf.set(...)`) or in the Hadoop configuration (`fs.gs.proxy.address`). (Optional. Required only if connecting to GCP via a proxy) | Read/Write |
| `proxyUsername` | The username used to connect to the proxy. Can alternatively be set in the Spark configuration (`spark.conf.set(...)`) or in the Hadoop configuration (`fs.gs.proxy.username`). (Optional. Required only if connecting to GCP via a proxy with authentication) | Read/Write |
| `proxyPassword` | The password used to connect to the proxy. Can alternatively be set in the Spark configuration (`spark.conf.set(...)`) or in the Hadoop configuration (`fs.gs.proxy.password`). (Optional. Required only if connecting to GCP via a proxy with authentication) | Read/Write |
| `httpMaxRetry` | The maximum number of retries for the low-level HTTP requests to BigQuery. Can alternatively be set in the Spark configuration (`spark.conf.set("httpMaxRetry", ...)`) or in the Hadoop configuration (`fs.gs.http.max.retry`). (Optional. Default is 10) | Read/Write |
| `httpConnectTimeout` | The timeout in milliseconds to establish a connection with BigQuery. Can alternatively be set in the Spark configuration (`spark.conf.set("httpConnectTimeout", ...)`) or in the Hadoop configuration (`fs.gs.http.connect-timeout`). (Optional. Default is 60000 ms. 0 for an infinite timeout, a negative number for 20000) | Read/Write |
| `httpReadTimeout` | The timeout in milliseconds to read data from an established connection. Can alternatively be set in the Spark configuration (`spark.conf.set("httpReadTimeout", ...)`) or in the Hadoop configuration (`fs.gs.http.read-timeout`). (Optional. Default is 60000 ms. 0 for an infinite timeout, a negative number for 20000) | Read |
| `arrowCompressionCodec` | Compression codec used while reading from a BigQuery table with the Arrow format. Options: `ZSTD` (Zstandard compression), `LZ4_FRAME` (https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md), `COMPRESSION_UNSPECIFIED`. The recommended compression codec is `ZSTD` when using Java. (Optional. Defaults to `COMPRESSION_UNSPECIFIED`, which means no compression is used) | Read |
| `responseCompressionCodec` | Compression codec used to compress the ReadRowsResponse data. Options: `RESPONSE_COMPRESSION_CODEC_UNSPECIFIED`, `RESPONSE_COMPRESSION_CODEC_LZ4`. (Optional. Defaults to `RESPONSE_COMPRESSION_CODEC_UNSPECIFIED`, which means no compression is used) | Read |
| `cacheExpirationTimeInMinutes` | The expiration time of the in-memory cache storing query information. To disable caching, set the value to 0. (Optional. Defaults to 15 minutes) | Read |
| `enableModeCheckForSchemaFields` | Checks that the mode of every field in the destination schema equals the mode of the corresponding source field during a DIRECT write. Default value is `true`, i.e. the check is done by default; if set to `false` the mode check is skipped. | Write |
| `enableListInference` | Indicates whether to use schema inference specifically when the mode is Parquet (https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#parquetoptions). Defaults to `false`. | Write |
| `bqChannelPoolSize` | The (fixed) size of the gRPC channel pool created by the BigQueryReadClient. For optimal performance, this should be set to at least the number of cores on the cluster executors. | Read |
| `createReadSessionTimeoutInSeconds` | The timeout in seconds to create a ReadSession when reading a table. For extremely large tables this value should be increased. (Optional. Defaults to 600 seconds) | Read |
| `queryJobPriority` | Priority level set for the job while reading data from a BigQuery query. The permitted values are `BATCH` and `INTERACTIVE`. (Optional. Defaults to `INTERACTIVE`) | Read/Write |
| `destinationTableKmsKeyName` | Describes the Cloud KMS encryption key that will be used to protect the destination BigQuery table. The BigQuery Service Account associated with your project requires access to this encryption key. For further information about using CMEK with BigQuery see [here](https://cloud.google.com/bigquery/docs/customer-managed-encryption#key_resource_id). **Notice:** the table will be encrypted by the key only if it was created by the connector; a pre-existing unencrypted table won't be encrypted just by setting this option. (Optional) | Write |
| `allowMapTypeConversion` | Boolean config to disable conversion from BigQuery records to Spark `MapType` when the record has two subfields with the field names `key` and `value`. Default value is `true`, which allows the conversion. (Optional) | Read |
| `spark.sql.sources.partitionOverwriteMode` | Specifies the overwrite mode on write when the table is range/time partitioned. Two modes are currently supported: `STATIC` and `DYNAMIC`. In `STATIC` mode the entire table is overwritten; in `DYNAMIC` mode the data is overwritten per partition of the existing table. The default value is `STATIC`. (Optional) | Write |
| `enableReadSessionCaching` | Boolean config to disable read session caching. Caches BigQuery read sessions to allow for faster Spark query planning. Default value is `true`. (Optional) | Read |
| `readSessionCacheDurationMins` | Sets the read session caching duration in minutes. Only works if `enableReadSessionCaching` is `true` (the default). Maximum allowed value is 300. Default value is 5. (Optional) | Read |
| `bigQueryJobTimeoutInMinutes` | Sets the BigQuery job timeout in minutes. Default value is 360 minutes. (Optional) | Read/Write |
| `snapshotTimeMillis` | A timestamp, specified in milliseconds, to use when reading a table snapshot. By default this is not set and the latest version of the table is read. (Optional) | Read |
| `bigNumericDefaultPrecision` | An alternative default precision for BigNumeric fields, as the BigQuery default is too wide for Spark. Values can be between 1 and 38. This default is used only when the field has an unparameterized BigNumeric type. Note that there may be data loss if the actual data's precision is higher than specified. (Optional) | Read/Write |
| `bigNumericDefaultScale` | An alternative default scale for BigNumeric fields. Values can be between 0 and 38, and must be less than `bigNumericDefaultPrecision`. This default is used only when the field has an unparameterized BigNumeric type. Note that there may be data loss if the actual data's scale is higher than specified. (Optional) | Read/Write |
| `credentialsScopes` | Replaces the scopes of the Google Credentials if the credentials type supports that; if scope replacement is not supported, it does nothing. The value should be a comma-separated list of valid scopes. (Optional) | Read/Write |
| BigQuery Standard SQL Data Type | Spark SQL Data Type | Notes |
|---|---|---|
| `BOOL` | `BooleanType` | |
| `INT64` | `LongType` | |
| `FLOAT64` | `DoubleType` | |
| `NUMERIC` | `DecimalType` | Please refer to Numeric and BigNumeric support |
| `BIGNUMERIC` | `DecimalType` | Please refer to Numeric and BigNumeric support |
| `STRING` | `StringType` | |
| `BYTES` | `BinaryType` | |
| `STRUCT` | `StructType` | |
| `ARRAY` | `ArrayType` | |
| `TIMESTAMP` | `TimestampType` | |
| `DATE` | `DateType` | |
| `DATETIME` | `StringType`, `TimestampNTZType`* | Spark has no DATETIME type. A Spark string can be written to an existing BQ DATETIME column provided it is in the format for BQ DATETIME literals. * For Spark 3.4+, BQ DATETIME is read as Spark's TimestampNTZ type, i.e. java LocalDateTime. |
| `TIME` | `LongType`, `StringType`* | Spark has no TIME type. The generated longs, which indicate microseconds since midnight, can be safely cast to `TimestampType`, but this causes the date to be inferred as the current day; times are therefore left as longs, which users can cast if they like. When cast to Timestamp, TIME has the same time-zone issues as DATETIME. * A Spark string can be written to an existing BQ TIME column provided it is in the format for BQ TIME literals. |
| `JSON` | `StringType` | Spark has no JSON type; the values are read as String. Writing JSON back to BigQuery is possible only when a set of required conditions is met. |
| `ARRAY<STRUCT<key,value>>` | `MapType` | BigQuery has no MAP type; therefore, similar to other conversions like Apache Avro and BigQuery load jobs, the connector converts a Spark Map to a REPEATED `STRUCT<key,value>`. This means that while writing and reading maps is available, running SQL on BigQuery that uses map semantics is not supported. To refer to the map's values using BigQuery SQL, please check the BigQuery documentation. Due to these incompatibilities, a few restrictions apply. |
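As noted in the TIME row above, BigQuery TIME values are read into Spark as longs counting microseconds since midnight. A minimal plain-Python sketch of converting such a value back into a time-of-day (the helper name is illustrative, not part of the connector):

```python
import datetime

def micros_since_midnight_to_time(micros: int) -> datetime.time:
    """Convert a BigQuery TIME value (microseconds since midnight) to datetime.time."""
    seconds, microsecond = divmod(micros, 1_000_000)
    minutes, second = divmod(seconds, 60)
    hour, minute = divmod(minutes, 60)
    return datetime.time(hour, minute, second, microsecond)

# 15:30:45.5 expressed as microseconds since midnight
value = (15 * 3600 + 30 * 60 + 45) * 1_000_000 + 500_000
print(micros_since_midnight_to_time(value))  # 15:30:45.500000
```

Note that this avoids the current-day inference issue mentioned above, because no date is attached.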
[maxParallelism](#properties) and [preferredMinParallelism](#properties) can be configured explicitly to control the number of partitions.
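The interplay of the two documented defaults (maxParallelism defaulting to the larger of preferredMinParallelism and 20,000; preferredMinParallelism defaulting to the smaller of 3× the application's default parallelism and maxParallelism) can be sketched in plain Python. This is a simplified model, not connector code, and the function name is invented:

```python
def default_partition_counts(app_default_parallelism, max_parallelism=None, preferred_min=None):
    """Model the documented defaults for preferredMinParallelism and maxParallelism."""
    if max_parallelism is None:
        # Larger of preferredMinParallelism (if set) and 20,000
        max_parallelism = max(preferred_min or 0, 20_000)
    if preferred_min is None:
        # Smaller of 3x the application's default parallelism and maxParallelism
        preferred_min = min(3 * app_default_parallelism, max_parallelism)
    return preferred_min, max_parallelism

print(default_partition_counts(100))  # (300, 20000)
```

The actual partition count may still be lower if BigQuery deems the data small enough.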
## Tagging BigQuery Resources
To support tracking the usage of BigQuery resources, the connector
offers the following options for tagging BigQuery resources:
### Adding BigQuery Jobs Labels
The connector can launch BigQuery load and query jobs. Adding labels to the jobs
is done in the following manner:
```
spark.conf.set("bigQueryJobLabel.cost_center", "analytics")
spark.conf.set("bigQueryJobLabel.usage", "nightly_etl")
```
This will create labels `cost_center`=`analytics` and `usage`=`nightly_etl`.
### Adding BigQuery Storage Trace ID
Used to annotate the read and write sessions. The trace ID is of the format
`Spark:ApplicationName:JobID`. This is an opt-in option; to use it, the user
needs to set the `traceApplicationName` property. The JobID is auto-generated from
the Dataproc job ID, with a fallback to the Spark application ID (such as
`application_1648082975639_0001`). The Job ID can be overridden by setting the
`traceJobId` option. Notice that the total length of the trace ID cannot exceed
256 characters.
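The trace ID construction described above can be sketched as follows (the function name and the explicit length check are illustrative, not connector API):

```python
def build_trace_id(application_name: str, job_id: str) -> str:
    """Compose the Spark:ApplicationName:JobID trace ID, enforcing the 256-char limit."""
    trace_id = f"Spark:{application_name}:{job_id}"
    if len(trace_id) > 256:
        raise ValueError("total length of the trace ID cannot be over 256 characters")
    return trace_id

print(build_trace_id("nightly_etl", "application_1648082975639_0001"))
# Spark:nightly_etl:application_1648082975639_0001
```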
## Using in Jupyter Notebooks
The connector can be used in [Jupyter notebooks](https://jupyter.org/) even if
it is not installed on the Spark cluster. It can be added as an external jar
using the following code:
**Python:**
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:${next-release-tag}") \
.getOrCreate()
df = spark.read.format("bigquery") \
.load("dataset.table")
```
**Scala:**
```scala
val spark = SparkSession.builder
.config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:${next-release-tag}")
.getOrCreate()
val df = spark.read.format("bigquery")
.load("dataset.table")
```
If the Spark cluster uses Scala 2.12 (optional for Spark 2.4.x,
mandatory for 3.0.x), the relevant package is
`com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:${next-release-tag}`. To
find out which Scala version is used, run the following code:
**Python:**
```python
spark.sparkContext._jvm.scala.util.Properties.versionString()
```
**Scala:**
```scala
scala.util.Properties.versionString
```
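Given that version string, picking the matching connector artifact can be sketched in plain Python (the helper name is invented; only the 2.12 suffix and coordinate pattern appear in this document):

```python
def connector_artifact(scala_version: str, release_tag: str) -> str:
    """Map a full Scala version (e.g. '2.12.18') to the connector's Maven coordinate."""
    suffix = ".".join(scala_version.split(".")[:2])  # '2.12.18' -> '2.12'
    return f"com.google.cloud.spark:spark-bigquery-with-dependencies_{suffix}:{release_tag}"

print(connector_artifact("2.12.18", "0.42.0"))
# com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.42.0
```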
## Compiling against the connector
Unless you wish to use the implicit Scala API `spark.read.bigquery("TABLE_ID")`, there is no need to compile against the connector.
To include the connector in your project:
### Maven
```xml
<dependency>
  <groupId>com.google.cloud.spark</groupId>
  <artifactId>spark-bigquery-with-dependencies_2.12</artifactId>
  <version>${next-release-tag}</version>
</dependency>
```
| Metric Name | Description |
|---|---|
| bytes read | number of BigQuery bytes read |
| rows read | number of BigQuery rows read |
| scan time | the amount of time spent between requesting the read-rows response and obtaining it, across all the executors, in milliseconds |
| parse time | the amount of time spent parsing the rows read, across all the executors, in milliseconds |
| spark time | the amount of time spent in Spark processing the queries (i.e., apart from scanning and parsing), across all the executors, in milliseconds |
Use `coalesce()` on the DataFrame to mitigate this problem.
```
desiredPartitionCount = 5
dfNew = df.coalesce(desiredPartitionCount)
dfNew.write
```
A rule of thumb is to have a single partition handle at least 1GB of data.
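This rule of thumb can be turned into a quick sizing calculation (a sketch; the 1 GB threshold comes from the text above, and the function name is invented):

```python
import math

def suggested_partition_count(total_bytes: int, bytes_per_partition: int = 1 << 30) -> int:
    """At least 1 GB of data per partition, and at least one partition."""
    return max(1, math.ceil(total_bytes / bytes_per_partition))

print(suggested_partition_count(5 * (1 << 30)))  # 5
```

The result can then be passed to `df.coalesce(...)` before writing.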
Also note that a job running with the `writeAtLeastOnce` property turned on will not encounter CreateWriteStream
quota errors.
### How do I authenticate outside GCE / Dataproc?
The connector needs an instance of a GoogleCredentials in order to connect to the BigQuery APIs. There are multiple
options to provide it:
* The default is to load the JSON key from the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, as described
[here](https://cloud.google.com/docs/authentication/getting-started).
* In case the environment variable cannot be changed, the credentials file can be configured
as a Spark option. The file should reside on the same path on all the nodes of the cluster.
```
// Globally
spark.conf.set("credentialsFile", "")
// Per read/Write
spark.read.format("bigquery").option("credentialsFile", "")
```
* Credentials can also be provided explicitly, either as a parameter or from Spark runtime configuration.
They should be passed in as a base64-encoded string directly.
```
// Globally
spark.conf.set("credentials", "| Property | Meaning | Usage |
|---|---|---|
table
|
The BigQuery table in the format [[project:]dataset.]table.
It is recommended to use the path parameter of
load()/save() instead. This option has been
deprecated and will be removed in a future version.
(Deprecated) |
Read/Write |
dataset
|
The dataset containing the table. This option should be used with
standard table and views, but not when loading query results.
(Optional unless omitted in table)
|
Read/Write |
project
|
The Google Cloud Project ID of the table. This option should be used with
standard table and views, but not when loading query results.
(Optional. Defaults to the project of the Service Account being used) |
Read/Write |
billingProject
|
The Google Cloud Project ID to use for billing (API calls, query execution).
(Optional. Defaults to the project of the Service Account being used) |
Read/Write |
parentProject
|
(Deprecated) Alias for billingProject.
(Optional. Defaults to the project of the Service Account being used) |
Read/Write |
location
|
The BigQuery location where the data resides (e.g. US, EU, asia-northeast1).
(Optional. Defaults to BigQuery default) |
Read/Write |
maxParallelism
|
The maximal number of partitions to split the data into. Actual number
may be less if BigQuery deems the data small enough. If there are not
enough executors to schedule a reader per partition, some partitions may
be empty.
Important: The old parameter ( parallelism) is
still supported but in deprecated mode. It will ve removed in
version 1.0 of the connector.
(Optional. Defaults to the larger of the preferredMinParallelism and 20,000).) |
Read |
preferredMinParallelism
|
The preferred minimal number of partitions to split the data into. Actual number
may be less if BigQuery deems the data small enough. If there are not
enough executors to schedule a reader per partition, some partitions may
be empty.
(Optional. Defaults to the smallest of 3 times the application's default parallelism and maxParallelism.) |
Read |
viewsEnabled
|
Enables the connector to read from views and not only tables. Please read
the relevant section before activating
this option.
(Optional. Defaults to false)
|
Read |
readDataFormat
|
Data Format for reading from BigQuery. Options : ARROW, AVRO
(Optional. Defaults to ARROW)
|
Read |
optimizedEmptyProjection
|
The connector uses an optimized empty projection (select without any
columns) logic, used for count() execution. This logic takes
the data directly from the table metadata or performs a much efficient
`SELECT COUNT(*) WHERE...` in case there is a filter. You can cancel the
use of this logic by setting this option to false.
(Optional, defaults to true)
|
Read |
pushAllFilters
|
If set to true, the connector pushes all the filters Spark can delegate
to BigQuery Storage API. This reduces amount of data that needs to be sent from
BigQuery Storage API servers to Spark clients. This option has been
deprecated and will be removed in a future version.
(Optional, defaults to true)
(Deprecated) |
Read |
bigQueryJobLabel
|
Can be used to add labels to the connector initiated query and load
BigQuery jobs. Multiple labels can be set.
(Optional) |
Read |
bigQueryTableLabel
|
Can be used to add labels to the table while writing to a table. Multiple
labels can be set.
(Optional) |
Write |
traceApplicationName
|
Application name used to trace BigQuery Storage read and write sessions.
Setting the application name is required to set the trace ID on the
sessions.
(Optional) |
Read |
traceJobId
|
Job ID used to trace BigQuery Storage read and write sessions.
(Optional, defaults to the Dataproc job ID is exists, otherwise uses the Spark application ID) |
Read |
createDisposition
|
Specifies whether the job is allowed to create new tables. The permitted
values are:
(Optional. Default to CREATE_IF_NEEDED). |
Write |
writeMethod
|
Controls the method
in which the data is written to BigQuery. Available values are direct
to use the BigQuery Storage Write API and indirect which writes the
data first to GCS and then triggers a BigQuery load operation. See more
here
(Optional, defaults to indirect)
|
Write |
writeAtLeastOnce
|
Guarantees that data is written to BigQuery at least once. This is a lesser
guarantee than exactly once. This is suitable for streaming scenarios
in which data is continuously being written in small batches.
(Optional. Defaults to false)
Supported only by the `DIRECT` write method and mode is NOT `Overwrite`. |
Write |
temporaryGcsBucket
|
The GCS bucket that temporarily holds the data before it is loaded to
BigQuery. Required unless set in the Spark configuration
(spark.conf.set(...)).
Defaults to the `fs.gs.system.bucket` if exists, for example on Google Cloud Dataproc clusters, starting version 0.42.0. Supported only by the `INDIRECT` write method. |
Write |
persistentGcsBucket
|
The GCS bucket that holds the data before it is loaded to
BigQuery. If informed, the data won't be deleted after write data
into BigQuery.
Supported only by the `INDIRECT` write method. |
Write |
persistentGcsPath
|
The GCS path that holds the data before it is loaded to
BigQuery. Used only with persistentGcsBucket.
Not supported by the `DIRECT` write method. |
Write |
intermediateFormat
|
The format of the data before it is loaded to BigQuery, values can be
either "parquet","orc" or "avro". In order to use the Avro format, the
spark-avro package must be added in runtime.
(Optional. Defaults to parquet). On write only. Supported only for the `INDIRECT` write method.
|
Write |
useAvroLogicalTypes
|
When loading from Avro (`.option("intermediateFormat", "avro")`), BigQuery uses the underlying Avro types instead of the logical types [by default](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#logical_types). Supplying this option converts Avro logical types to their corresponding BigQuery data types.
(Optional. Defaults to false). On write only.
|
Write |
datePartition
|
The date partition the data is going to be written to. Should be a date string
given in the format YYYYMMDD. Can be used to overwrite the data of
a single partition, like this:
(Optional). On write only. Can also be used with different partition types like: HOUR: YYYYMMDDHH
MONTH: YYYYMM
YEAR: YYYY
Not supported by the `DIRECT` write method. |
Write |
partitionField
|
If this field is specified, the table is partitioned by this field.
For Time partitioning, specify together with the option `partitionType`. For Integer-range partitioning, specify together with the 3 options: `partitionRangeStart`, `partitionRangeEnd, `partitionRangeInterval`. The field must be a top-level TIMESTAMP or DATE field for Time partitioning, or INT64 for Integer-range partitioning. Its mode must be NULLABLE or REQUIRED. If the option is not set for a Time partitioned table, then the table will be partitioned by pseudo column, referenced via either '_PARTITIONTIME' as TIMESTAMP type, or
'_PARTITIONDATE' as DATE type.
(Optional). Not supported by the `DIRECT` write method. |
Write |
partitionExpirationMs
|
Number of milliseconds for which to keep the storage for partitions in the table.
The storage in a partition will have an expiration time of its partition time plus this value.
(Optional). Not supported by the `DIRECT` write method. |
Write |
partitionType
|
Used to specify Time partitioning.
Supported types are: HOUR, DAY, MONTH, YEAR
This option is mandatory for a target table to be Time partitioned. (Optional. Defaults to DAY if PartitionField is specified). Not supported by the `DIRECT` write method. |
Write |
partitionRangeStart,
partitionRangeEnd,
partitionRangeInterval
|
Used to specify Integer-range partitioning.
These options are mandatory for a target table to be Integer-range partitioned. All 3 options must be specified. Not supported by the `DIRECT` write method. |
Write |
clusteredFields
|
A string of non-repeated, top level columns seperated by comma.
(Optional). |
Write |
allowFieldAddition
|
Adds the ALLOW_FIELD_ADDITION
SchemaUpdateOption to the BigQuery LoadJob. Allowed values are true and false.
(Optional. Default to false).
Supported only by the `INDIRECT` write method. |
Write |
allowFieldRelaxation
|
Adds the ALLOW_FIELD_RELAXATION
SchemaUpdateOption to the BigQuery LoadJob. Allowed values are true and false.
(Optional. Default to false).
Supported only by the `INDIRECT` write method. |
Write |
proxyAddress
|
Address of the proxy server. The proxy must be a HTTP proxy and address should be in the `host:port` format.
Can be alternatively set in the Spark configuration (spark.conf.set(...)) or in Hadoop
Configuration (fs.gs.proxy.address).
(Optional. Required only if connecting to GCP via proxy.) |
Read/Write |
proxyUsername
|
The userName used to connect to the proxy. Can be alternatively set in the Spark configuration
(spark.conf.set(...)) or in Hadoop Configuration (fs.gs.proxy.username).
(Optional. Required only if connecting to GCP via proxy with authentication.) |
Read/Write |
proxyPassword
|
The password used to connect to the proxy. Can be alternatively set in the Spark configuration
(spark.conf.set(...)) or in Hadoop Configuration (fs.gs.proxy.password).
(Optional. Required only if connecting to GCP via proxy with authentication.) |
Read/Write |
httpMaxRetry
|
The maximum number of retries for the low-level HTTP requests to BigQuery. Can be alternatively set in the
Spark configuration (spark.conf.set("httpMaxRetry", ...)) or in Hadoop Configuration
(fs.gs.http.max.retry).
(Optional. Default is 10) |
Read/Write |
httpConnectTimeout
|
The timeout in milliseconds to establish a connection with BigQuery. Can be alternatively set in the
Spark configuration (spark.conf.set("httpConnectTimeout", ...)) or in Hadoop Configuration
(fs.gs.http.connect-timeout).
(Optional. Default is 60000 ms. 0 for an infinite timeout, a negative number for 20000) |
Read/Write |
httpReadTimeout
|
The timeout in milliseconds to read data from an established connection. Can be alternatively set in the
Spark configuration (spark.conf.set("httpReadTimeout", ...)) or in Hadoop Configuration
(fs.gs.http.read-timeout).
(Optional. Default is 60000 ms. 0 for an infinite timeout, a negative number for 20000) |
Read |
arrowCompressionCodec
|
Compression codec while reading from a BigQuery table when using Arrow format. Options :
ZSTD (Zstandard compression),
LZ4_FRAME (https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md),
COMPRESSION_UNSPECIFIED. The recommended compression codec is ZSTD
while using Java.
(Optional. Defaults to COMPRESSION_UNSPECIFIED which means no compression will be used)
|
Read |
responseCompressionCodec
|
Compression codec used to compress the ReadRowsResponse data. Options:
RESPONSE_COMPRESSION_CODEC_UNSPECIFIED,
RESPONSE_COMPRESSION_CODEC_LZ4
(Optional. Defaults to RESPONSE_COMPRESSION_CODEC_UNSPECIFIED which means no compression will be used)
|
Read |
cacheExpirationTimeInMinutes
|
The expiration time of the in-memory cache storing query information.
To disable caching, set the value to 0. (Optional. Defaults to 15 minutes) |
Read |
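The expiration behavior of this cache can be pictured with a minimal in-memory TTL cache. This is a plain-Python sketch of the semantics, not the connector's implementation:

```python
import time

class TtlCache:
    # Minimal sketch of an expiring cache, mirroring cacheExpirationTimeInMinutes semantics.
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}

    def put(self, key, value):
        # Remember the value together with its expiration deadline.
        self.store[key] = (value, time.monotonic() + self.ttl)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Entry expired: evict it and report a miss.
            del self.store[key]
            return None
        return value

cache = TtlCache(ttl_seconds=15 * 60)   # default: 15 minutes
cache.put("query-info", {"rows": 42})
print(cache.get("query-info"))  # {'rows': 42}
```

Setting the TTL to 0 corresponds to disabling caching, since every entry is already expired on lookup.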
enableModeCheckForSchemaFields
|
Checks that the mode of every field in the destination schema matches the mode of the corresponding source field, during a DIRECT write.
Default value is true, i.e. the check is done by default. If set to false the mode check is ignored. |
Write |
enableListInference
|
Indicates whether to use schema inference specifically when the mode is Parquet (https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#parquetoptions).
Defaults to false. |
Write |
bqChannelPoolSize |
The (fixed) size of the gRPC channel pool created by the BigQueryReadClient.
For optimal performance, this should be set to at least the number of cores on the cluster executors. |
Read |
createReadSessionTimeoutInSeconds
|
The timeout in seconds to create a ReadSession when reading a table.
For extremely large tables this value should be increased. (Optional. Defaults to 600 seconds) |
Read |
queryJobPriority
|
Priority levels set for the job while reading data from a BigQuery query. The permitted values are
BATCH and INTERACTIVE.
(Optional. Defaults to INTERACTIVE)
|
Read/Write |
destinationTableKmsKeyName
|
Describes the Cloud KMS encryption key that will be used to protect the destination BigQuery
table. The BigQuery Service Account associated with your project requires access to this
encryption key. For further information about using CMEK with BigQuery see
[here](https://cloud.google.com/bigquery/docs/customer-managed-encryption#key_resource_id).
Notice: The table will be encrypted by the key only if it was created by the connector. A pre-existing unencrypted table won't be encrypted just by setting this option. (Optional) |
Write |
allowMapTypeConversion
|
Boolean config to disable conversion from BigQuery records to Spark MapType
when the record has two subfields with field names as key and value.
Default value is true which allows the conversion.
(Optional) |
Read |
spark.sql.sources.partitionOverwriteMode
|
Config to specify the overwrite mode on write when the table is range/time partitioned.
Two modes are currently supported: STATIC and DYNAMIC. In STATIC mode,
the entire table is overwritten. In DYNAMIC mode, only the partitions to which data is written are overwritten.
The default value is STATIC.
(Optional) |
Write |
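As an illustration, dynamic partition overwrite can be enabled like this (a config sketch assuming an active `SparkSession` named `spark` and a DataFrame `df`; the table name is illustrative):

```python
# Sketch: overwrite only the partitions present in the written data.
# Assumes an active SparkSession `spark` and a DataFrame `df` (not defined here).
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "DYNAMIC")

df.write.format("bigquery") \
    .mode("overwrite") \
    .save("dataset.partitioned_table")
```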
enableReadSessionCaching
|
Boolean config for read session caching; set to false to disable it. Caching BigQuery read sessions allows for faster Spark query planning.
Default value is true.
(Optional) |
Read |
readSessionCacheDurationMins
|
Config to set the read session caching duration in minutes. Only works if enableReadSessionCaching is true (default).
Allows specifying the duration to cache read sessions for. Maximum allowed value is 300.
Default value is 5.
(Optional) |
Read |
bigQueryJobTimeoutInMinutes
|
Config to set the BigQuery job timeout in minutes.
Default value is 360 minutes.
(Optional) |
Read/Write |
snapshotTimeMillis
|
A timestamp specified in milliseconds to use to read a table snapshot.
By default this is not set and the latest version of a table is read.
(Optional) |
Read |
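For example, a snapshot timestamp in milliseconds can be derived in plain Python (the date here is illustrative):

```python
import datetime

# Compute a snapshotTimeMillis value for a chosen point in time (UTC).
snapshot = datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc)
snapshot_millis = int(snapshot.timestamp() * 1000)
print(snapshot_millis)  # 1704067200000
```

The resulting integer is what would be passed as the `snapshotTimeMillis` option on read.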
bigNumericDefaultPrecision
|
An alternative default precision for BigNumeric fields, as the BigQuery default is too wide for Spark. Values can be between 1 and 38.
This default is used only when the field has an unparameterized BigNumeric type.
Please note that there might be data loss if the actual data's precision is more than what is specified.
(Optional) |
Read/Write |
bigNumericDefaultScale
|
An alternative default scale for BigNumeric fields. Values can be between 0 and 38, and less than bigNumericDefaultPrecision.
This default is used only when the field has an unparameterized BigNumeric type.
Please note that there might be data loss if the actual data's scale is more than what is specified.
(Optional) |
Read/Write |
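The potential data loss can be illustrated with Python's `decimal` module. The precision value below is a hypothetical setting, and this is not connector code:

```python
from decimal import Decimal, Context

# Suppose bigNumericDefaultPrecision=10 (hypothetical): values with more than
# 10 significant digits cannot be represented exactly, so trailing digits are lost.
ctx = Context(prec=10)
wide = Decimal("123456789012.34")  # 14 significant digits
narrowed = ctx.plus(wide)          # rounded to 10 significant digits
print(narrowed)
```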
credentialsScopes
|
Replaces the scopes of the Google Credentials if the credentials type supports that.
If scope replacement is not supported then it does nothing.
The value should be a comma separated list of valid scopes. (Optional) |
Read/Write |
| BigQuery Standard SQL Data Type | Spark SQL Data Type | Notes |
|---|---|---|
| BOOL | BooleanType | |
| INT64 | LongType | |
| FLOAT64 | DoubleType | |
| NUMERIC | DecimalType | Please refer to Numeric and BigNumeric support |
| BIGNUMERIC | DecimalType | Please refer to Numeric and BigNumeric support |
| STRING | StringType | |
| BYTES | BinaryType | |
| STRUCT | StructType | |
| ARRAY | ArrayType | |
| TIMESTAMP | TimestampType | |
| DATE | DateType | |
| DATETIME | StringType, TimestampNTZType* | Spark has no DATETIME type. A Spark string can be written to an existing BQ DATETIME column provided it is in the format for BQ DATETIME literals. *For Spark 3.4+, BQ DATETIME is read as Spark's TimestampNTZ type, i.e. java LocalDateTime |
| TIME | LongType, StringType* | Spark has no TIME type. The generated longs, which indicate microseconds since midnight, can be safely cast to TimestampType, but this causes the date to be inferred as the current day. Thus times are left as longs and the user can cast if they like. When casting to Timestamp, TIME has the same time zone issues as DATETIME. *A Spark string can be written to an existing BQ TIME column provided it is in the format for BQ TIME literals. |
| JSON | StringType | Spark has no JSON type. The values are read as String. In order to write JSON back to BigQuery, the following conditions are required: |
| ARRAY<STRUCT<key,value>> | MapType | BigQuery has no MAP type, therefore, similar to other conversions like Apache Avro and BigQuery Load jobs, the connector converts a Spark Map to a REPEATED STRUCT<key,value>. This means that while writing and reading of maps is available, running SQL on BigQuery that uses map semantics is not supported. To refer to the map's values using BigQuery SQL, please check the BigQuery documentation. Due to these incompatibilities, a few restrictions apply: |
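For instance, a TIME value surfaced as microseconds since midnight can be converted back to a wall-clock time in plain Python (a sketch, independent of the connector):

```python
import datetime

def bq_time_to_python(micros_since_midnight: int) -> datetime.time:
    # The connector surfaces BigQuery TIME as microseconds since midnight.
    seconds, micros = divmod(micros_since_midnight, 1_000_000)
    minutes, second = divmod(seconds, 60)
    hour, minute = divmod(minutes, 60)
    return datetime.time(hour, minute, second, micros)

print(bq_time_to_python(45_296_789_000))  # 12:34:56.789000
```

Unlike casting to TimestampType, this conversion involves no date and therefore no time zone issues.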
[maxParallelism](#properties) and [preferredMinParallelism](#properties) can be configured explicitly to control the number of partitions.
## Tagging BigQuery Resources
In order to support tracking the usage of BigQuery resources the connector
offers the following options to tag BigQuery resources:
### Adding BigQuery Jobs Labels
The connector can launch BigQuery load and query jobs. Adding labels to the jobs
is done in the following manner:
```
spark.conf.set("bigQueryJobLabel.cost_center", "analytics")
spark.conf.set("bigQueryJobLabel.usage", "nightly_etl")
```
This will create labels `cost_center`=`analytics` and `usage`=`nightly_etl`.
### Adding BigQuery Storage Trace ID
Used to annotate the read and write sessions. The trace ID is of the format
`Spark:ApplicationName:JobID`. This is an opt-in option, and to use it the user
needs to set the `traceApplicationName` property. The Job ID is auto-generated from the
Dataproc job ID, with a fallback to the Spark application ID (such as
`application_1648082975639_0001`). The Job ID can be overridden by setting the
`traceJobId` option. Notice that the total length of the trace ID cannot be over
256 characters.
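The composition and length rule can be sketched in plain Python (the names below are illustrative; this is not the connector's code):

```python
def build_trace_id(application_name: str, job_id: str) -> str:
    # Trace ID format used to annotate read/write sessions: Spark:ApplicationName:JobID
    trace_id = "Spark:{}:{}".format(application_name, job_id)
    if len(trace_id) > 256:
        raise ValueError("trace ID cannot be over 256 characters")
    return trace_id

print(build_trace_id("nightly_etl", "application_1648082975639_0001"))
# Spark:nightly_etl:application_1648082975639_0001
```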
## Using in Jupyter Notebooks
The connector can be used in [Jupyter notebooks](https://jupyter.org/) even if
it is not installed on the Spark cluster. It can be added as an external jar
using the following code:
**Python:**
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.44.1") \
.getOrCreate()
df = spark.read.format("bigquery") \
.load("dataset.table")
```
**Scala:**
```scala
val spark = SparkSession.builder
.config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.44.1")
.getOrCreate()
val df = spark.read.format("bigquery")
.load("dataset.table")
```
If the Spark cluster uses Scala 2.12 (optional for Spark 2.4.x,
mandatory in 3.0.x), then the relevant package is
com.google.cloud.spark:spark-bigquery-with-dependencies_**2.12**:0.44.1. In
order to know which Scala version is used, please run the following code:
**Python:**
```python
spark.sparkContext._jvm.scala.util.Properties.versionString()
```
**Scala:**
```scala
scala.util.Properties.versionString
```
## Compiling against the connector
Unless you wish to use the implicit Scala API `spark.read.bigquery("TABLE_ID")`, there is no need to compile against the connector.
To include the connector in your project:
### Maven
```xml
<dependency>
  <groupId>com.google.cloud.spark</groupId>
  <artifactId>spark-bigquery-with-dependencies_2.12</artifactId>
  <version>0.44.1</version>
</dependency>
```

| Metric Name | Description |
|---|---|
| bytes read | number of BigQuery bytes read |
| rows read | number of BigQuery rows read |
| scan time | the time elapsed between requesting and obtaining the read rows responses, across all the executors, in milliseconds |
| parse time | the amount of time spent parsing the rows read, across all the executors, in milliseconds |
| spark time | the amount of time spent in Spark processing the queries (i.e., apart from scanning and parsing), across all the executors, in milliseconds |
Use `coalesce` on the DataFrame to mitigate this problem.
```
desiredPartitionCount = 5
dfNew = df.coalesce(desiredPartitionCount)
dfNew.write
```
A rule of thumb is to have a single partition handle at least 1GB of data.
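That rule of thumb can be turned into a small helper (a plain-Python sketch; the function name is illustrative):

```python
def desired_partition_count(total_bytes: int, bytes_per_partition: int = 1 << 30) -> int:
    # Rule of thumb: each partition should handle at least 1 GB,
    # so use at most total_bytes // 1 GB partitions (minimum 1).
    return max(1, total_bytes // bytes_per_partition)

print(desired_partition_count(5 * (1 << 30)))  # 5
```

The result would then be passed to `df.coalesce(...)` before writing.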
Also note that a job running with the `writeAtLeastOnce` property turned on will not encounter CreateWriteStream
quota errors.
### How do I authenticate outside GCE / Dataproc?
The connector needs an instance of a GoogleCredentials in order to connect to the BigQuery APIs. There are multiple
options to provide it:
* The default is to load the JSON key from the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, as described
[here](https://cloud.google.com/docs/authentication/getting-started).
* In case the environment variable cannot be changed, the credentials file can be configured
as a Spark option. The file should reside on the same path on all the nodes of the cluster.
```
// Globally
spark.conf.set("credentialsFile", "")
// Per read/Write
spark.read.format("bigquery").option("credentialsFile", "")
```
* Credentials can also be provided explicitly, either as a parameter or from Spark runtime configuration.
They should be passed in as a base64-encoded string directly.
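The base64 string can be produced, for instance, like this (the key JSON below is a hypothetical stand-in; a real key file comes from GCP IAM):

```python
import base64

# Hypothetical service-account key JSON; the real file is downloaded from GCP IAM.
key_json = '{"type": "service_account", "project_id": "my-project"}'
credentials_b64 = base64.b64encode(key_json.encode("utf-8")).decode("utf-8")
print(credentials_b64)
```

The resulting string is what gets passed as the `credentials` option.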
```
// Globally
spark.conf.set("credentials", "<SERVICE_ACCOUNT_JSON_IN_BASE64>")
// Per read/Write
spark.read.format("bigquery").option("credentials", "<SERVICE_ACCOUNT_JSON_IN_BASE64>")
```
public TableInfo createTable(TableId tableId, Schema schema) {
return createTable(tableId, schema, Optional.empty());
}
/**
* Creates an empty table in BigQuery.
*
* @param tableId The TableId of the table to be created.
* @param schema The Schema of the table to be created.
* @param options Allows configuring the created table
* @return The {@code Table} object representing the table that was created.
*/
public TableInfo createTable(TableId tableId, Schema schema, CreateTableOptions options) {
StandardTableDefinition.Builder tableDefinition =
StandardTableDefinition.newBuilder().setSchema(schema);
options
.getClusteredFields()
.ifPresent(
clusteredFields ->
tableDefinition.setClustering(
Clustering.newBuilder().setFields(clusteredFields).build()));
TableInfo.Builder tableInfo = TableInfo.newBuilder(tableId, tableDefinition.build());
options
.getKmsKeyName()
.ifPresent(
keyName ->
tableInfo.setEncryptionConfiguration(
EncryptionConfiguration.newBuilder().setKmsKeyName(keyName).build()));
if (!options.getBigQueryTableLabels().isEmpty()) {
tableInfo.setLabels(options.getBigQueryTableLabels());
}
return bigQuery.create(tableInfo.build());
}
/**
* Creates a temporary table with a job to cleanup after application end, and the same location as
* the destination table; the temporary table will have the same name as the destination table,
with the current time in nanoseconds appended to it; useful for holding temporary data in
* order to overwrite the destination table.
*
* @param destinationTableId The TableId of the eventual destination for the data going into the
* temporary table.
* @param schema The Schema of the destination / temporary table.
* @return The {@code Table} object representing the created temporary table.
*/
public TableInfo createTempTable(TableId destinationTableId, Schema schema) {
TableId tempTableId = createTempTableId(destinationTableId);
TableInfo tableInfo =
TableInfo.newBuilder(tempTableId, StandardTableDefinition.of(schema)).build();
TableInfo tempTable = bigQuery.create(tableInfo);
CLEANUP_JOBS.add(() -> deleteTable(tempTable.getTableId()));
return tempTable;
}
public TableInfo createTempTableAfterCheckingSchema(
TableId destinationTableId, Schema schema, boolean enableModeCheckForSchemaFields)
throws IllegalArgumentException {
TableInfo destinationTable = getTable(destinationTableId);
Schema tableSchema = destinationTable.getDefinition().getSchema();
ComparisonResult schemaWritableResult =
BigQueryUtil.schemaWritable(
schema, // sourceSchema
tableSchema, // destinationSchema
false, // regardFieldOrder
enableModeCheckForSchemaFields);
Preconditions.checkArgument(
schemaWritableResult.valuesAreEqual(),
new BigQueryConnectorException.InvalidSchemaException(
"Destination table's schema is not compatible with dataframe's schema. "
+ schemaWritableResult.makeMessage()));
return createTempTable(destinationTableId, schema);
}
public TableId createTempTableId(TableId destinationTableId) {
String tempProject = materializationProject.orElseGet(destinationTableId::getProject);
String tempDataset = materializationDataset.orElseGet(destinationTableId::getDataset);
String tableName = destinationTableId.getTable() + System.nanoTime();
TableId tempTableId =
tempProject == null
? TableId.of(tempDataset, tableName)
: TableId.of(tempProject, tempDataset, tableName);
return tempTableId;
}
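The temporary-table naming scheme can be sketched in Python, using wall-clock nanoseconds as a stand-in for Java's `System.nanoTime()`:

```python
import time

def create_temp_table_name(destination_table: str) -> str:
    # Mirrors the connector's scheme: destination table name plus a
    # nanosecond timestamp, yielding a practically unique temp-table name.
    return destination_table + str(time.time_ns())

print(create_temp_table_name("orders"))
```

Because the suffix changes every call, repeated overwrites of the same destination never collide on the temporary table.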
/**
* Deletes this table in BigQuery.
*
* @param tableId The TableId of the table to be deleted.
* @return True if the operation was successful, false otherwise.
*/
public boolean deleteTable(TableId tableId) {
log.info("Deleting table " + fullTableName(tableId));
return bigQuery.delete(tableId);
}
private Job copyData(
TableId sourceTableId,
TableId destinationTableId,
JobInfo.WriteDisposition writeDisposition) {
String queryFormat = "SELECT * FROM `%s`";
String temporaryTableName = fullTableName(sourceTableId);
String sqlQuery = String.format(queryFormat, temporaryTableName);
QueryJobConfiguration queryConfig =
jobConfigurationFactory
.createQueryJobConfigurationBuilder(sqlQuery, Collections.emptyMap())
.setUseLegacySql(false)
.setDestinationTable(destinationTableId)
.setWriteDisposition(writeDisposition)
.build();
return create(JobInfo.newBuilder(queryConfig).build());
}
public boolean isTablePartitioned(TableId tableId) {
TableInfo table = getTable(tableId);
if (table == null) {
return false;
}
TableDefinition tableDefinition = table.getDefinition();
if (tableDefinition instanceof StandardTableDefinition) {
StandardTableDefinition sdt = (StandardTableDefinition) tableDefinition;
return sdt.getTimePartitioning() != null || sdt.getRangePartitioning() != null;
}
return false;
}
/**
* Overwrites the partitions of the destination table, using the partitions from the given
* temporary table, transactionally.
*
* @param temporaryTableId The {@code TableId} representing the temporary-table.
* @param destinationTableId The {@code TableId} representing the destination table.
* @return The {@code Job} object representing this operation (which can be tracked to wait until
* it has finished successfully).
*/
public Job overwriteDestinationWithTemporaryDynamicPartitons(
TableId temporaryTableId, TableId destinationTableId) {
TableDefinition destinationDefinition = getTable(destinationTableId).getDefinition();
String sqlQuery = null;
if (destinationDefinition instanceof StandardTableDefinition) {
String destinationTableName = fullTableName(destinationTableId);
String temporaryTableName = fullTableName(temporaryTableId);
StandardTableDefinition sdt = (StandardTableDefinition) destinationDefinition;
TimePartitioning timePartitioning = sdt.getTimePartitioning();
if (timePartitioning != null) {
sqlQuery =
getQueryForTimePartitionedTable(
destinationTableName, temporaryTableName, sdt, timePartitioning);
} else {
RangePartitioning rangePartitioning = sdt.getRangePartitioning();
if (rangePartitioning != null) {
sqlQuery =
getQueryForRangePartitionedTable(
destinationTableName, temporaryTableName, sdt, rangePartitioning);
}
}
if (sqlQuery != null) {
QueryJobConfiguration queryConfig =
jobConfigurationFactory
.createQueryJobConfigurationBuilder(sqlQuery, Collections.emptyMap())
.setUseLegacySql(false)
.build();
return create(JobInfo.newBuilder(queryConfig).build());
}
}
// no partitioning, default to standard overwrite
return overwriteDestinationWithTemporary(temporaryTableId, destinationTableId);
}
/**
* Overwrites the given destination table, with all the data from the given temporary table,
* transactionally.
*
* @param temporaryTableId The {@code TableId} representing the temporary-table.
* @param destinationTableId The {@code TableId} representing the destination table.
* @return The {@code Job} object representing this operation (which can be tracked to wait until
* it has finished successfully).
*/
public Job overwriteDestinationWithTemporary(
TableId temporaryTableId, TableId destinationTableId) {
String queryFormat =
"MERGE `%s`\n"
+ "USING (SELECT * FROM `%s`)\n"
+ "ON FALSE\n"
+ "WHEN NOT MATCHED THEN INSERT ROW\n"
+ "WHEN NOT MATCHED BY SOURCE THEN DELETE";
String destinationTableName = fullTableName(destinationTableId);
String temporaryTableName = fullTableName(temporaryTableId);
String sqlQuery = String.format(queryFormat, destinationTableName, temporaryTableName);
QueryJobConfiguration queryConfig =
jobConfigurationFactory
.createQueryJobConfigurationBuilder(sqlQuery, Collections.emptyMap())
.setUseLegacySql(false)
.build();
return create(JobInfo.newBuilder(queryConfig).build());
}
/**
* Appends all the data from the given temporary table, to the given destination table,
* transactionally.
*
* @param temporaryTableId The {@code TableId} representing the temporary-table.
* @param destinationTableId The {@code TableId} representing the destination table.
* @return The {@code Job} object representing this operation (which can be tracked to wait until
* it has finished successfully).
*/
public Job appendDestinationWithTemporary(TableId temporaryTableId, TableId destinationTableId) {
return copyData(temporaryTableId, destinationTableId, JobInfo.WriteDisposition.WRITE_APPEND);
}
/**
* Creates a String appropriately formatted for BigQuery Storage Write API representing the given
* table.
*
* @param tableId The {@code TableId} representing the given object.
* @return The formatted String.
*/
public String createTablePathForBigQueryStorage(TableId tableId) {
Preconditions.checkNotNull(tableId, "tableId cannot be null");
// We need the full path for the createWriteStream method. We used to have it by creating the
// table and then taking its full tableId, but that caused an issue with the ErrorIfExists
// implementation (now the check, done in another place is positive). To solve it, we do what
// the BigQuery client does on Table ID with no project - take the BigQuery client own project
// ID. This gives us the same behavior but allows us to defer the table creation to the last
// minute.
String project = tableId.getProject() != null ? tableId.getProject() : getProjectId();
return String.format(
"projects/%s/datasets/%s/tables/%s", project, tableId.getDataset(), tableId.getTable());
}
public TableInfo getReadTable(ReadTableOptions options) {
Iterable<Table> listTables(DatasetId datasetId, TableDefinition.Type... types) {
Set<TableDefinition.Type> allowedTypes = ImmutableSet.copyOf(types);
Iterable<Table> allTables = bigQuery.listTables(datasetId).iterateAll();
return StreamSupport.stream(allTables.spliterator(), false)
.filter(table -> allowedTypes.contains(table.getDefinition().getType()))
.collect(ImmutableList.toImmutableList());
}
public Table update(TableInfo table) {
return bigQuery.update(table);
}
public Job createAndWaitFor(JobConfiguration.Builder jobConfiguration) {
return createAndWaitFor(jobConfiguration.build());
}
public Job createAndWaitFor(JobConfiguration jobConfiguration) {
JobInfo jobInfo = JobInfo.of(jobConfiguration);
Job job = bigQuery.create(jobInfo);
Job returnedJob = null;
log.info("Submitted job {}. jobId: {}", jobConfiguration, job.getJobId());
try {
Job completedJob = job.waitFor();
if (completedJob == null) {
throw new BigQueryException(
BaseHttpServiceException.UNKNOWN_CODE,
String.format("Failed to run the job [%s], got null back", job));
}
if (completedJob.getStatus().getError() != null) {
throw new BigQueryException(
BaseHttpServiceException.UNKNOWN_CODE,
String.format(
"Failed to run the job [%s], due to '%s'",
completedJob, completedJob.getStatus().getError()));
}
return completedJob;
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new BigQueryException(
BaseHttpServiceException.UNKNOWN_CODE,
String.format("Failed to run the job [%s], task was interrupted", job),
e);
}
}
Job create(JobInfo jobInfo) {
return bigQuery.create(jobInfo);
}
public TableResult query(String sql) {
try {
return bigQuery.query(
jobConfigurationFactory
.createQueryJobConfigurationBuilder(sql, Collections.emptyMap())
.build());
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new BigQueryException(
BaseHttpServiceException.UNKNOWN_CODE,
String.format("Failed to run the query [%s]", sql),
e);
}
}
String createSql(
TableId table,
ImmutableList