SELECT is a keyword of the SQL language (specifically of DML) that reads data from the database.
A SELECT statement is called a query.
The SELECT statement has three capabilities:
- projection (choose a subset of the columns in the table)
- selection (choose a subset of the rows in the table)
- joining (combine rows from two or more tables)
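The capabilities of a query can be sketched with a small in-memory database. This is only an illustration: the `emp` and `dept` tables, their columns, and their rows are hypothetical, and the example uses Python's built-in `sqlite3` module rather than any particular production database. It shows projection and joining, plus row filtering with WHERE (commonly called selection).

```python
import sqlite3

# Hypothetical schema and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE emp (emp_id INTEGER PRIMARY KEY, name TEXT,
                      salary INTEGER, dept_id INTEGER);
    INSERT INTO dept VALUES (10, 'Sales'), (20, 'IT');
    INSERT INTO emp VALUES (1, 'Ann', 3000, 10),
                           (2, 'Bob', 1500, 20),
                           (3, 'Cy', 2500, 20);
""")

# Projection: keep only a subset of the columns.
projection = conn.execute(
    "SELECT name, salary FROM emp ORDER BY emp_id"
).fetchall()

# Selection: keep only the rows that satisfy a condition.
selection = conn.execute(
    "SELECT * FROM emp WHERE salary > 2000 ORDER BY emp_id"
).fetchall()

# Joining: combine rows of two tables through a common column.
joined = conn.execute(
    "SELECT e.name, d.dept_name "
    "FROM emp e JOIN dept d ON d.dept_id = e.dept_id "
    "ORDER BY e.emp_id"
).fetchall()

print(projection)
print(selection)
print(joined)
```

A single query can combine all three capabilities, for instance a join with a WHERE clause and an explicit column list.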
The Sqoop utility, for instance, uses a splitting column to divide the work of reading a table across parallel tasks.
By default, it identifies the primary key column (if present) in a table and uses it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly sized components of the total range.
For example, with the following parameters:
- primary key column minimum value: 0
- primary key column maximum value: 1000
- 4 threads
you would get the following query for each thread:
|1|SELECT * FROM sometable WHERE id >= 0 AND id < 250|
|2|SELECT * FROM sometable WHERE id >= 250 AND id < 500|
|3|SELECT * FROM sometable WHERE id >= 500 AND id < 750|
|4|SELECT * FROM sometable WHERE id >= 750 AND id < 1001|
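The arithmetic behind these ranges can be sketched as follows. This is a simplified illustration, not Sqoop's actual implementation: the `split_ranges` and `split_queries` helpers are hypothetical names, and the sketch assumes an integer splitting column. Each range is half-open, and the last one ends at the maximum value plus one so that the row holding the maximum is included.

```python
def split_ranges(min_id, max_id, num_splits):
    """Return half-open (lo, hi) ranges that cover min_id..max_id inclusive.

    Hypothetical helper: a simplified sketch of range splitting,
    assuming an integer splitting column.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    if max_id < min_id:
        raise ValueError("max_id must not be smaller than min_id")

    # Width of each range; at least 1 so that the loop always advances.
    step = max(1, (max_id - min_id) // num_splits)

    ranges = []
    lo = min_id
    for i in range(num_splits):
        if lo > max_id:
            break  # fewer distinct values than requested splits
        is_last = i == num_splits - 1
        # The final range absorbs the remainder and includes max_id itself.
        hi = max_id + 1 if is_last or lo + step > max_id else lo + step
        ranges.append((lo, hi))
        lo = hi
    return ranges


def split_queries(table, column, min_id, max_id, num_splits):
    """Build one bounded SELECT statement per range."""
    return [
        f"SELECT * FROM {table} WHERE {column} >= {lo} AND {column} < {hi}"
        for lo, hi in split_ranges(min_id, max_id, num_splits)
    ]


for query in split_queries("sometable", "id", 0, 1000, 4):
    print(query)
```

With a minimum of 0, a maximum of 1000, and 4 splits, the sketch yields the same four boundaries as the table above.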
The primary key values must be uniformly distributed across their range; otherwise, some threads receive far more rows than others and the work is unbalanced.
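The effect of a skewed key can be seen by counting how many rows fall into each range. The sketch below reuses the four ranges of the example; the two lists of id values are made-up data, and `rows_per_range` is a hypothetical helper.

```python
# The four half-open ranges of the example (min 0, max 1000, 4 threads).
ranges = [(0, 250), (250, 500), (500, 750), (750, 1001)]


def rows_per_range(ids, ranges):
    """Count how many id values fall into each half-open (lo, hi) range."""
    return [sum(1 for i in ids if lo <= i < hi) for lo, hi in ranges]


# Made-up data: one row for every id between 0 and 1000.
uniform_ids = list(range(0, 1001))

# Made-up data: 100 rows packed at the low end, plus a single row at 1000.
skewed_ids = list(range(0, 100)) + [1000]

uniform = rows_per_range(uniform_ids, ranges)
skewed = rows_per_range(skewed_ids, ranges)

print(uniform)  # [250, 250, 250, 251]
print(skewed)   # [100, 0, 0, 1]
```

In the skewed case the minimum and maximum are still 0 and 1000, so the ranges are unchanged, but the first thread reads almost every row while two threads read none.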