SSIS - Data Flow (from source to destination)

About

A data flow transfers data from a source to a destination.

Each step in the Data Flow task operates in sequence on a rowset of data as it passes through the pipeline. The data flow engine uses buffers (???) to optimize the rate of flow for the data that is passing through the pipeline.

Articles Related

Component

Source/Destination

A source/destination definition includes:

A connection
A table or file path into which the data must be inserted

Transformation

Row transformations—update column values or create new columns for each row in the data flow
Rowset transformations—create new rowsets
Split and Join transformations—merge or branch data flows
BI transformations—perform BI tasks
- Slowly Changing Dimension,
- Fuzzy Grouping: Uses fuzzy logic to deduplicate rows in the data flow.
- Fuzzy Lookup: Looks up columns in a data source by finding approximate matches for values in the input.
- Term Extraction
- Term Lookup
- Data Mining Query. Runs a data mining prediction query against the input to predict unknown column values.
- Data Cleansing. Applies a Data Quality Services knowledge base to data as it flows through the pipeline.
Custom transformations—perform custom operations

For more info: See Integration Services Transformations.

Performance

Sorting

If subsequent transformations in your data flow rely on sorted data, use the IsSorted property of the output to indicate that the data is already sorted.

Data Flow task properties

DefaultBufferSize and DefaultBufferMaxRows

Configuring the size of the buffers that the data flow uses can significantly improve performance. When there is sufficient memory available, you should try to achieve a small number of large buffers without incurring any disk paging. The default values for these properties are 10 MB and 10,000 rows respectively.

BufferTempStoragePath and BLOBTempStoragePath.

Using these properties to locate temporary objects created by the data flow to a fast disk, or spreading them across multiple storage devices, can improve performance.

EngineThreads

Setting the number of threads available to the Data Flow task can improve execution performance, particularly in packages where the MaxConcurrentExecutables property has been set to enable parallel execution of the package’s tasks across multiple processors.

RunInOptimizedMode

Setting a Data Flow task to run in optimized mode increases performance by removing any columns or components that are not required further downstream in the data flow.