About
A data flow transfers data from a source to a destination.
Each step in the Data Flow task operates in sequence on a rowset of data as it passes through the pipeline. The data flow engine uses buffers (???) to optimize the rate of flow for the data that is passing through the pipeline.
Articles Related
Component
Source/Destination
A source/destination definition includes:
- A connection
- A table or file path into which the data must be inserted
Transformation
- Row transformations—update column values or create new columns for each row in the data flow
- Rowset transformations—create new rowsets
- Split and Join transformations—merge or branch data flows
- BI transformations—perform BI tasks
- Slowly Changing Dimension,
- Fuzzy Grouping: Uses fuzzy logic to deduplicate rows in the data flow.
- Fuzzy Lookup: Looks up columns in a data source by finding approximate matches for values in the input.
- Data Mining Query. Runs a data mining prediction query against the input to predict unknown column values.
- Data Cleansing. Applies a Data Quality Services knowledge base to data as it flows through the pipeline.
- Custom transformations—perform custom operations
For more info: See Integration Services Transformations.
Performance
Sorting
If subsequent transformations in your data flow rely on sorted data, use the IsSorted property of the output to indicate that the data is already sorted.
Data Flow task properties
DefaultBufferSize and DefaultBufferMaxRows
Configuring the size of the buffers that the data flow uses can significantly improve performance. When there is sufficient memory available, you should try to achieve a small number of large buffers without incurring any disk paging. The default values for these properties are 10 MB and 10,000 rows respectively.
BufferTempStoragePath and BLOBTempStoragePath.
Using these properties to locate temporary objects created by the data flow to a fast disk, or spreading them across multiple storage devices, can improve performance.
EngineThreads
Setting the number of threads available to the Data Flow task can improve execution performance, particularly in packages where the MaxConcurrentExecutables property has been set to enable parallel execution of the package’s tasks across multiple processors.
RunInOptimizedMode
Setting a Data Flow task to run in optimized mode increases performance by removing any columns or components that are not required further downstream in the data flow.