Cost drivers: Copy, Data Flow, and Orchestration
Data Factory bills for three primary categories of work:
- Copy Activity: Moves data without transformation. Cost scales with DIU-hours (Data Integration Units × hours consumed). DIUs are the units of compute assigned to a copy task; more DIUs mean higher parallelism but also a higher cost per hour of runtime. A simple copy to a managed connector (e.g., SQL Database, Blob Storage) is cheaper than copying to self-managed sources or adding custom logic.
- Data Flow (Mapping Data Flow): Performs transformations (filtering, joining, aggregating) and writes the results. It runs on Apache Spark clusters that are billed per vCore-hour. Data flows are typically the heaviest cost component and are often overlooked during design because engineers focus on logical correctness rather than on cost per GB processed.
- Orchestration: Each pipeline run and each activity execution is billable. A pipeline that triggers 1,000 times per month with 5 activities generates 5,000 billable activity runs, on top of the 1,000 pipeline runs. Each run is cheap (£0.0001–0.001 per activity run, so roughly £0.50–5 per month for the activity runs in this example), but the total accumulates as pipelines and triggers multiply.
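To make the three drivers concrete, here is a minimal Python sketch of a monthly estimate. The `Rates` class, the function names, and every rate value are hypothetical placeholders rather than published prices (only the per-activity-run figure is taken from the upper end of the range quoted above), and the orchestration term counts activity runs only. Substitute the figures from the current price sheet for your region.

```python
from dataclasses import dataclass


# All rates are placeholder assumptions for illustration only.
@dataclass(frozen=True)
class Rates:
    per_diu_hour: float = 0.25       # assumed Copy Activity rate
    per_vcore_hour: float = 0.30     # assumed Data Flow rate
    per_activity_run: float = 0.001  # upper end of the quoted range


def copy_cost(dius: int, hours: float, rates: Rates) -> float:
    """Copy Activity: DIUs x hours consumed x rate per DIU-hour."""
    return dius * hours * rates.per_diu_hour


def data_flow_cost(vcores: int, hours: float, rates: Rates) -> float:
    """Data Flow: Spark vCores x cluster hours x rate per vCore-hour."""
    return vcores * hours * rates.per_vcore_hour


def orchestration_cost(pipeline_runs: int, activities_per_run: int,
                       rates: Rates) -> float:
    """Orchestration: total activity runs x rate per activity run."""
    return pipeline_runs * activities_per_run * rates.per_activity_run


rates = Rates()
monthly = {
    "copy": copy_cost(dius=4, hours=50, rates=rates),
    "data_flow": data_flow_cost(vcores=8, hours=50, rates=rates),
    "orchestration": orchestration_cost(1000, 5, rates),
}
for name, cost in monthly.items():
    print(f"{name}: {cost:.2f}")
print(f"total: {sum(monthly.values()):.2f}")
```

With these made-up inputs the Data Flow line dominates and orchestration is the smallest term, which matches the ordering described above. The real proportions depend entirely on your actual rates and runtimes.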
Self-hosted integration runtimes add another layer: a standing monthly charge plus per-execution overhead. Running a heavy workload on a self-hosted IR can triple the cost compared with a managed runtime.
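The self-hosted cost structure described here (a fixed charge plus a per-execution term) can be sketched as a simple comparison. The function names and all input figures are hypothetical, chosen only to show how a ratio like the one mentioned could arise. They are not measured or published numbers.

```python
def self_hosted_monthly_cost(standing_charge: float, runs: int,
                             overhead_per_run: float) -> float:
    """Fixed monthly charge plus per-execution overhead."""
    return standing_charge + runs * overhead_per_run


def cost_ratio(self_hosted: float, managed: float) -> float:
    """How many times more the self-hosted option costs than the managed one."""
    if managed <= 0:
        raise ValueError("managed cost must be positive")
    return self_hosted / managed


# Hypothetical inputs for illustration only.
managed_cost = 100.0
self_hosted_cost = self_hosted_monthly_cost(
    standing_charge=150.0, runs=5000, overhead_per_run=0.03
)
print(f"self-hosted / managed: {cost_ratio(self_hosted_cost, managed_cost):.1f}x")
```

Running the same comparison with your own standing charge and run counts shows whether the fixed term or the per-execution term dominates for a given workload.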