# Data Flows

Understanding data flows and their relationship to data models


Data flows represent the movement and transformation of data through systems. Understanding how data flows relate to data models is essential for designing effective data architectures and ensuring data quality throughout the pipeline.

## What Are Data Flows?

Data flows describe:

- **Data Movement**: How data moves between systems and processes
- **Transformations**: How data is transformed at each step
- **Dependencies**: Relationships between data sources and destinations
- **Contracts**: Data contracts that govern each flow

## Data Flows and Data Models

*(Diagram: data flows and their relationship to data models)*

Data flows are closely related to data models:

- **Data Models** define the structure and constraints of data
- **Data Flows** define how data models are used and transformed
- **Contracts** ensure data quality at each flow step

## Components of a Data Flow

### Sources

Data sources where flows originate:

- **Databases**: SQL databases, NoSQL stores
- **APIs**: REST APIs, GraphQL endpoints
- **Files**: CSV, JSON, Parquet files
- **Streams**: Kafka topics, event streams

### Transformations

Operations performed on data:

- **Filtering**: Remove unwanted records
- **Mapping**: Transform field names and values
- **Aggregation**: Combine and summarize data
- **Validation**: Ensure data meets contract requirements
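
The four operation types above can be sketched over a plain array of records. Everything here is illustrative: the record shape, the validity rule, and the field renames are not from any real contract.

```javascript
// Illustrative source records; field names are hypothetical.
const records = [
  { id: "o1", amount: 120, region: "EU" },
  { id: "o2", amount: -5, region: "US" }, // invalid: negative amount
  { id: "o3", amount: 80, region: "EU" },
];

// Validation: check each record against a (toy) contract rule.
const isValid = (r) => typeof r.id === "string" && r.amount >= 0;

// Filtering: drop records that fail validation.
const valid = records.filter(isValid);

// Mapping: rename fields and normalize values.
const mapped = valid.map((r) => ({
  orderId: r.id,
  amountCents: Math.round(r.amount * 100),
  region: r.region.toLowerCase(),
}));

// Aggregation: summarize amounts by region.
const totalsByRegion = mapped.reduce((acc, r) => {
  acc[r.region] = (acc[r.region] ?? 0) + r.amountCents;
  return acc;
}, {});

console.log(totalsByRegion); // totals per region, in cents
```

In a real pipeline each of these steps would typically be a separate, independently testable stage rather than one chained expression.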

### Destinations

Where data flows end up:

- **Data Warehouses**: Analytics and reporting systems
- **Applications**: Business applications and services
- **APIs**: External systems and integrations
- **Storage**: Long-term data storage

## Creating Data Flows

### Step 1: Define Source Model

Start by defining the data model for your source:

```text
sourceOrder {
  order_id: string
  customer_name: string
  order_date: date
  items: sourceOrderItem[]
}
```

### Step 2: Define Destination Model

Define the target data model:

```text
targetOrder {
  id: string
  customerId: string
  orderDate: date
  lineItems: orderLineItem[]
  totalAmount: number
}
```

### Step 3: Define Transformation

Map source to destination:

```text
flow: orderTransformation {
  source: sourceOrder
  destination: targetOrder

  transformations {
    id: sourceOrder.order_id
    customerId: lookupCustomer(sourceOrder.customer_name)
    orderDate: sourceOrder.order_date
    lineItems: transformItems(sourceOrder.items)
    totalAmount: calculateTotal(sourceOrder.items)
  }
}
```
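
As a rough executable counterpart to the mapping above, the same transformation can be sketched in plain JavaScript. The bodies of `lookupCustomer`, `transformItems`, and `calculateTotal` are toy stand-ins, and the item fields (`sku`, `qty`, `price`) are assumptions, not part of the models above.

```javascript
// Toy lookup table standing in for a real customer service.
const customerIds = { "Ada Lovelace": "cust-001" };
const lookupCustomer = (name) => customerIds[name] ?? null;

// Rename source item fields into the target line-item shape (fields assumed).
const transformItems = (items) =>
  items.map((i) => ({ sku: i.sku, quantity: i.qty, unitPrice: i.price }));

const calculateTotal = (items) =>
  items.reduce((sum, i) => sum + i.qty * i.price, 0);

// The flow's field mapping, one target field per line as in the DSL.
function orderTransformation(sourceOrder) {
  return {
    id: sourceOrder.order_id,
    customerId: lookupCustomer(sourceOrder.customer_name),
    orderDate: sourceOrder.order_date,
    lineItems: transformItems(sourceOrder.items),
    totalAmount: calculateTotal(sourceOrder.items),
  };
}

const target = orderTransformation({
  order_id: "o-42",
  customer_name: "Ada Lovelace",
  order_date: "2024-01-15",
  items: [{ sku: "A1", qty: 2, price: 9.5 }],
});
```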

### Step 4: Apply Contracts

Ensure data quality at each step:

```text
flow: orderTransformation {
  source: sourceOrder @contract(source-order-contract)
  destination: targetOrder @contract(target-order-contract)

  validate {
    source: validateSource()
    destination: validateDestination()
  }
}
```

## Data Flow Patterns

### ETL (Extract, Transform, Load)

Traditional batch processing:

1. **Extract**: Read data from source
2. **Transform**: Apply transformations
3. **Load**: Write to destination
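
The three steps can be sketched with in-memory stand-ins for the source table and the warehouse; in practice extract and load would talk to real systems.

```javascript
// In-memory stand-ins for a source table and a warehouse.
const sourceTable = [
  { order_id: "o1", amount: 10 },
  { order_id: "o2", amount: 25 },
];
const warehouse = [];

// 1. Extract: read all rows from the source.
const extract = () => sourceTable;

// 2. Transform: reshape rows into the warehouse schema (illustrative).
const transform = (rows) =>
  rows.map((r) => ({ id: r.order_id, amountCents: r.amount * 100 }));

// 3. Load: write the transformed rows to the destination.
const load = (rows) => warehouse.push(...rows);

load(transform(extract()));
```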

### Streaming

Real-time data processing:

1. **Ingest**: Receive streaming data
2. **Process**: Transform in real-time
3. **Output**: Send to downstream systems
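
A minimal in-memory sketch of this pattern, with a toy subscribe/emit stream standing in for a real broker such as Kafka:

```javascript
const downstream = [];

// Toy event stream: handlers run synchronously as each event is emitted.
function makeStream() {
  const handlers = [];
  return {
    subscribe: (fn) => handlers.push(fn),                 // register a processor
    emit: (event) => handlers.forEach((fn) => fn(event)), // 1. Ingest an event
  };
}

const stream = makeStream();

// 2. Process each event as it arrives, 3. Output the result downstream.
stream.subscribe((event) =>
  downstream.push({ tempF: (event.temp * 9) / 5 + 32 })
);

stream.emit({ temp: 20 });
stream.emit({ temp: 25 });
```

A real streaming processor adds concerns this sketch ignores: backpressure, ordering guarantees, and at-least-once versus exactly-once delivery.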

### Change Data Capture (CDC)

Capture and propagate changes:

1. **Detect**: Identify data changes
2. **Capture**: Record changes
3. **Propagate**: Send changes to destinations
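
A snapshot-diff sketch of the three steps (production CDC tools usually read the database's change log instead of diffing snapshots; the record shape here is illustrative):

```javascript
// Two snapshots of the same table, before and after some updates.
const before = [
  { id: 1, email: "a@example.com" },
  { id: 2, email: "b@example.com" },
];
const after = [
  { id: 1, email: "a@example.org" }, // changed
  { id: 2, email: "b@example.com" }, // unchanged
  { id: 3, email: "c@example.com" }, // new
];

// 1. Detect + 2. Capture: keep records that are new or changed.
function captureChanges(prev, next) {
  const prevById = new Map(prev.map((r) => [r.id, r]));
  return next.filter((r) => {
    const old = prevById.get(r.id);
    return !old || old.email !== r.email;
  });
}

// 3. Propagate: deliver the captured changes to a destination.
const destination = [];
captureChanges(before, after).forEach((change) => destination.push(change));
```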

## Linking Flows to Models

### Contract-Based Validation

Each flow step should have a data contract:

```text
flow: customerSync {
  source: customerDB @contract(customer-db-contract)
  transform: customerMapping @contract(customer-transform-contract)
  destination: customerAPI @contract(customer-api-contract)
}
```

### Model Evolution

Handle schema changes in flows:

```text
flow: orderProcessing {
  source: orderSource @contract(order-v1-contract)
  transform: orderTransform @contract(order-v2-contract)
  destination: orderTarget @contract(order-v2-contract)

  compatibility: backward
}
```
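
The `compatibility: backward` setting can be illustrated with a toy check: a v2 reader stays backward compatible with v1 data only if every field added in v2 is optional or carries a default. The schemas below are illustrative, not the real contract format.

```javascript
// Field lists for two illustrative schema versions.
const v1Fields = ["id", "customerId", "orderDate"];
const v2Fields = [
  { name: "id", required: true },
  { name: "customerId", required: true },
  { name: "orderDate", required: true },
  { name: "totalAmount", required: false, default: 0 }, // added in v2
];

// Backward compatible if every field new in v2 is optional or defaulted,
// so v2 readers can still consume v1 records.
function isBackwardCompatible(oldFields, newFields) {
  return newFields
    .filter((f) => !oldFields.includes(f.name))
    .every((f) => !f.required || "default" in f);
}

const compatible = isBackwardCompatible(v1Fields, v2Fields);
```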

## Best Practices

### Design Principles

- **Explicit Contracts**: Define contracts for each flow step
- **Validation**: Validate data at each transformation point
- **Documentation**: Document flow logic and transformations
- **Monitoring**: Monitor flow health and data quality

### Performance

- **Optimize Transformations**: Minimize data processing overhead
- **Batch Processing**: Group operations when possible
- **Caching**: Cache lookups and transformations
- **Parallel Processing**: Process independent flows in parallel
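
The caching point can be sketched with a generic memoizer wrapped around a simulated slow lookup; `fetchCustomerId` is a hypothetical stand-in for a backend call.

```javascript
// Count backend calls so the cache's effect is observable.
let backendCalls = 0;
const fetchCustomerId = (name) => {
  backendCalls += 1; // stand-in for an expensive remote lookup
  return `cust-${name.length}`;
};

// Memoize any single-argument function keyed by its input.
function memoize(fn) {
  const cache = new Map();
  return (key) => {
    if (!cache.has(key)) cache.set(key, fn(key)); // miss: compute and store
    return cache.get(key);                        // hit: serve from memory
  };
}

const cachedLookup = memoize(fetchCustomerId);
cachedLookup("Ada");
cachedLookup("Ada"); // second call served from cache; backend hit once
```

In a long-running flow the cache would also need a size bound or TTL so stale lookups do not accumulate.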

### Error Handling

- **Validation Errors**: Handle contract validation failures
- **Transformation Errors**: Handle transformation failures gracefully
- **Retry Logic**: Implement retry for transient failures
- **Dead Letter Queues**: Route failed records for investigation
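
The retry and dead-letter points can be sketched together: retry a record a fixed number of times, then route it to an in-memory dead letter queue for investigation. The handler and record here are illustrative.

```javascript
const deadLetterQueue = [];

// Try the handler up to maxAttempts times; on final failure, dead-letter.
function processWithRetry(record, handler, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return handler(record); // success: return the handler's result
    } catch (err) {
      if (attempt === maxAttempts) {
        deadLetterQueue.push({ record, error: err.message }); // give up
        return null;
      }
      // otherwise fall through and retry
    }
  }
}

// A handler that always fails, to exercise the dead-letter path.
const alwaysFails = () => {
  throw new Error("transient outage");
};
processWithRetry({ id: "o9" }, alwaysFails);
```

A production version would also add backoff between attempts and distinguish transient errors (worth retrying) from permanent ones (dead-letter immediately).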

## Visualizing Data Flows

### Flow Diagrams

Create visual representations of data flows:

- **Source Systems**: Where data originates
- **Transformation Steps**: How data is transformed
- **Destination Systems**: Where data ends up
- **Contracts**: Data contracts at each step

### Documentation

Document flows with:

- **Flow Descriptions**: What the flow does
- **Model Mappings**: How source maps to destination
- **Transformation Logic**: Business rules and logic
- **Error Handling**: How errors are handled

## Integration with Open Data Modelling

### Creating Flow Definitions

Define data flows using Open Data Modelling:

```javascript
import { DataFlow, DataModel } from "@opendatamodelling/sdk";

const sourceModel = new DataModel("source-order", sourceSchema);
const targetModel = new DataModel("target-order", targetSchema);

const flow = new DataFlow({
  name: "order-transformation",
  source: sourceModel,
  destination: targetModel,
  transformations: transformationRules,
});
```

### Validating Flows

Ensure flows comply with contracts:

```javascript
const isValid = await flow.validate({
  sourceContract: sourceContract,
  destinationContract: targetContract,
});
```

## Related Resources

- [Crowsfeat Notation](/guides/crowsfeat/) - Data structure notation
- [Data Contracts](/data-contracts/) - Understanding data contracts
- <a href="https://bitol-io.github.io/open-data-contract-standard/latest/" target="_blank" rel="noopener noreferrer">ODCS Documentation</a> - Standard format documentation
- <a href="https://bitol-io.github.io/open-data-product-standard" target="_blank" rel="noopener noreferrer">ODPS Documentation</a> - Open Data Product Standard