Reliable Data Streams and Scalable Platforms: Tackling Key Data Challenges
Bridging Software Engineering and Data: A Practical Guide
Introduction
Matthias Niehoff:
I'm Matthias, and I work at Codecentric, a consultancy based in Germany. My career began in software engineering, and I later moved into data-driven projects with Apache Spark and analytics platforms. My foundation is still in engineering, but applied to data challenges.
---
Data Change Example
Scenario
- Initial state:
  - System records `order_id` and `quantity`.
  - Analytics selects only these fields.
- New requirement:
  - Application adds an `unfulfilled` status.
  - Analytics still ignores it.
- Later change:
  - `unfulfilled` changes from boolean → integer (to count unfulfilled items).
Risks
- Silent misinterpretation through implicit type conversion (see the sketch below).
- Breakage when type assumptions are wrong.
Core problem:
Schemas are exposed without business context, leaving consumers guessing about meaning and constraints.
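A minimal Python sketch of how the boolean-to-integer change above can be silently misread by a consumer written against the old schema; the field and record values are illustrative.

```python
# Consumer written when `unfulfilled` was a boolean flag.
def has_open_items(record: dict) -> bool:
    # Old assumption: True/False. After the producer switches to an integer
    # count, bool(0) is still False and bool(2) is True, so this code keeps
    # "working" -- the change in meaning is never surfaced as an error.
    return bool(record.get("unfulfilled", False))

old_record = {"order_id": 1, "quantity": 3, "unfulfilled": True}
new_record = {"order_id": 2, "quantity": 5, "unfulfilled": 2}  # now a count

print(has_open_items(old_record))  # True
print(has_open_items(new_record))  # True -- silently reinterpreted, no failure
```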
---
The Root Cause: Missing Context
Consumers often don’t:
- Know the generating business process.
- Understand constraints or rules.
- Trust the source.
- Recognize availability guarantees.
Producers frequently:
- Lack visibility into who uses their data.
- Apply “raw dump” approaches without governance.
---
Real-World Case: FinTech
Situation:
- Small B2C FinTech wanted marketing analytics.
- Analysts lacked deep technical skills.
- Organization focused on app-dev, ignoring data’s strategic value.
Our Solution:
- Deliver data files periodically to an SFTP server.
- Use dbt to test for data changes and nulls (sketched below).
- Build dashboards for visibility.
- Document business processes & constraints.
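A minimal dbt `schema.yml` sketch of the kind of tests mentioned above; the model and column names are illustrative, not from the actual project.

```yaml
# models/staging/schema.yml (illustrative names)
version: 2

models:
  - name: stg_orders
    description: "Orders delivered via SFTP, one file per day."
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: quantity
        tests:
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['fulfilled', 'unfulfilled', 'cancelled']
```

Running `dbt test` then fails the pipeline as soon as the source data drifts from these expectations, instead of letting bad rows reach the dashboards.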
Outcome:
Mitigated data quality problems and improved collaboration with the source system teams.
---
Data Sharing Models
Current State
- Apps expose APIs (REST, streaming, messaging) for operations.
- Analytics gets raw model dumps → consumers struggle to interpret them.
Proposed
- An analytics-facing API suited to bulk queries.
- A shared data model designed and owned by the development teams.
- Interfaces that are efficient, structured, and purposeful.
Example destinations:
- DB views, broker topics, blob store files, Iceberg tables, data warehouses.
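A hedged PySpark sketch of publishing such a purpose-built shared model to a governed table; the source table, column, and destination names are assumptions, and the same pattern applies whether the format is Delta or Iceberg.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Curate a consumer-facing model instead of dumping the raw operational tables.
orders = spark.table("operational.orders")  # assumed source table
shared_orders = (
    orders
    .select("order_id", "quantity", "unfulfilled_count", "updated_at")
    .withColumn("is_unfulfilled", F.col("unfulfilled_count") > 0)
)

# Publish to a destination owned and versioned by the producing team.
(shared_orders.write
    .format("delta")                             # or "iceberg"
    .mode("overwrite")
    .saveAsTable("analytics_shared.orders_v1"))  # assumed schema/table name
```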
Integration idea:
- MCP (Model Context Protocol) servers could expose shared models directly to LLMs.
---
Taking Ownership: Opportunity for Developers
- Apply CI/CD, tests, monitoring from software disciplines to data pipelines.
- Use familiar stacks (Java, Kotlin) to write to data infra (Delta Lake via Databricks).
- Build libraries to abstract complexity:
  - Push files to blob storage.
  - Publish metadata to Unity Catalog.
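A minimal sketch of such an abstraction layer, assuming Azure Blob Storage via the `azure-storage-blob` package; the Unity Catalog step is left as a hypothetical hook, because the exact registration call depends on the platform setup.

```python
from pathlib import Path
from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob

def push_dataset(local_file: Path, container: str, blob_name: str, conn_str: str) -> None:
    """Upload a data file to blob storage so downstream tooling can pick it up."""
    service = BlobServiceClient.from_connection_string(conn_str)
    blob = service.get_blob_client(container=container, blob=blob_name)
    with local_file.open("rb") as fh:
        blob.upload_blob(fh, overwrite=True)

def register_metadata(table_name: str, location: str) -> None:
    """Hypothetical hook: publish table metadata (e.g. to Unity Catalog).

    The concrete call depends on the governance tooling in use; this placeholder
    only shows where the library would hide that complexity from app teams.
    """
    raise NotImplementedError
```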
---
Introducing Data Contracts
Definition:
- Agreement between provider & consumer about:
  - Data scope & schema.
  - Guarantees & constraints.
  - Update frequency (SLA).
  - Ownership & roles.
Format:
- Often YAML, machine-readable.
- Open Data Contract Standard for portability.
- Automated enforcement via `datacontract-cli`.
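A small, illustrative YAML sketch of such a contract; the field names are approximate rather than an exact excerpt of any spec version — the Open Data Contract Standard and `datacontract-cli` define the precise layout.

```yaml
# orders-contract.yaml (illustrative, not an exact spec excerpt)
id: orders-shared-model
version: 1.2.0                # semantic versioning
owner: checkout-team
description: Curated order data for analytics consumers.
schema:
  orders:
    fields:
      order_id:
        type: string
        required: true
      quantity:
        type: integer
        required: true
      unfulfilled_count:
        type: integer
        description: Number of unfulfilled items (was a boolean before v1.0.0).
service_level:
  freshness: "updated daily by 06:00 UTC"
```

The `datacontract-cli` mentioned above can then lint such a file and test actual data against it as part of the pipeline.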
Benefits:
- Transparency.
- Versioning (semantic).
- Early detection of breaking changes.
---
Where to Enforce Contracts
Publisher side:
- Integrated into CI/CD PR checks for:
  - Format compliance.
  - Schema consistency.
  - Data quality.
Consumer side:
- Validation at ingestion:
  - Schema, quality, and SLA adherence.
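A minimal consumer-side sketch of such ingestion-time validation, assuming pandas and a hand-written expectation; in practice the check could be generated from the contract or delegated to `datacontract-cli`.

```python
import pandas as pd

# Illustrative expectations, ideally derived from the data contract.
EXPECTED_COLUMNS = {"order_id": "int64", "quantity": "int64", "unfulfilled_count": "int64"}

def validate_batch(df: pd.DataFrame) -> None:
    """Fail fast at ingestion instead of letting bad data propagate downstream."""
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    for column, dtype in EXPECTED_COLUMNS.items():
        if str(df[column].dtype) != dtype:
            raise TypeError(f"{column}: expected {dtype}, got {df[column].dtype}")
    if df["order_id"].isna().any():
        raise ValueError("order_id contains nulls")

validate_batch(pd.DataFrame({"order_id": [1, 2], "quantity": [3, 5], "unfulfilled_count": [0, 2]}))
```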
---
Scaling With Trusted Data
Scale is not just data volume; it means:
- More use cases per dataset.
- More value extraction.
- More business impact.
---
Technology and Platform Management
Challenge:
The tooling ecosystem grows rapidly, with GenAI and automation adding new layers.
Strategies:
- Explorer: track all new tech.
- Selective adopter: focus on relevant tools.
- Deep diver: master one niche.
Distinction:
- Essential complexity: core functional/non-functional constraints.
- Accidental complexity: avoidable overhead.
---
Choosing Boring Technologies
Why:
- Mature, predictable, documented.
- Easy to find community support.
Example:
Postgres — stable, versatile, boring but effective.
Innovation tokens:
- Budget new tech adoption (1–3/year).
- Spend deliberately.
---
Simplify With Open Standards
Advantages:
- Avoid vendor lock-in.
- Switch tools easily.
- Examples: Delta, Iceberg, Parquet.
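A small sketch of the portability argument, assuming `pyarrow`: the same Parquet file written here can be read by Spark, DuckDB, Trino, or a warehouse without conversion.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write once in an open format...
table = pa.table({"order_id": [1, 2, 3], "quantity": [3, 5, 1]})
pq.write_table(table, "orders.parquet")

# ...and read it back with any engine that speaks Parquet.
print(pq.read_table("orders.parquet").to_pandas())
```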
Cloud-native simplicity:
- Automation.
- Scale to zero.
- Minimize vertical integration.
---
Customer Case – Banking
Approach:
- Cloud-based data platform (Azure).
- Delta Lake + PySpark on Databricks.
- Managed services (Unity Catalog, Serverless SQL) for security & access management.
- Open standards for future flexibility.
---
Smaller Stacks (FinTech)
Stack:
- Postgres + dbt + Airflow.
- Python ingestion.
- Contract-based synchronization.
- Lightweight, efficient.
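A minimal Airflow (2.x) sketch of that stack, assuming a Python ingestion function and a dbt project available on the same worker; DAG, task, and path names are illustrative.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def ingest_orders() -> None:
    """Pull the latest order file and load it into Postgres (details omitted)."""
    ...

with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # one run per day
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_orders", python_callable=ingest_orders)
    transform = BashOperator(task_id="dbt_run", bash_command="dbt run --project-dir /opt/dbt")
    ingest >> transform  # ingestion first, then dbt transformations
```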
---
Applying Software Engineering to Data
- Test code and data.
- Use separate environments.
- Reduce complexity.
- Document standards.
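A short pytest-style sketch of the "test code and data" point: unit-test the transformation logic itself, not only the pipeline run. Function and column names are illustrative.

```python
import pandas as pd

def add_open_flag(df: pd.DataFrame) -> pd.DataFrame:
    """Transformation under test: derive a boolean flag from the unfulfilled count."""
    out = df.copy()
    out["is_unfulfilled"] = out["unfulfilled_count"] > 0
    return out

def test_add_open_flag():
    df = pd.DataFrame({"order_id": [1, 2], "unfulfilled_count": [0, 2]})
    result = add_open_flag(df)
    assert result["is_unfulfilled"].tolist() == [False, True]
```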
---
Customer Case – Industrial IoT Startup
Stack:
- Document DB (JSON) + TypeScript frontend.
- Relational DB for master data.
- Python aggregations.
- Evolution from visualization → analytics.
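A small sketch of the Python aggregation step above, assuming JSON documents exported from the document DB and flattened with pandas; field names are illustrative.

```python
import pandas as pd

# Documents as they might come out of the JSON document store.
docs = [
    {"machine_id": "m1", "metrics": {"temperature": 71.2, "vibration": 0.4}},
    {"machine_id": "m1", "metrics": {"temperature": 69.8, "vibration": 0.5}},
    {"machine_id": "m2", "metrics": {"temperature": 80.1, "vibration": 1.2}},
]

# Flatten the nested structure and aggregate per machine.
df = pd.json_normalize(docs)
summary = df.groupby("machine_id")[["metrics.temperature", "metrics.vibration"]].mean()
print(summary)
```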
---
Integrating Data & Application Platforms
Why:
- Shared CI/CD patterns.
- Unified secrets & test data management.
- Observability across app & data.
---
Beyond Two-Tier Data Architectures
DuckDB enables:
- Embedded analytics anywhere (server, browser via Wasm, Lambda).
- Shared datastore serving both apps & analytics.
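A minimal DuckDB sketch of that idea, using the Python client: one embedded database file can back the application and answer analytical queries, including directly over Parquet files.

```python
import duckdb

con = duckdb.connect("app.duckdb")  # embedded file, no separate server

# Application-side writes...
con.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, quantity INTEGER)")
con.execute("INSERT INTO orders VALUES (1, 3), (2, 5)")

# ...analytical reads against the same store...
print(con.execute("SELECT count(*), sum(quantity) FROM orders").fetchall())

# ...and export/query of open formats without a separate ETL hop.
con.execute("COPY orders TO 'orders.parquet' (FORMAT parquet)")
print(con.execute("SELECT count(*) FROM 'orders.parquet'").fetchall())
```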
Benefits:
- Reduced ETL friction.
- Faster iteration.
- Flexible real-time analytics.
---
Final Takeaways
- Validate & monitor data early.
- Involve developers from start.
- Align org priorities toward data.
- Simplify architectures.
---
Q&A Highlights
- Unifying platforms still needs cataloging, security, and sharing.
- Buy vs build: boring ≠ bad — buy when reducing commodity overhead makes sense.
- Change tech deliberately — clear motivation & roadmap.
- One vs multiple platform versions: depends on storage-query coupling.
---
Key Strategic Themes:
- Open standards for flexibility.
- Boring tech for reliability.
- Automation for scale.
- Cross-discipline principles for robust pipelines.
---
Additional Resources: