Introducing Python User-Defined Table Functions (UDTF) in Unity Catalog
Introduction
Python UDFs help create an abstraction layer of custom logic to simplify query construction.
For more complex logic, such as running large-scale models or detecting patterns efficiently across table rows, you can use Python User-Defined Table Functions (UDTFs) instead.
We previously introduced session-scoped UDTFs in this blog post.
UDTFs let you run robust, stateful Python logic over entire tables, enabling SQL solutions for otherwise difficult problems.
---
Key Benefits of Python UDTFs
Flexibly Process Any Dataset
- Use the declarative `TABLE()` keyword to pipe any table (including views and dynamic subqueries) into your UDTF.
- Combine with `PARTITION BY`, `ORDER BY`, and `WITH SINGLE PARTITION` for controlled, per-partition processing.
- Each partition is processed by an independent Python function call.
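
To make the per-partition model concrete, here is a minimal pure-Python sketch that imitates what the engine does with `PARTITION BY` and `ORDER BY`: it sorts the rows, groups them by the partition key, and creates a fresh handler object for each group. The `CountRows` handler, the `run_partitioned` driver, and the `region`/`ts` columns are hypothetical names for illustration; the real scheduling is done by Spark, not by code like this.

```python
from itertools import groupby
from operator import itemgetter


class CountRows:
    """Toy handler (hypothetical): counts the rows it sees in one partition."""

    def __init__(self):
        self.key = None
        self.count = 0

    def eval(self, row):
        # Called once per input row; this handler emits nothing per row.
        self.key = row["region"]
        self.count += 1
        return ()

    def terminate(self):
        # Called once after the last row of the partition.
        yield (self.key, self.count)


def run_partitioned(handler_cls, rows, partition_key, order_key):
    """Imitate PARTITION BY / ORDER BY: one fresh handler per partition."""
    output = []
    ordered = sorted(rows, key=itemgetter(partition_key, order_key))
    for _, partition in groupby(ordered, key=itemgetter(partition_key)):
        handler = handler_cls()  # independent instance, no shared state
        for row in partition:
            output.extend(handler.eval(row) or ())
        output.extend(handler.terminate())
    return output


rows = [
    {"region": "eu", "ts": 2},
    {"region": "us", "ts": 1},
    {"region": "eu", "ts": 1},
]
result = run_partitioned(CountRows, rows, "region", "ts")
print(result)
```

Because every partition gets its own instance, the count for `eu` never leaks into the count for `us`.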
Run Heavy Initialization Once Per Partition
- Perform resource-intensive setup (e.g., loading large ML models or files) once per partition, not per row.
Maintain Context Across Rows
- UDTFs preserve state between rows within a partition.
- Ideal for time-series detection and running calculations.
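
The two points above, one-time setup and state carried between rows, can be sketched with a small handler class. This is an illustrative stand-in rather than a production UDTF: `RunningStats`, `_load_threshold`, and the `ts`/`value` columns are assumed names, and the rows are fed in by a plain loop instead of by Spark.

```python
class RunningStats:
    """Hypothetical handler: running total plus a flag for large time gaps."""

    def __init__(self):
        # Runs once per partition, so expensive setup belongs here.
        self.threshold = self._load_threshold()
        # State that survives from one row to the next.
        self.previous_ts = None
        self.running_total = 0.0

    @staticmethod
    def _load_threshold():
        # Stand-in for loading a large model or file; returns a gap limit in seconds.
        return 60

    def eval(self, row):
        gap = None if self.previous_ts is None else row["ts"] - self.previous_ts
        self.running_total += row["value"]
        self.previous_ts = row["ts"]
        is_gap = gap is not None and gap > self.threshold
        yield (row["ts"], self.running_total, is_gap)


# Drive the handler by hand over one ordered partition.
handler = RunningStats()
readings = [
    {"ts": 0, "value": 1.0},
    {"ts": 30, "value": 2.0},
    {"ts": 120, "value": 4.0},
]
results = [out for row in readings for out in handler.eval(row)]
print(results)
```

The third reading arrives 90 seconds after the second, so it is the only one flagged, while the running total accumulates across all three rows.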
---
Unity Catalog Integration
When UDTFs are defined in Unity Catalog (UC):
- They become accessible and discoverable to users with proper permissions.
- You can write a function once and reuse it across teams and workspaces.
---
Public Preview Announcement
UC Python UDTFs are now available in Public Preview with:
- Databricks Runtime 17.3 LTS
- Databricks SQL
- Serverless Notebooks and Jobs
We’ll demonstrate use cases and integration tips below.
---
Why Use UDTFs with Unity Catalog?
The UC Python UDTF Advantage
- Write once in Python; call anywhere—works across sessions, SQL warehouses, UC clusters, and pipelines.
- Discover via system tables or Catalog Explorer.
- Share under full UC governance.
- Manage permissions with grant/revoke controls.
- Lakeguard isolation runs your Python code in a secure sandbox with temporary disk and network access.
---
Quick Start Example: IP Address Matching
Problem:
Matching IPs against network CIDR blocks in SQL is tough due to missing native CIDR logic.
Solution:
Use UC Python UDTFs with Python’s `ipaddress` module to:
- Accept a table of IP logs as input.
- Load CIDR block list once per partition.
- Test each IP for membership in known networks.
- Return enriched log data with match results.
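
The core membership test can be sketched with the standard-library `ipaddress` module alone. The CIDR list below is taken from the `network` column of the sample data in this post; the names `KNOWN_NETWORKS` and `match_ip` are illustrative, and a real deployment would presumably load the list from a table or file.

```python
import ipaddress

# CIDR blocks copied from the sample data; parsed once, reused for every lookup.
KNOWN_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "192.168.0.0/16",
        "10.0.0.0/8",
        "172.16.0.0/12",
        "2001:db8::/32",
        "fe80::/10",
        "::1/128",
    )
]


def match_ip(ip_string):
    """Return (first matching CIDR or None, IP version) for one address."""
    ip = ipaddress.ip_address(ip_string)
    for network in KNOWN_NETWORKS:
        # Compare only networks of the same IP version, then test membership.
        if network.version == ip.version and ip in network:
            return str(network), ip.version
    return None, ip.version


print(match_ip("192.168.1.100"))
print(match_ip("8.8.8.8"))
print(match_ip("fe80::1"))
```

The function returns the first match in list order, so overlapping blocks would need to be sorted by prefix length if the most specific network is wanted.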
---
Benefits for Teams
This UC-based approach:
- Makes Python logic widely reusable and centrally governed.
- Integrates seamlessly into varied analytics workflows.
---
Step-by-Step
1. Sample Data
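The input logs contain IPv4 and IPv6 addresses in `log_id` and `ip_address`; the `network` and `ip_version` columns show the values the UDTF adds: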
| log_id | ip_address | network | ip_version |
|--------|------------|---------|------------|
| log1 | 192.168.1.100 | 192.168.0.0/16 | 4 |
| log2 | 10.0.0.5 | 10.0.0.0/8 | 4 |
| log3 | 172.16.0.10 | 172.16.0.0/12 | 4 |
| log4 | 8.8.8.8 | null | 4 |
| log5 | 2001:db8::1 | 2001:db8::/32 | 6 |
| log6 | 2001:db8:85a3::8a2e:370:7334 | 2001:db8::/32 | 6 |
| log7 | fe80::1 | fe80::/10 | 6 |
| log8 | ::1 | ::1/128 | 6 |
| log9 | 2001:db8:1234:5678::1 | 2001:db8::/32 | 6 |
---
2. Define UDTF Class
- Use `t TABLE` to accept flexible schemas.
- Initialize heavy resources in `__init__` (once per partition).
- Implement row-by-row matching in `eval()`.
- Specify the class in SQL with HANDLER.
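
A handler class following these points might look like the sketch below. It is an assumption-laden illustration, not the exact class from the original announcement: the class name `IpMatcher`, the hard-coded CIDR list, and the choice to emit nulls for malformed addresses are all hypothetical. The column names `log_id` and `ip_address` come from the sample data, and plain dictionaries stand in for the row objects Spark would pass to `eval()`.

```python
import ipaddress


class IpMatcher:
    """Sketch of a UDTF handler that enriches log rows with a matching CIDR."""

    def __init__(self):
        # Heavy setup happens once per partition. The list is hard-coded here
        # for illustration; it could equally be read from a file.
        cidrs = ("192.168.0.0/16", "10.0.0.0/8", "172.16.0.0/12",
                 "2001:db8::/32", "fe80::/10", "::1/128")
        self.networks = [ipaddress.ip_network(c) for c in cidrs]

    def eval(self, row):
        # Called once per input row; yields one output tuple per row.
        ip_string = row["ip_address"]
        try:
            ip = ipaddress.ip_address(ip_string)
        except ValueError:
            # Malformed address: keep the row but leave the new columns empty.
            yield (row["log_id"], ip_string, None, None)
            return
        network = next(
            (str(n) for n in self.networks
             if n.version == ip.version and ip in n),
            None,
        )
        yield (row["log_id"], ip_string, network, ip.version)


# Local dry run with dictionaries standing in for Spark rows.
matcher = IpMatcher()
sample = [
    {"log_id": "log4", "ip_address": "8.8.8.8"},
    {"log_id": "log5", "ip_address": "2001:db8::1"},
    {"log_id": "bad", "ip_address": "not-an-ip"},
]
enriched = [out for row in sample for out in matcher.eval(row)]
print(enriched)
```

Running the class locally like this is a convenient way to check the matching logic before registering it.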
---
3. Register in Unity Catalog
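
As a rough sketch of the registration step, the snippet below assembles a `CREATE FUNCTION` statement as a Python string. The statement shape (`t TABLE` argument, `RETURNS TABLE`, `LANGUAGE PYTHON`, `HANDLER`, and a `$$`-delimited body) follows the steps described above, but the exact DDL accepted by your runtime should be checked against the Databricks documentation. The helper `build_create_statement`, the handler name `IpMatcher`, the shortened CIDR list, and the output column types are assumptions.

```python
# Handler source that would be embedded in the function body (illustrative).
HANDLER_SOURCE = '''
import ipaddress

class IpMatcher:
    def __init__(self):
        cidrs = ("10.0.0.0/8", "2001:db8::/32")
        self.networks = [ipaddress.ip_network(c) for c in cidrs]

    def eval(self, row):
        ip = ipaddress.ip_address(row["ip_address"])
        hit = next((str(n) for n in self.networks if ip in n), None)
        yield (row["log_id"], row["ip_address"], hit, ip.version)
'''

# Fail fast if the embedded handler source has a syntax error.
compile(HANDLER_SOURCE, "<handler>", "exec")


def build_create_statement(function_name, handler_class, handler_source):
    """Assemble the DDL text; the column types are assumed, not confirmed."""
    return (
        f"CREATE OR REPLACE FUNCTION {function_name}(t TABLE)\n"
        "RETURNS TABLE (log_id STRING, ip_address STRING, "
        "network STRING, ip_version INT)\n"
        "LANGUAGE PYTHON\n"
        f"HANDLER '{handler_class}'\n"
        f"AS $$\n{handler_source.strip()}\n$$"
    )


ddl = build_create_statement("ip_cidr_matcher", "IpMatcher", HANDLER_SOURCE)
print(ddl)
```

In a Databricks notebook the resulting string could be passed to `spark.sql(ddl)`, assuming a catalog and schema are selected and you hold the privileges needed to create functions there.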
Once registered, call via:
`SELECT * FROM ip_cidr_matcher(TABLE(my_ip_logs));`
---
Output Example
Enriched IP dataset with matched networks and IP versions (see table above).
The schema-agnostic design allows reuse across multiple datasets.
---
Conclusion
Python UDTFs in Unity Catalog give you:
- Centralized governance
- High performance for partition-based heavy logic
- Schema flexibility
- Easy SQL integration