In today’s data-driven landscape, organizations are moving beyond ad-hoc raw data sharing towards a disciplined approach of sharing data packages. This shift parallels the evolution of modern data architecture: the data lakehouse combines the flexibility of data lakes with the reliability of data warehouses. By packaging data with its metadata and integrity information, teams can treat datasets as first-class assets – analogous to software artifacts – that are easier to share, govern, and reuse for analytics and AI. This paper explores how decoupled lakehouse architectures set the stage for data packages, how Open Container Initiative (OCI) artifact standards enable unified distribution of data and models, and how applying software supply chain principles (signatures, attestations, SBOMs) to data pipelines fosters trust, reproducibility, and domain-specific discovery.
Evolving from Raw Data Sharing to Data Packages
Traditional data sharing often meant providing raw files or database dumps with minimal context. Consumers of such raw data struggled to understand origins, schema, or proper usage, leading to misinterpretation and rework. Enter data packages: a concept pioneered in open-data communities to bundle raw data with its metadata, making data self-describing and more usable. A data package typically contains the dataset (or a collection of related data files) alongside descriptive metadata – such as schema, definitions, and documentation – in a single portable unit. By including this contextual information, a data package provides clarity about what the data represents, how it was collected or processed, and how it should be interpreted. Data packages transform data into shareable data products that can be distributed much like software packages. Instead of simply handing off a raw CSV or Parquet file, a data provider delivers a self-contained package that consumers can load and understand without guesswork – enabling anyone to find out the dataset’s fields, types, and meaning. The result is data that is not only open but also accessible: richly described, versioned, and ready for integration.
Crucially, data packages address the shortcomings of raw dumps by being self-describing, trustable units of data. “Self-describing” means the package itself tells you its schema and provenance; “trustable” implies that the package can be verified and comes from a reliable source. By moving to data packages, organizations lay the groundwork for better governance – every dataset can carry its own documentation, schema, and lineage info – and for analytics and AI projects to more easily consume data without manual prep. In summary, the shift from raw data to data packages represents a maturation in data management: treating data with the same rigor as software, complete with metadata, version control, and quality checks.
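To make the idea concrete, the sketch below writes a minimal package descriptor next to a data file, loosely following the open Frictionless Data “datapackage.json” convention; the dataset name, file path, and field definitions are illustrative assumptions rather than a prescribed schema.

```python
import json
from pathlib import Path

# Minimal, illustrative package descriptor (values are assumptions,
# loosely modeled on the Frictionless Data "datapackage.json" layout).
descriptor = {
    "name": "daily-air-quality",        # hypothetical dataset name
    "version": "1.2.0",
    "description": "Daily PM2.5 readings aggregated per monitoring station.",
    "resources": [
        {
            "path": "data/readings.parquet",
            "format": "parquet",
            "schema": {
                "fields": [
                    {"name": "station_id", "type": "string"},
                    {"name": "date", "type": "date"},
                    {"name": "pm25_ugm3", "type": "number",
                     "description": "Mean PM2.5 in micrograms per cubic metre"},
                ]
            },
        }
    ],
    "licenses": [{"name": "CC-BY-4.0"}],
}

package_dir = Path("daily-air-quality")
package_dir.mkdir(exist_ok=True)
(package_dir / "datapackage.json").write_text(json.dumps(descriptor, indent=2))
```

A consumer who receives this folder can read the descriptor first and know the fields, types, and license terms before ever opening the data file.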
Lakehouse Architecture: Decoupled Storage, Compute, and Catalog
Supporting the rise of data packages is the modern lakehouse architecture, which provides an open and modular foundation for data management. In a lakehouse, the traditional all-in-one data platform is split into decoupled storage, compute, and catalog components, each independently managed yet integrated to deliver a unified experience. This decoupling is key to enabling flexible data sharing and packaging:
- Storage Layer (Data Lake): The lakehouse uses inexpensive, scalable storage (often cloud object stores) to hold raw data files in open formats. Data is stored in a schema-on-read fashion, typically as immutable files (e.g. Parquet) in a flat hierarchy. Because storage is decoupled from compute, it can scale independently and cost-effectively – you can keep petabytes of data in cloud storage without tying it to a particular processing engine. Decoupled storage means that sharing data is as simple as granting access to files or buckets, and multiple analytic engines can read the same data concurrently.
- Compute Layer: In a lakehouse, processing frameworks (SQL engines such as DuckDB, Spark, ML tools) are spun up as needed to read and write data in the storage layer. Multiple compute engines can operate on the same dataset, since they all speak the open file formats. Crucially, because compute is separated, one can scale processing power elastically (adding more nodes or using serverless engines) without moving or copying the data. This elasticity and independence of compute facilitates sharing and collaboration – different teams or workloads can run on the same data repository using their tools of choice (a small query sketch follows this list).
- Catalog/Metadata Layer: Complementing storage and compute is the metadata catalog – an independent layer that tracks the schema, table definitions, and transactional metadata of the data in storage. In a lakehouse, the catalog (sometimes a Hive Metastore or a table format like Iceberg/Delta) acts as the unified source of truth about what datasets exist, their schema, and other descriptors. This unified catalog provides data warehouse-like capabilities on top of the data lake: one can create or alter tables, enforce schemas, and manage transactions, all without tightly coupling to a single database system. By standardizing how data is described and addressed (e.g., with unique table names or paths), the catalog makes data discoverable and manageable. It also decouples metadata from any single compute engine – multiple tools can rely on the same catalog to understand the data.
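As a small illustration of decoupled storage and compute, the sketch below spins up DuckDB as an ephemeral local engine to query Parquet files that live in an object store; the bucket and path are placeholders, and it assumes the httpfs extension can be installed and that credentials are available in the environment.

```python
import duckdb

# Ephemeral, local compute against remote, decoupled storage.
con = duckdb.connect()               # in-memory engine; nothing to provision
con.execute("INSTALL httpfs;")       # extension for reading from object stores
con.execute("LOAD httpfs;")

# Hypothetical bucket/path; any engine that reads Parquet could scan the same files.
query = """
    SELECT station_id, avg(pm25_ugm3) AS avg_pm25
    FROM read_parquet('s3://example-lake/air_quality/*.parquet')
    GROUP BY station_id
    ORDER BY avg_pm25 DESC
    LIMIT 10
"""
print(con.execute(query).fetchall())
```

The same files could just as well be scanned by Spark or another engine, which is precisely the point of keeping storage independent of compute.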
This modular lakehouse design (storage + compute + catalog) sets the stage for data packages. Data is stored in open, accessible forms; metadata is available through a catalog; and compute can be provisioned on demand. Thus, packaging a dataset (or a table) along with its metadata simply leverages these components: the files in storage are the payload, and the catalog’s information can be included as the package metadata. Notably, because lakehouse storage is typically immutable and append-only, it aligns well with treating datasets as versioned packages. The decoupling also means a data package can be shared without needing to share an entire database instance – the consumer can plug the package’s data files into their own lakehouse environment and register the metadata in their catalog. In essence, the lakehouse’s open architecture makes data portable, and this openness and separation of concerns underpin the viability of shareable data packages in modern data ecosystems.
Data Packages: Self-Describing, Shareable Units of Data
With the architectural foundation in place, we turn to data packages themselves – the units of data sharing that are self-describing, shareable, and trustable. A data package can be thought of as a dataset encapsulated with all its necessary context, so that it can travel between systems or teams and remain interpretable and trustworthy. Several key characteristics define a useful data package:
- Self-Describing Metadata: Each data package contains rich metadata describing its contents – for example, a schema (fields, types, definitions), documentation of how data was collected or processed, and any relevant contextual details. This ensures that anyone using the package can understand the data without outside explanation. The metadata might be stored in a structured form (such as a JSON or YAML file) inside the package. By being self-describing, a package reduces ambiguity: consumers know the units of measurement, meaning of each column, and any caveats or quality notes.
- Immutability for Trust: Once published, a data package is typically immutable, meaning its contents (data and metadata) are not altered retrospectively. Immutability is crucial for building trust and supporting reproducibility – the package that a user downloads today will be bit-for-bit identical in the future, ensuring that analyses can be repeated on the same data. This principle mirrors how container images or software packages are often content-addressed and tamper-evident. Immutability makes it possible to cache and verify data artifacts confidently. Because they are immutable, data packages are well suited to reproducible research: results can be tied to a specific, unchanging version of the data.
- Versioning and Lineage: While each release of a data package is immutable, the package as a whole is versionable. New versions (v2, v3, etc.) can be published to incorporate updated data, corrections, or improvements, while the old versions remain available for reference. This versioning approach accommodates the evolving nature of datasets without overwriting or losing history. It also enables clear lineage tracking: one can trace which version of data was used in a given analysis or model training. Modern data package repositories often assign unique identifiers (like checksums or DOIs) to each version, so that they can be precisely cited and retrieved. In essence, versioning provides time travel for data – an ability to go back to the exact data used at a past point in time – which is invaluable for auditing and scientific rigor.
- Integrity and Trust Mechanisms: A robust data package includes mechanisms to ensure the data has not been corrupted or tampered with. This could be as simple as checksums for each file, or as advanced as a digital signature on the package (a verification sketch follows this list). By verifying these, consumers gain confidence that the package is exactly as the publisher intended (authenticity and integrity). We discuss trust further in the next sections, but it is worth noting here that treating data as a package opens the door to software-style trust guarantees (signing, verification, provenance metadata) on data assets. In other words, data packages are trustable units – they can be cryptographically verified, and their origin traced, much like one verifies the signature of a Linux distribution image.
- Shareability and Portability: Data packages are designed to be easily shareable across environments. Often they are encapsulated in a single file or a portable folder structure that can be published to a repository or sent over a network. By using open formats internally, they ensure that any tool or platform that adheres to those standards can use the data. This portability is enhanced by standards such as OCI (discussed below) which allow data packages to be pushed to and pulled from common registries. The ultimate goal is that a data package becomes as easy to distribute and deploy as a container image or a library package – enabling data scientists and engineers to fetch data products on-demand and immediately start analyzing, without manual wrangling.
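As a minimal sketch of the integrity checks described above, the function below hashes every file listed in a package descriptor and compares the digests against recorded values; it assumes the descriptor carries a “checksums” map of relative path to SHA-256, which is an illustrative convention rather than a standard.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return its hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_package(package_dir: str) -> bool:
    """Compare actual file digests against those recorded in the descriptor.

    Assumes the descriptor stores a 'checksums' map of relative path -> sha256.
    """
    root = Path(package_dir)
    descriptor = json.loads((root / "datapackage.json").read_text())
    ok = True
    for rel_path, recorded in descriptor.get("checksums", {}).items():
        if sha256_of(root / rel_path) != recorded:
            print(f"checksum mismatch: {rel_path}")
            ok = False
    return ok

if __name__ == "__main__":
    print("verified" if verify_package("daily-air-quality") else "verification failed")
```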
In combination, these features make data packages a powerful concept: they encapsulate not just data itself, but the information needed to understand and trust that data. By moving to a “data package” paradigm, organizations ensure that their data assets are ready for collaboration and AI/analytics use, much like well-documented APIs or well-tested software libraries are ready for integration. This shift greatly improves data governance (each package is a controlled, versioned object) and composability (different data packages can be assembled in a pipeline with minimal friction, since each comes with known schema and quality).
OCI Artifacts: A Unified Packaging Standard for Data and Models
Implementing data packages at scale requires a standardized packaging and distribution method. An emerging solution comes from the software world: the Open Container Initiative (OCI) artifact format. Originally developed to standardize container images, the OCI registry and image specifications have been generalized to support arbitrary artifact types – not just containers. This means the same registry infrastructure that stores Docker/OCI images can also store datasets, machine learning models, or other digital content. OCI artifacts provide a convenient packaging envelope and transport mechanism for data packages in the lakehouse context.

An OCI artifact is, in essence, a content-addressed bundle stored in an OCI-compliant registry. OCI registries (such as Docker Hub, Quay.io, AWS ECR, or Harbor) are designed to host content with strong integrity guarantees (via hashes) and easy distribution over HTTP APIs. By convention, these registries can accept any content type as an “artifact” – including data tarballs, model files, or notebooks – by using the OCI image manifest to describe it. In practice, OCI artifacts let us treat datasets and models as first-class artifacts similar to container images, pushing and pulling them from a registry. This approach leverages a widely adopted ecosystem: organizations likely already have container registries and supporting tooling (CLIs, CI/CD integrations, permission management), so using OCI artifacts for data means piggybacking on proven technology.
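As a hedged sketch of what publishing and fetching such an artifact could look like, the snippet below drives the ORAS CLI from Python; the registry host, repository name, and media types are placeholders, and it assumes an authenticated oras binary is on the PATH.

```python
import subprocess

# Hypothetical artifact reference in an OCI-compliant registry.
REF = "registry.example.com/data-packages/air-quality:1.2.0"

def push_data_package() -> None:
    # Push the data file and its descriptor as layers of a single OCI artifact.
    subprocess.run(
        [
            "oras", "push", REF,
            "--artifact-type", "application/vnd.example.dataset",
            "data/readings.parquet:application/vnd.apache.parquet",
            "datapackage.json:application/json",
        ],
        check=True,
    )

def pull_data_package(dest: str = "./downloaded") -> None:
    # Pull the same artifact back down; any OCI registry can serve it.
    subprocess.run(["oras", "pull", REF, "-o", dest], check=True)

if __name__ == "__main__":
    push_data_package()
    pull_data_package()
```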
The benefits of using OCI as a packaging standard for data include:
- Content Addressability: Each artifact is identified by a digest (cryptographic hash) of its contents. This ensures immutability and integrity – if the data package changes, its digest changes. Consumers can pull an artifact by a specific digest to guarantee they have the exact version intended. The content-addressable nature of OCI artifacts provides integrity by design: if you download a data package artifact by its hash, you can be confident it has not been altered (the registry will verify the hash).
- Standard Manifest and Metadata: OCI artifacts use the same manifest structure as container images, which can include references to layers (blobs of data) and annotations. This means a data package can be broken into layers (for example, data files as separate compressed layers) and annotated with metadata (such as name, version, and type). The OCI spec even allows a custom “artifactType” field to specify the type of artifact (e.g., “application/vnd.example.dataset”), ensuring that tools know this is a dataset package; a sketch of such a manifest follows this list. By conforming to this spec, data packages achieve a self-describing structure within the OCI ecosystem as well.
- Uniform Distribution Channel: Using OCI artifacts, one can distribute data through container registries the same way developers distribute software containers. Teams can use familiar docker pull-style workflows or the ORAS (OCI Registry As Storage) CLI to fetch a dataset package. This uniformity means no separate custom data catalogs or file servers are needed for sharing – instead, a unified artifact repository can house code, container images, models, and data together. This is particularly powerful for AI/ML workflows: an ML model artifact can declare which dataset artifact (by digest/tag) it was trained on, all within the same registry system. The entire pipeline’s artifacts live in one place with consistent version tagging.
- Security and Access Control: Container registries come with robust access control, authentication, and encryption in transit. By using them for data, organizations inherit these security features. They can control who can push or pull certain data packages, audit access logs, and ensure data packages are encrypted during download/upload. Furthermore, integration with existing DevOps processes (CI pipelines can automatically push new data package versions to a registry, etc.) becomes straightforward.
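To make the manifest structure concrete, here is a hand-written sketch of what an OCI image manifest describing a dataset artifact might contain; the digests, sizes, media types, and annotation values are illustrative assumptions.

```python
# Illustrative OCI image manifest for a dataset artifact (placeholder values).
dataset_manifest = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "artifactType": "application/vnd.example.dataset",   # marks this as a data package
    "config": {
        "mediaType": "application/vnd.oci.empty.v1+json",
        "digest": "sha256:44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a",
        "size": 2,
    },
    "layers": [
        {   # the data itself, one file per layer
            "mediaType": "application/vnd.apache.parquet",
            "digest": "sha256:<digest-of-readings.parquet>",
            "size": 104857600,
        },
        {   # the self-describing metadata travels with the data
            "mediaType": "application/json",
            "digest": "sha256:<digest-of-datapackage.json>",
            "size": 2048,
        },
    ],
    "annotations": {
        "org.opencontainers.image.title": "air-quality",
        "org.opencontainers.image.version": "1.2.0",
    },
}
```

The manifest is what the registry content-addresses: change a single byte of any layer and the digests, and therefore the artifact’s identity, change with it.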
In summary, OCI artifacts serve as a packaging and distribution backbone for data packages, making it possible to implement “datasets as packages” in a scalable, standard way. The flexibility of OCI manifests means we can include not just the data and metadata in the package, but also attach additional content – such as signatures, checksums, or even a Software Bill of Materials (SBOM) – as separate layers or attached artifacts. This bridges into the next topic: how principles of software supply chain security can be applied to data packages and pipelines.
Applying Software Supply Chain Principles to Data Pipelines
As data engineering adopts a product mindset, it increasingly mirrors software development – and thus can benefit from the security and governance principles developed for software supply chains. In a software supply chain, we worry about things like verifying the origin of components, ensuring builds are reproducible, and tracking dependencies (via SBOMs). The data supply chain – encompassing raw data ingestion, transformation pipelines, and final data products – has analogous concerns. Here’s how key software supply chain concepts translate to data:
- Digital Signatures for Data Artifacts: Just as container images or software binaries are often signed by their publisher to prove authenticity, data packages can be signed. A publisher can use a tool (for example, Sigstore/cosign in the cloud-native world) to sign the digest of a data package artifact; a signing sketch follows this list. Consumers then verify this signature (using the publisher’s public key or a transparency log) to confirm the data indeed came from the trusted source and hasn’t been tampered with. Digital signatures imbue data with a verifiable chain of custody – critical for sensitive or regulated data sharing. In practice, integrating signing into data pipelines means each stage that produces an output dataset signs that output. Downstream users or automated processes only accept data with valid signatures from expected parties. This creates an immutable audit trail of data origins.
- Attestations and Provenance Metadata: Beyond signing the final artifact, we often need to capture how data was produced. Attestations are a mechanism to record metadata about pipeline steps in a verifiable way. For instance, an attestation could state “Job X ran at time Y using script Z and produced artifact A,” and this statement itself is signed by the infrastructure or pipeline orchestrator. Attestations provide provenance – they answer: was the data package generated by an approved process? Did all validation checks pass? Data engineers can set up pipelines that emit signed attestations for each step. Later, anyone can inspect or verify these attestations to build trust in the final data product. In essence, attestations for data pipelines create a ledger of lineage: each transform, each intermediate output is documented and cryptographically linked.
- SBOM for Data and Workflows: A Software Bill of Materials (SBOM) lists components in a software artifact; analogously, we can maintain a “Data BOM” for our data products. This would enumerate the constituent parts of a data pipeline: data sources (with their versions), transformation code, libraries used, and configuration parameters. By packaging this information (possibly as metadata in the data package or as a separate artifact linked via OCI references), one gains transparency into what went into a derived dataset. If a vulnerability or error is later found in one of the inputs, the Data BOM allows quick identification of affected data products. This practice enhances composability and trust: teams can safely build new data products by combining existing packages, knowing each comes with a manifest of its contents and lineage.
- Policy Enforcement and Automation: Borrowing again from software supply chain management, we can enforce policies in data pipelines such as: only allow ingestion from certain signed sources, or require an attestation that data quality tests ran (and passed) before a data package is published. Automation tools can check these constraints at pipeline runtime. This ensures that every data package that enters the catalog has been through proper vetting, much like software in a secure supply chain must pass tests and security scans. Over time, an ecosystem of dataops tools is emerging to handle this, making it easier to define and audit the chain of data custody and transformations.
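As one hedged sketch of wiring signing into a pipeline, the snippet below shells out to cosign to sign a data package artifact after it is pushed and to verify it before it is consumed; the artifact reference and key file names are placeholders, and a cosign binary with access to the registry is assumed.

```python
import subprocess

# Hypothetical reference to a published data package artifact.
ARTIFACT = "registry.example.com/data-packages/air-quality:1.2.0"

def sign_data_package() -> None:
    # Publisher side: sign the artifact digest with the team's private key.
    subprocess.run(["cosign", "sign", "--key", "cosign.key", ARTIFACT], check=True)

def verify_data_package() -> None:
    # Consumer side: refuse to proceed unless the signature checks out.
    # check=True raises CalledProcessError on a missing or invalid signature.
    subprocess.run(["cosign", "verify", "--key", "cosign.pub", ARTIFACT], check=True)

if __name__ == "__main__":
    sign_data_package()
    verify_data_package()
    print("signature verified: data package accepted")
```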
By applying these principles, organizations create a “trusted data supply chain.” Each dataset is traceable back to its raw ingredients, each processing step is documented and verified, and each output is secured against tampering. This dramatically improves confidence for downstream data consumers (analysts, ML models, or even external partners): they can trust that data packages are what they claim to be and understand their pedigree. It also enables modularity – data from one team can be safely incorporated by another, because it comes with guarantees, much like using a well-vetted open-source library in code. As a result, data pipelines become more composable; for instance, a machine learning pipeline can automatically pull the latest certified data package for training, verify its signature and provenance, and proceed, thereby reducing manual checks and errors.
Reproducibility and Lineage: Lessons from Scientific Practice
The push for data packaging and supply chain security is tightly linked to the age-old goal of scientific reproducibility. In science and analytics, results must be reproducible to be trusted – one should be able to re-run an experiment or analysis and obtain the same findings. This requires careful tracking of exactly which data, code, and parameters were used. Lack of data versioning and poor record-keeping have contributed to a reproducibility crisis in both academia and industry, where reported results can’t be repeated because the underlying data or code changed or was lost. Modern data architecture and packaging directly address this by emphasizing lineage, versioning, and auditability.
Data packages make reproducibility achievable. Because every package is versioned and immutable, an analysis can reference a specific package version and anyone later can retrieve that exact same data. This is analogous to pinning a software experiment to specific library versions. Moreover, through metadata and attestations, one can record the entire lineage: which raw data went in, what transformations were applied, and so on. If questions arise about a result, an auditor can trace back from the final data package through each intermediate step (using the signed attestations) to the original raw sources – providing a data provenance trail.
Academic and scientific communities have been early adopters of these ideas. For instance, data repositories now publish datasets with DOIs and immutable versions to support reproducible research. These are precisely the practices the data industry is now embracing: by systematically versioning data and capturing metadata at each step, we create the conditions for any analysis to be reproduced later by others (or by automated systems).
Lineage and auditability go hand in hand with reproducibility. From a governance perspective, it’s important to know where data came from (lineage) and who or what touched it along the way (audit trail). With packaged data units, lineage can be recorded at a fine grain: each table or file can carry references to its parent sources and the transformation job that produced it. Modern metadata catalogs often store this information, allowing for lineage graphs to be visualized. This not only helps in reproducing results but also in impact analysis – if a source data package is found faulty, lineage helps find all downstream results that need updates.
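One lightweight way to record such lineage is to embed a provenance block in each derived package’s metadata, pointing at parent packages by digest and at the job that produced it; the structure below is an assumed convention for illustration, not a standard schema.

```python
# Illustrative lineage record carried inside a derived package's metadata.
lineage = {
    "package": {"name": "air-quality-daily-agg", "version": "2.0.1"},
    "inputs": [
        {"name": "raw-sensor-feed", "version": "3.4.1",
         "digest": "sha256:<digest-of-input-package>"},
    ],
    "transform": {
        "job_id": "pipeline-run-2024-05-17-0042",          # hypothetical orchestrator run ID
        "code_ref": "git+https://example.com/etl.git@<commit>",
        "parameters": {"aggregation": "daily_mean"},
    },
}
```

Walking these records backwards from any package reconstructs the provenance trail described above.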
The emphasis on reproducibility is also motivated by the needs of regulated industries and AI ethics. Regulations may require demonstrating how a reported number was computed, or how a machine learning model’s training data was gathered and processed. By adhering to these practices (versioned data packages, recorded lineage), such audits become significantly easier. The ultimate vision is that any data-driven insight can be traced, verified, and if necessary, recreated – instilling greater confidence internally and externally.
Metadata Catalogs Across Domains: The STAC Example
While the concepts of data packaging and cataloging are universal, different domains often require domain-specific metadata and standards. A key point is that metadata catalogs are often tailored to the domain’s needs, yet they share the same structural goals: to make data discoverable, understandable, and usable across stakeholders. One illustrative example is the SpatioTemporal Asset Catalog (STAC) in the Earth Observation domain.
STAC is an open specification designed to standardize how geospatial imagery and related assets are described. Satellite imagery datasets have unique metadata needs – such as geospatial extent (coordinates, area of coverage), temporal information (acquisition date/time), sensor details, resolution, etc. Historically, each space agency or provider might expose this metadata differently, making it hard for users to search across multiple sources. STAC emerged to solve this by providing a common language to describe geospatial data so it can be easily indexed and discovered. In STAC, each item (e.g., a satellite image or mosaic) is a JSON document with standardized fields for location, time, and other properties. These items are organized into catalogs and collections that any STAC-compliant tool can crawl or query.
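For orientation, a pared-down STAC item might look like the dictionary below; the identifier, geometry, and asset URL are made up, and some required housekeeping fields (links, extension declarations) are trimmed for brevity.

```python
# Pared-down STAC item (illustrative values; some required fields trimmed).
stac_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "S2A_20230615_T10SEG_example",            # hypothetical scene ID
    "bbox": [-122.6, 37.6, -121.8, 38.3],
    "geometry": {"type": "Polygon", "coordinates": [[
        [-122.6, 37.6], [-121.8, 37.6], [-121.8, 38.3], [-122.6, 38.3], [-122.6, 37.6],
    ]]},
    "properties": {
        "datetime": "2023-06-15T18:59:00Z",
        "eo:cloud_cover": 4.2,                      # from the electro-optical extension
    },
    "assets": {
        "visual": {
            "href": "https://example.com/S2A_20230615_T10SEG_visual.tif",
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        }
    },
    "links": [],
}
```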
The structure of STAC is analogous to a generic data catalog: there’s a catalog (which can be a hierarchy of datasets), each containing collections (groupings of items, like all images from a satellite mission), and within those, items (individual assets, each with metadata and links to the actual data files). This is very similar to how a general data lakehouse catalog might organize databases, tables, and files – except tuned to the domain’s concepts. STAC’s success demonstrates that when a community agrees on a metadata schema and API, it dramatically improves data findability and interoperability. For example, a scientist can use a single STAC API query to find “all satellite images over California in June 2023 with less than 10% cloud cover,” even if those images come from different satellites or providers, because they all share the STAC metadata format. Bundled and distributed together with the data, STAC metadata enables the “domain-specific data package” approach for Earth observation: each image or dataset is described in a uniform way, and thus can be treated as a portable asset.
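The “single query across providers” point can be tried with the pystac-client library; the sketch below runs the California example against a public STAC API whose endpoint and collection name are assumptions that may differ in practice.

```python
from pystac_client import Client

# Hypothetical public STAC API endpoint and collection name.
catalog = Client.open("https://earth-search.aws.element84.com/v1")

search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-124.4, 32.5, -114.1, 42.0],        # rough bounding box of California
    datetime="2023-06-01/2023-06-30",
    query={"eo:cloud_cover": {"lt": 10}},     # less than 10% cloud cover
    max_items=20,
)

for item in search.items():
    print(item.id, item.properties.get("eo:cloud_cover"))
```

Any provider exposing a STAC-compliant API can answer the same query, which is exactly the interoperability payoff described above.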
The STAC example generalizes to other domains: whether it’s healthcare, finance, or genomics, each domain may introduce specialized metadata (e.g., patient anonymization codes, financial instrument identifiers, gene ontology terms). These can often be incorporated as extensions or additional fields on top of a core metadata schema. The important point is that the goals remain the same: to package data with sufficient metadata and standardized structure so that others in the domain can easily find and use it. Domain-specific catalogs still align with the lakehouse and data package philosophy – they often use open formats (JSON, CSV, etc.), emphasize unique identifiers and versioning, and encourage open APIs for search. In fact, STAC’s design includes an API and browser, showing how a catalog can enable both human and machine discovery of data assets.
Another insight from STAC is the use of extensions – it has a mechanism to extend the core spec with additional fields (for example, a “satellite” extension might add orbit details). This echoes the idea that data packages and catalogs in general should be flexible and extensible to accommodate new requirements while staying interoperable. For organizations implementing their own internal data catalogs, a lesson is to follow a consistent structure (perhaps inspired by open standards) but allow per-domain augmentation in a controlled way.
In summary, domain-specific metadata catalogs like STAC demonstrate that while the content of metadata may differ, the architecture of providing self-describing, searchable, shareable data packages is broadly applicable. The lakehouse’s catalog and open storage format approach can incorporate such standards, enabling, for example, an enterprise to manage a STAC-compliant catalog for imagery alongside other catalogs for tabular data – all part of one unified data platform. The outcome is improved data discovery and reuse, within and across domains.
Conclusion
The convergence of these trends – lakehouse architectures, data package standards, container-based distribution, and supply chain security – is reshaping how organizations handle data. The move from raw data sharing to data package sharing is a move towards greater structure, trust, and agility in data ecosystems. By decoupling storage, compute, and catalog, the lakehouse provides the flexible canvas on which data packages can be created and shared without friction. By packaging data with metadata (and even code or models), we create units of data that are self-contained and meaningful, ready for consumption by AI models or analytical tools with minimal prep work.
In embracing OCI artifacts for data, we leverage a proven, scalable distribution mechanism that treats data as a first-class citizen. This unification means that a single pipeline can handle code, configuration, and data in similar ways – with common tooling, security scans, and deployments. Moreover, applying software supply chain principles to data injects much-needed trust: we can know who produced a dataset, how, and whether it’s been altered – answering questions that are vital for compliance and for internal quality control.
The emphasis on reproducibility and lineage ensures that as we derive insights or train ML models, we can always retrace our steps. Reproducibility is not just an academic ideal but a practical necessity when models might be audited or results need verification. Data packages with strict versioning and provenance make this feasible, turning “rehydrating” an analysis into a straightforward task instead of a detective mission.
Finally, recognizing that one size does not fit all, we acknowledge the need for domain-specific metadata and catalogs. The key is to follow common structural principles (like STAC does for geospatial data) so that even specialized datasets adhere to the overarching goal of being findable, accessible, interoperable, and reusable.
In a world increasingly driven by data and AI, these approaches lay the groundwork for a more robust data ecosystem. Organizations that adopt data packages and the associated practices will find that their data is not only easier to manage and govern, but also primed to unlock value – analysts spend less time cleaning and more time exploring, models train on known-good data, and data sharing with partners becomes safer and simpler. The journey from raw data to packaged, trustworthy data products is a journey to treating data with the same respect as code: with rigor, automation, and an eye towards collaboration. It is a journey well worth undertaking for any data-driven enterprise striving for excellence in the era of Data & AI.