
Building a Trusted and Reproducible Data Ecosystem with OCI

In today’s data-driven landscape, organizations are moving beyond ad-hoc raw data sharing towards a disciplined approach of sharing data packages. This shift parallels the evolution of modern data architecture: the data lakehouse combines the flexibility of data lakes with the reliability of data warehouses. By packaging data with its metadata and integrity information, teams can treat datasets as first-class assets – analogous to software artifacts – that are easier to share, govern, and reuse for analytics and AI. This paper explores how decoupled lakehouse architectures set the stage for data packages, how Open Container Initiative (OCI) artifact standards enable unified distribution of data and models, and how applying software supply chain principles (signatures, attestations, SBOMs) to data pipelines fosters trust, reproducibility, and domain-specific discovery.

Evolving from Raw Data Sharing to Data Packages

Traditional data sharing often meant providing raw files or database dumps with minimal context. Consumers of such raw data struggled to understand origins, schema, or proper usage, leading to misinterpretation and wasted effort. Enter data packages: a concept pioneered in open-data communities to bundle raw data with its metadata, making data self-describing and more usable. A data package typically contains the dataset (or a collection of related data files) alongside descriptive metadata – such as schema, definitions, and documentation – in a single portable unit. By including this contextual information, a data package provides clarity about what the data represents, how it was collected or processed, and how it should be interpreted. Data packages transform data into shareable data products that can be distributed much like software packages. Instead of simply handing off a raw CSV or Parquet file, a data provider delivers a self-contained package that consumers can load and understand without guesswork – letting any consumer discover the dataset’s fields, types, and meaning. The result is data that is not only open but also accessible: richly described, versioned, and ready for integration.
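As a minimal illustration of what such a self-describing unit can look like, the sketch below builds a package descriptor as a plain Python dictionary and writes it next to the data files. The dataset name, fields, and paths are hypothetical, and the structure loosely follows the spirit of descriptor conventions used in open-data communities rather than any single specification.

```python
import json
from pathlib import Path

# Illustrative descriptor for a data package: the dataset plus the metadata a
# consumer needs to interpret it (schema, sources, version). All values are hypothetical.
descriptor = {
    "name": "city-air-quality",
    "version": "1.2.0",
    "description": "Hourly PM2.5 readings aggregated per city.",
    "sources": [{"title": "Municipal sensor network", "path": "s3://raw-bucket/sensors/"}],
    "resources": [
        {
            "path": "data/air_quality.parquet",
            "format": "parquet",
            "schema": {
                "fields": [
                    {"name": "city", "type": "string"},
                    {"name": "timestamp", "type": "datetime"},
                    {"name": "pm25_ugm3", "type": "number"},
                ]
            },
        }
    ],
}

# The descriptor travels with the data files as one portable unit.
Path("datapackage.json").write_text(json.dumps(descriptor, indent=2))
```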

Crucially, data packages address the shortcomings of raw dumps by being self-describing, trustable units of data. “Self-describing” means the package itself tells you its schema and provenance; “trustable” implies that the package can be verified and comes from a reliable source. By moving to data packages, organizations lay the groundwork for better governance – every dataset can carry its own documentation, schema, and lineage info – and for analytics and AI projects to more easily consume data without manual prep. In summary, the shift from raw data to data packages represents a maturation in data management: treating data with the same rigor as software, complete with metadata, version control, and quality checks.

Lakehouse Architecture: Decoupled Storage, Compute, and Catalog

Supporting the rise of data packages is the modern lakehouse architecture, which provides an open and modular foundation for data management. In a lakehouse, the traditional all-in-one data platform is split into decoupled storage, compute, and catalog components, each independently managed yet integrated to deliver a unified experience. This decoupling is key to enabling flexible data sharing and packaging:

- Storage holds the data itself in open, engine-agnostic formats (for example, Parquet files on object storage), typically immutable and append-only.
- Compute is provisioned on demand: query engines and processing frameworks read directly from shared storage rather than owning the data.
- Catalog carries the metadata: schemas, table definitions, versions, and lineage information that describe what is in storage and how to interpret it.

This modular lakehouse design (storage + compute + catalog) sets the stage for data packages. Data is stored in open, accessible forms; metadata is available through a catalog; and compute can be provisioned on demand. Thus, packaging a dataset (or a table) along with its metadata simply leverages these components: the files in storage are the payload, and the catalog’s information can be included as the package metadata. Notably, because lakehouse storage is typically immutable and append-only, it aligns well with treating datasets as versioned packages. The decoupling also means a data package can be shared without needing to share an entire database instance – the consumer can plug the package’s data files into their own lakehouse environment and register the metadata in their catalog. In essence, the lakehouse’s open architecture makes data portable, and this openness and separation of concerns is what makes shareable data packages viable in modern data ecosystems.
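To make the decoupling concrete, here is a small sketch, assuming DuckDB as one interchangeable query engine and a hypothetical S3 bucket: the Parquet files stay in object storage, and compute is simply pointed at them on demand (credential and region configuration are omitted).

```python
import duckdb

# Storage and compute are decoupled: the Parquet files live in object storage,
# and an ephemeral query engine is attached to them only for the duration of the query.
con = duckdb.connect()
con.execute("INSTALL httpfs;")   # extension for reading over s3:// / https://
con.execute("LOAD httpfs;")

# Bucket and path are hypothetical; credentials would be configured separately.
result = con.sql("""
    SELECT city, avg(pm25_ugm3) AS avg_pm25
    FROM read_parquet('s3://lakehouse-bucket/air_quality/*.parquet')
    GROUP BY city
""")
print(result)
```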

Data Packages: Self-Describing, Shareable Units of Data

With the architectural foundation in place, we turn to data packages themselves – the units of data sharing that are self-describing, shareable, and trustable. A data package can be thought of as a dataset encapsulated with all its necessary context, so that it can travel between systems or teams and remain interpretable and trustworthy. Several key characteristics define a useful data package:

- Self-describing: the package carries its own schema, documentation, and provenance, so consumers can interpret the data without out-of-band knowledge.
- Versioned and immutable: each release is a fixed, identifiable snapshot, so analyses can reference the exact data they used.
- Trustable: checksums, signatures, and provenance records let consumers verify the package’s integrity and origin.
- Shareable and portable: the package is a self-contained unit that can move between teams, registries, and lakehouse environments without dragging an entire database along.

In combination, these features make data packages a powerful concept: they encapsulate not just data itself, but the information needed to understand and trust that data. By moving to a “data package” paradigm, organizations ensure that their data assets are ready for collaboration and AI/analytics use, much like well-documented APIs or well-tested software libraries are ready for integration. This shift greatly improves data governance (each package is a controlled, versioned object) and composability (different data packages can be assembled in a pipeline with minimal friction, since each comes with known schema and quality).

OCI Artifacts: A Unified Packaging Standard for Data and Models

Implementing data packages at scale requires a standardized packaging and distribution method. An emerging solution comes from the software world: the Open Container Initiative (OCI) artifact format. Originally developed to standardize container images, the OCI registry and image specifications have been generalized to support arbitrary artifact types – not just containers. This means we can use the same registry infrastructure that stores Docker/OCI images to also store datasets, machine learning models, or other digital content. OCI artifacts provide a convenient packaging envelope and transport mechanism for data packages in the lakehouse context.

An OCI artifact, in essence, is a content-addressed bundle stored in an OCI-compliant registry. OCI registries (like Docker Hub, Quay.io, AWS ECR, Harbor, etc.) are designed to host content with strong integrity guarantees (via hashes) and easy distribution via HTTP APIs. By convention, these registries can accept any content type as an “artifact” – including data tarballs, model files, or notebooks – by using the OCI image manifest to describe them. In practice, OCI artifacts allow us to treat datasets and models as first-class artifacts similar to container images, pushing and pulling them from a registry. This approach leverages a widely adopted ecosystem: organizations likely already have container registries and supporting tooling (CLIs, CI/CD integrations, permission management), so using OCI artifacts for data means piggybacking on proven technology.
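As a rough illustration of what a registry stores for such an artifact, the sketch below assembles an OCI image manifest for a data package by hand; the file paths and the dataset-specific media and artifact types are illustrative choices, not fixed standards.

```python
import hashlib
import json
from pathlib import Path


def oci_descriptor(path: str, media_type: str) -> dict:
    """Build an OCI content descriptor: media type, sha256 digest, and size."""
    data = Path(path).read_bytes()
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(data).hexdigest(),
        "size": len(data),
        "annotations": {"org.opencontainers.image.title": Path(path).name},
    }


# An OCI image manifest describing a data package: the Parquet file and its
# descriptor metadata become layers; artifactType marks this as a dataset rather
# than a runnable container image. The dataset-specific types are illustrative.
manifest = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "artifactType": "application/vnd.example.datapackage.v1",
    "config": {
        "mediaType": "application/vnd.oci.empty.v1+json",
        "digest": "sha256:" + hashlib.sha256(b"{}").hexdigest(),
        "size": 2,
        "data": "e30=",   # base64 of "{}" – the standard empty config blob
    },
    "layers": [
        oci_descriptor("data/air_quality.parquet", "application/vnd.apache.parquet"),
        oci_descriptor("datapackage.json", "application/json"),
    ],
}
print(json.dumps(manifest, indent=2))
```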

The benefits of using OCI as a packaging standard for data include:

- Content addressing and integrity: every blob and manifest is identified by a cryptographic digest, so consumers can verify that what they pulled is exactly what was pushed.
- Reuse of proven infrastructure: existing registries (Docker Hub, Quay.io, AWS ECR, Harbor, and others), along with their access controls, CLIs, and CI/CD integrations, work for data packages without new systems.
- Unified distribution of data and models: datasets, ML models, and related artifacts are pushed and pulled with the same workflows as container images.
- Extensibility: signatures, checksums, SBOMs, and other attestations can be attached to the same package as additional layers or referenced artifacts.

In summary, OCI artifacts serve as a packaging and distribution backbone for data packages, making it possible to implement “datasets as packages” in a scalable, standard way. The flexibility of OCI manifests means we can include not just the data and metadata in the package, but also attach additional content like signatures, checksums, or even Software Bill of Materials (SBOMs) as separate layers or attached artifacts. This bridges into the next topic: how principles of software supply chain security can be applied to data packages and pipelines.
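One way such attachments can be expressed is the OCI 1.1 `subject`/referrers model, sketched below with a placeholder digest and a stub SBOM: the attachment is itself a small manifest that points back at the data package it describes.

```python
import hashlib
import json

# Hypothetical digest/size of the data-package manifest being attached to;
# in practice the registry reports these when the package is pushed.
subject = {
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "digest": "sha256:" + "0" * 64,   # placeholder digest
    "size": 1234,                     # placeholder size
}

# Stub SBOM payload, stand-in for a real bill of materials document.
sbom_bytes = json.dumps({"name": "city-air-quality", "components": []}).encode()

# The attachment is itself an OCI manifest whose "subject" field points back at the
# data package; registries implementing the OCI 1.1 referrers API let clients list
# such attachments (signatures, SBOMs, attestations) for a given digest.
attachment = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "artifactType": "application/spdx+json",
    "config": {
        "mediaType": "application/vnd.oci.empty.v1+json",
        "digest": "sha256:" + hashlib.sha256(b"{}").hexdigest(),
        "size": 2,
        "data": "e30=",
    },
    "layers": [{
        "mediaType": "application/spdx+json",
        "digest": "sha256:" + hashlib.sha256(sbom_bytes).hexdigest(),
        "size": len(sbom_bytes),
    }],
    "subject": subject,
}
print(json.dumps(attachment, indent=2))
```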

Applying Software Supply Chain Principles to Data Pipelines

As data engineering adopts a product mindset, it increasingly mirrors software development – and thus can benefit from the security and governance principles developed for software supply chains. In a software supply chain, we worry about things like verifying the origin of components, ensuring builds are reproducible, and tracking dependencies (via SBOMs). The data supply chain – encompassing raw data ingestion, transformation pipelines, and final data products – has analogous concerns. Here’s how key software supply chain concepts translate to data:

- Signatures and provenance: just as software artifacts are signed to prove their origin, data packages can be signed so consumers know who produced them and that they have not been altered.
- Attestations: signed statements about how an artifact was built become signed records of which pipeline step produced a dataset, from which inputs, and under what configuration (sketched below).
- SBOMs: a bill of materials listing a build’s dependencies becomes a “data bill of materials” listing the upstream datasets and code a data product was derived from.
- Reproducible builds: the expectation that the same source yields the same binary becomes the expectation that the same raw data and code versions yield the same data package.
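The sketch below illustrates the producer side under these assumptions: a hypothetical ETL job hashes the package it produced, records an in-toto-style provenance statement (the field names are illustrative), and signs it with an Ed25519 key. In practice the key would come from a KMS or a keyless signing service rather than being generated in-line.

```python
import hashlib
import json
from datetime import datetime, timezone

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Producer side: hash the packaged dataset, record how it was built, and sign.
with open("data/air_quality.parquet", "rb") as f:          # hypothetical output file
    package_digest = "sha256:" + hashlib.sha256(f.read()).hexdigest()

# A provenance statement in the spirit of in-toto/SLSA: what was produced,
# from which inputs, by which pipeline step. All identifiers are illustrative.
attestation = {
    "subject": [{"name": "city-air-quality", "digest": package_digest}],
    "predicateType": "https://example.com/data-provenance/v1",
    "predicate": {
        "builder": "etl/aggregate_air_quality.py@v3.1",
        "materials": [{"uri": "s3://raw-bucket/sensors/2024-06/", "role": "raw-input"}],
        "finishedOn": datetime.now(timezone.utc).isoformat(),
    },
}

# Sign the attestation so consumers can verify both origin and integrity.
private_key = Ed25519PrivateKey.generate()   # in practice loaded from a KMS / signing service
payload = json.dumps(attestation, sort_keys=True).encode()
signature = private_key.sign(payload)
public_key = private_key.public_key()        # distributed to consumers out of band
```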

By applying these principles, organizations create a “trusted data supply chain.” Each dataset is traceable back to its raw ingredients, each processing step is documented and verified, and each output is secured against tampering. This dramatically improves confidence for downstream data consumers (analysts, ML models, or even external partners): they can trust that data packages are what they claim to be and understand their pedigree. It also enables modularity – data from one team can be safely incorporated by another, because it comes with guarantees, much like using a well-vetted open-source library in code. As a result, data pipelines become more composable; for instance, a machine learning pipeline can automatically pull the latest certified data package for training, verify its signature and provenance, and proceed, thereby reducing manual checks and errors.
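On the consumer side, the corresponding check is small: verify the attestation’s signature, then confirm that the data file on disk matches the digest the attestation vouches for. The helper below is a sketch that takes those pieces as inputs.

```python
import hashlib

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey


def verify_package(public_key: Ed25519PublicKey, payload: bytes, signature: bytes,
                   file_path: str, expected_digest: str) -> bool:
    """Check the attestation signature, then check the data file against its digest."""
    try:
        public_key.verify(signature, payload)   # raises InvalidSignature if tampered with
    except InvalidSignature:
        return False
    with open(file_path, "rb") as f:
        actual = "sha256:" + hashlib.sha256(f.read()).hexdigest()
    return actual == expected_digest
```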

Reproducibility and Lineage: Lessons from Scientific Practice

The push for data packaging and supply chain security is tightly linked to the age-old goal of scientific reproducibility. In science and analytics, results must be reproducible to be trusted – one should be able to re-run an experiment or analysis and obtain the same findings. This requires careful tracking of exactly which data, code, and parameters were used. Lack of data versioning and poor record-keeping have contributed to a reproducibility crisis in both academia and industry, where reported results can’t be repeated because the underlying data or code changed or was lost. Modern data architecture and packaging directly address this by emphasizing lineage, versioning, and auditability.

Data packages make reproducibility achievable. Because every package is versioned and immutable, an analysis can reference a specific package version and anyone later can retrieve that exact same data. This is analogous to pinning a software experiment to specific library versions. Moreover, through metadata and attestations, one can record the entire lineage: which raw data went in, what transformations were applied, and so on. If questions arise about a result, an auditor can trace back from the final data package through each intermediate step (using the signed attestations) to the original raw sources – providing a data provenance trail.
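A minimal way to pin a package version is to resolve its mutable tag to an immutable digest and record that digest alongside the analysis. The sketch below does this against a hypothetical OCI registry using the standard manifest endpoint; authentication is omitted for brevity.

```python
import hashlib

import requests

# Registry URL, repository, and tag are hypothetical; auth is omitted.
REGISTRY = "https://registry.example.com"
REPO = "datasets/city-air-quality"
TAG = "1.2.0"

resp = requests.get(
    f"{REGISTRY}/v2/{REPO}/manifests/{TAG}",
    headers={"Accept": "application/vnd.oci.image.manifest.v1+json"},
    timeout=30,
)
resp.raise_for_status()

# The digest of the manifest bytes is the content address of the package;
# recording it pins the analysis to this exact data, regardless of where the tag moves.
pinned = f"{REPO}@sha256:{hashlib.sha256(resp.content).hexdigest()}"
print("Record this reference alongside the analysis:", pinned)
```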

Academic and scientific communities have been early adopters of these ideas. For instance, data repositories now publish datasets with DOIs and immutable versions to support reproducible research. These are precisely the practices the data industry is now embracing: by systematically versioning data and capturing metadata at each step, we create the conditions for any analysis to be reproduced by others (or by automated systems) later.

Lineage and auditability go hand in hand with reproducibility. From a governance perspective, it’s important to know where data came from (lineage) and who or what touched it along the way (audit trail). With packaged data units, lineage can be recorded at a fine grain: each table or file can carry references to its parent sources and the transformation job that produced it. Modern metadata catalogs often store this information, allowing for lineage graphs to be visualized. This not only helps in reproducing results but also in impact analysis – if a source data package is found faulty, lineage helps find all downstream results that need updates.
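A lineage store does not need to be elaborate to be useful. The sketch below keeps one record per package version, naming its parents and the job that produced it (all identifiers are hypothetical), and walks the graph downstream for impact analysis.

```python
# A minimal lineage store: each package version records its direct parents and
# the job that produced it. Identifiers are hypothetical.
lineage = {
    "city-air-quality@1.2.0": {
        "parents": ["raw-sensor-feed@2024-06"],
        "job": "etl/aggregate_air_quality.py@v3.1",
    },
    "air-quality-dashboard@7": {
        "parents": ["city-air-quality@1.2.0"],
        "job": "reports/build_dashboard.py@v1.4",
    },
}


def downstream_of(package: str) -> set[str]:
    """Impact analysis: find every package that transitively depends on `package`."""
    impacted: set[str] = set()
    frontier = [package]
    while frontier:
        current = frontier.pop()
        for name, record in lineage.items():
            if current in record["parents"] and name not in impacted:
                impacted.add(name)
                frontier.append(name)
    return impacted


print(downstream_of("raw-sensor-feed@2024-06"))
# -> {'city-air-quality@1.2.0', 'air-quality-dashboard@7'}
```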

The emphasis on reproducibility is also motivated by the needs of regulated industries and AI ethics. Regulations may require demonstrating how a reported number was computed, or how a machine learning model’s training data was gathered and processed. By adhering to these practices (versioned data packages, recorded lineage), such audits become significantly easier. The ultimate vision is that any data-driven insight can be traced, verified, and if necessary, recreated – instilling greater confidence internally and externally.

Metadata Catalogs Across Domains: The STAC Example

While the concepts of data packaging and cataloging are universal, different domains often require domain-specific metadata and standards. A key point is that metadata catalogs are often tailored to the domain’s needs, yet they share the same structural goals: to make data discoverable, understandable, and usable across stakeholders. One illustrative example is the SpatioTemporal Asset Catalog (STAC) in the Earth Observation domain.

STAC is an open specification designed to standardize how geospatial imagery and related assets are described. Satellite imagery datasets have unique metadata needs – such as geospatial extent (coordinates, area of coverage), temporal information (acquisition date/time), sensor details, resolution, etc. Historically, each space agency or provider might expose this metadata differently, making it hard for users to search across multiple sources. STAC emerged to solve this by providing a common language to describe geospatial data so it can be easily indexed and discovered. In STAC, each item (e.g., a satellite image or mosaic) is a JSON document with standardized fields for location, time, and other properties. These items are organized into catalogs and collections that any STAC-compliant tool can crawl or query.
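For a feel of what this looks like in code, the sketch below uses the pystac library to build one Item and add it to a catalog; the scene id, footprint, and asset location are made up for illustration.

```python
from datetime import datetime, timezone

import pystac

# A STAC Item is a JSON document with standardized spatial and temporal fields;
# the scene id, footprint, and asset URL below are hypothetical.
item = pystac.Item(
    id="scene-20230615-ca",
    geometry={
        "type": "Polygon",
        "coordinates": [[[-122.6, 37.2], [-121.6, 37.2], [-121.6, 38.0],
                         [-122.6, 38.0], [-122.6, 37.2]]],
    },
    bbox=[-122.6, 37.2, -121.6, 38.0],
    datetime=datetime(2023, 6, 15, 18, 30, tzinfo=timezone.utc),
    properties={"eo:cloud_cover": 4.2},
)
item.add_asset(
    "visual",
    pystac.Asset(href="s3://imagery-bucket/scene-20230615-ca.tif",
                 media_type=pystac.MediaType.COG),
)

# Items are grouped into catalogs/collections that any STAC-aware tool can crawl.
catalog = pystac.Catalog(id="demo-imagery", description="Illustrative imagery catalog")
catalog.add_item(item)
print(item.to_dict()["properties"])
```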

The structure of STAC is analogous to a generic data catalog: there’s a catalog (which can be a hierarchy of datasets) containing collections (groupings of items, like all images from a satellite mission), and within those, items (individual assets, each with metadata and links to the actual data files). This is very similar to how a general data lakehouse catalog might organize databases, tables, and files – except tuned to the domain’s concepts. STAC’s success demonstrates that when a community agrees on a metadata schema and API, it dramatically improves data findability and interoperability. For example, a scientist can use a single STAC API query to find “all satellite images over California in June 2023 with less than 10% cloud cover,” even if those images come from different satellites or providers, because they all share the STAC metadata format. Because this metadata is bundled and distributed together with the data, STAC enables the “domain-specific data package” approach for Earth observation: each image or dataset is described in a uniform way and can thus be treated as a portable asset.
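That same query can be sketched with the pystac-client library; the API endpoint and collection name below are assumptions and would vary by provider.

```python
from pystac_client import Client

# Endpoint and collection are assumptions; any STAC API with the query extension would do.
client = Client.open("https://earth-search.aws.element84.com/v1")
search = client.search(
    collections=["sentinel-2-l2a"],
    bbox=[-124.4, 32.5, -114.1, 42.0],          # roughly California
    datetime="2023-06-01/2023-06-30",
    query={"eo:cloud_cover": {"lt": 10}},
    max_items=20,
)
for item in search.items():
    print(item.id, item.properties.get("eo:cloud_cover"))
```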

The STAC example generalizes to other domains: whether it’s healthcare, finance, or genomics, each domain may introduce specialized metadata (e.g., patient anonymization codes, financial instrument identifiers, gene ontology terms). These can often be incorporated as extensions or additional fields on top of a core metadata schema. The important point is that the goals remain the same: to package data with sufficient metadata and standardized structure so that others in the domain can easily find and use it. Domain-specific catalogs still align with the lakehouse and data package philosophy – they often use open formats (JSON, CSV, etc.), emphasize unique identifiers and versioning, and encourage open APIs for search. In fact, STAC’s design includes an API and browser, showing how a catalog can enable both human and machine discovery of data assets.

Another insight from STAC is the use of extensions – it has a mechanism to extend the core spec with additional fields (for example, a “satellite” extension might add orbit details). This echoes the idea that data packages and catalogs in general should be flexible and extensible to accommodate new requirements while staying interoperable. For organizations implementing their own internal data catalogs, a lesson is to follow a consistent structure (perhaps inspired by open standards) but allow per-domain augmentation in a controlled way.

In summary, domain-specific metadata catalogs like STAC demonstrate that while the content of metadata may differ, the architecture of providing self-describing, searchable, shareable data packages is broadly applicable. The lakehouse’s catalog and open storage format approach can incorporate such standards, enabling, for example, an enterprise to manage a STAC-compliant catalog for imagery alongside other catalogs for tabular data – all part of one unified data platform. The outcome is improved data discovery and reuse, within and across domains.

Conclusion

The convergence of these trends – lakehouse architectures, data package standards, container-based distribution, and supply chain security – is reshaping how organizations handle data. The move from raw data sharing to data package sharing is a move towards greater structure, trust, and agility in data ecosystems. By decoupling storage, compute, and catalog, the lakehouse provides the flexible canvas on which data packages can be created and shared without friction. By packaging data with metadata (and even code or models), we create units of data that are self-contained and meaningful, ready for consumption by AI models or analytical tools with minimal prep work.

In embracing OCI artifacts for data, we leverage a proven, scalable distribution mechanism that treats data as a first-class citizen. This unification means that a single pipeline can handle code, configuration, and data in similar ways – with common tooling, security scans, and deployments. Moreover, applying software supply chain principles to data injects much-needed trust: we can know who produced a dataset, how, and whether it’s been altered – answering questions that are vital for compliance and for internal quality control.

The emphasis on reproducibility and lineage ensures that as we derive insights or train ML models, we can always retrace our steps. Reproducibility is not just an academic ideal but a practical necessity when models might be audited or results need verification. Data packages with strict versioning and provenance make this feasible, turning “rehydrating” an analysis into a straightforward task instead of a detective mission.

Finally, recognizing that one size does not fit all, we acknowledge the need for domain-specific metadata and catalogs. The key is to follow common structural principles (like STAC does for geospatial data) so that even specialized datasets adhere to the overarching goal of being findable, accessible, interoperable, and reusable.

In a world increasingly driven by data and AI, these approaches lay the groundwork for a more robust data ecosystem. Organizations that adopt data packages and the associated practices will find that their data is not only easier to manage and govern, but also primed to unlock value – analysts spend less time cleaning and more time exploring, models train on known-good data, and data sharing with partners becomes safer and simpler. The journey from raw data to packaged, trustworthy data products is a journey to treating data with the same respect as code: with rigor, automation, and an eye towards collaboration. It is a journey well worth undertaking for any data-driven enterprise striving for excellence in the era of Data & AI.