Data Sets


Simplifying How Organizations Define, Discover, and Transfer Data

As organizations generate larger volumes of research, operational, AI, and business data, managing how people identify and reference that data becomes increasingly important.

Historically, data collections have often been referenced by technical storage locations, cloud platforms, folder structures, bucket names, or file paths. While this may work for technical teams, it creates confusion for researchers, analysts, business users, and external collaborators who simply need access to the correct data.

MLADU Data Sets

MLADU Data Sets solve this challenge by providing a standardized way to define, organize, and reference collections of data regardless of where the data resides.

By separating data identification from technical connectivity, MLADU helps organizations transfer data faster, reduce errors, and improve collaboration across internal teams and external partners.

What Are MLADU Data Sets?

A MLADU Data Set represents a collection of data that exists at a specific Data Station.

A Data Set may contain:

  • Individual files
  • Groups of files
  • Folder structures
  • Research datasets
  • Clinical trial data
  • AI training datasets
  • Financial records
  • Operational data collections

Rather than identifying data by its storage location, MLADU allows organizations to define the data collection itself as a uniquely identifiable asset.

Each Data Set is assigned:

  • A unique identifier
  • A descriptive name
  • A detailed description
  • Classification information
  • Additional metadata

This allows users to quickly locate and reference the correct collection of data without needing to understand how or where the data is stored.
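
As a rough illustration, the attributes listed above could be modeled as a simple record. This is a minimal sketch, not MLADU's actual schema: the class name `DataSet`, its field names, the `matches` helper, and the sample descriptions and classifications are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataSet:
    """Hypothetical record for a Data Set; field names are assumptions."""
    data_set_id: str      # unique identifier
    name: str             # descriptive name
    description: str      # detailed description
    classification: str   # classification information
    metadata: dict = field(default_factory=dict)  # additional metadata

    def matches(self, term: str) -> bool:
        """Case-insensitive search over the name and description only."""
        needle = term.lower()
        return needle in self.name.lower() or needle in self.description.lower()


# Placeholder catalog entries; descriptions and classifications are invented.
catalog = [
    DataSet("ds-001", "Cancer Genomics March 2026",
            "Tumor sequencing results", "research"),
    DataSet("ds-002", "RNA Sequencing Results Q1 2026",
            "Bulk sequencing output", "research"),
]

# Users search by what the data is, not by where it is stored.
hits = [ds.name for ds in catalog if ds.matches("rna")]
print(hits)
```

Note that nothing in the record refers to a bucket, container, or file path; in this sketch the storage details would live elsewhere.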

Data Sets and Data Stations Work Together

A Data Set exists at a Data Station, but it serves a different purpose.

A Data Station defines how MLADU connects to a data storage platform or environment.

A Data Set defines the data collection itself.

Think of it this way:

  • Data Station = How you access the environment
  • Data Set = The data you want

This separation creates a more intuitive experience for users while allowing technical administrators to manage connectivity independently.

Example: Multiple Researchers Using the Same Data Station

Imagine several researchers working within the same AWS S3 environment.

They all access the same Data Station:

Data Station

  • Research Genomics Repository

However, they use different Data Sets:

Data Set #1

  • Cancer Genomics March 2026

Data Set #2

  • RNA Sequencing Results Q1 2026

Data Set #3

  • Clinical Biomarker Validation Study

Although all of the data resides within the same storage platform, the Data Sets let users distinguish clearly between the different collections.
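
The relationship in this example can be sketched in a few lines. The sketch is illustrative only: `DataStation`, `DataSet`, the `station_id` field, the identifiers, and the bucket URI are hypothetical and do not describe MLADU's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStation:
    """How to reach a storage environment (hypothetical fields)."""
    station_id: str
    name: str
    connection_uri: str  # placeholder value, managed by administrators


@dataclass(frozen=True)
class DataSet:
    """What the data is; it refers to its station only by ID."""
    data_set_id: str
    name: str
    station_id: str


station = DataStation("st-1", "Research Genomics Repository",
                      "s3://example-bucket")  # assumed URI

data_sets = [
    DataSet("ds-1", "Cancer Genomics March 2026", "st-1"),
    DataSet("ds-2", "RNA Sequencing Results Q1 2026", "st-1"),
    DataSet("ds-3", "Clinical Biomarker Validation Study", "st-1"),
]

# All three collections share one station, yet each is addressed by name.
names_at_station = [ds.name for ds in data_sets
                    if ds.station_id == station.station_id]
print(names_at_station)
```

Because the connection details sit only on the station record, an administrator could change them without touching any Data Set definition.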

Example: CRO Data Delivery

Contract Research Organizations (CROs) frequently deliver data to sponsors and clients.

Instead of asking clients to locate data within a storage platform, the CRO can define and publish a clearly named Data Set such as:

Data Set Name

  • Phase III Trial Results March 2026

Description

  • Final cleaned clinical trial dataset including patient outcomes and laboratory results.

The client can immediately identify the correct dataset without needing to understand the underlying storage architecture.

Why Data Sets Are Critical for Speed

MLADU Data Sets were designed around a simple principle:

People who need data should not have to think about technical connectivity, infrastructure, integration platforms, or security configurations.

They simply want access to the correct data as quickly as possible.

Simplifying Data Identification

As organizations adopt more cloud platforms, AI systems, research repositories, and analytics environments, it becomes increasingly difficult to identify data based solely on technical storage locations.

Consider the difference between these two references:

Traditional Reference

Azure Storage Container XYZ, Folder ABC, Snapshot March 2026

MLADU Reference

XYZ March 2026

The second example is dramatically easier to understand, communicate, and use.

Eliminating Naming Confusion

One of the most common challenges in large organizations is that different teams often refer to the same data collection using different names.

Examples include:

  • Trial Data
  • March Trial Data
  • Final Trial Data
  • Trial Dataset Version 3
  • Production Trial Export

All of these may refer to the same information.

MLADU Data Sets create a single authoritative definition that everyone can reference consistently.

Faster Data Transfers

Because users can quickly identify the correct Data Set, transfer setup becomes significantly faster.

Instead of navigating technical paths and verifying storage locations, users simply select the desired Data Set and proceed with the transfer.

Over hundreds or thousands of transfers, these efficiencies save substantial time and reduce operational complexity.

Who Can Control Data Sets?

MLADU uses role-based access controls to ensure Data Sets are managed consistently and securely.

Only two roles, Data Owner and Portal Admin, can modify Data Set definitions and metadata.

Data Owner

The Data Owner is responsible for managing the Data Set catalog within the organization.

Think of the Data Owner as a data steward, curator, or chief librarian.

Responsibilities include:

  • Naming Data Sets
  • Creating descriptions
  • Categorizing information
  • Maintaining metadata
  • Ensuring consistency
  • Managing visibility settings

The Data Owner helps ensure that users can easily locate and understand available data collections.

Portal Admin

The Portal Admin serves as the MLADU account super user.

As a backup administrator, the Portal Admin can also create, modify, and manage Data Sets when necessary.

Standard Users

All other MLADU users are limited to viewing:

  • Data Set ID
  • Data Set Name
  • Data Set Description

This approach allows users to discover and leverage available data while maintaining centralized governance.
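
The role rules above amount to a small permission check. The following is a minimal sketch under stated assumptions: the `Role` enum, the function names, and the dictionary keys are hypothetical, and the idea that editors see the full record is an inference rather than documented behavior.

```python
from enum import Enum


class Role(Enum):
    DATA_OWNER = "data_owner"
    PORTAL_ADMIN = "portal_admin"
    STANDARD_USER = "standard_user"


# Only these two roles may change Data Set definitions and metadata.
EDITOR_ROLES = {Role.DATA_OWNER, Role.PORTAL_ADMIN}

# Fields that every user may view.
VIEWABLE_FIELDS = ("data_set_id", "name", "description")


def can_modify(role: Role) -> bool:
    return role in EDITOR_ROLES


def visible_view(record: dict, role: Role) -> dict:
    """Return a copy of the full record for editors (an assumption),
    and only the three viewable fields for everyone else."""
    if can_modify(role):
        return dict(record)
    return {key: record[key] for key in VIEWABLE_FIELDS}


record = {
    "data_set_id": "ds-1",  # hypothetical identifier
    "name": "Phase III Trial Results March 2026",
    "description": "Final cleaned clinical trial dataset including "
                   "patient outcomes and laboratory results.",
    "classification": "clinical",  # hypothetical value
}

print(sorted(visible_view(record, Role.STANDARD_USER)))
```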

Complete Audit Logging

To support governance and compliance requirements, MLADU maintains a detailed audit history of Data Set modifications.

Organizations can track:

  • Who made a change
  • What changed
  • When the change occurred

This visibility supports regulatory compliance, operational accountability, and data governance initiatives.
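
To make the who, what, and when concrete, here is one way an audit entry could be recorded. This is a sketch, not MLADU's log format: `AuditEntry`, `rename_data_set`, the field names, and the e-mail address are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    actor: str            # who made the change
    field_name: str       # what changed
    old_value: str
    new_value: str
    changed_at: datetime  # when the change occurred (UTC)


def rename_data_set(record: dict, new_name: str, actor: str, log: list) -> None:
    """Update the name and append an audit entry describing the change."""
    entry = AuditEntry(actor, "name", record["name"], new_name,
                       datetime.now(timezone.utc))
    record["name"] = new_name
    log.append(entry)


audit_log = []
record = {"data_set_id": "ds-1", "name": "Trial Data"}
rename_data_set(record, "Phase III Trial Results March 2026",
                "data.owner@example.com", audit_log)

print(audit_log[0].actor, audit_log[0].old_value, "->", audit_log[0].new_value)
```

Entries are immutable (`frozen=True`) and only ever appended, which is a common choice for audit trails.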

Public vs. Private Data Sets

MLADU provides flexible visibility controls to support both internal collaboration and external data sharing.

Private Data Sets

By default, all Data Sets are private.

Private Data Sets are visible only within your MLADU organization and can be selected by authorized users during transfer creation.

This model supports most enterprise and research use cases while maintaining strong control over data visibility.

Public Data Sets

Organizations that distribute data to external parties can choose to make Data Sets publicly available.

For a Data Set to be public, it must exist at a Data Station that is also configured as public.

Once a Data Set is made public, any MLADU user can discover it by viewing the associated public Data Station.

External users can see:

  • Data Set ID
  • Data Set Name
  • Data Set Description

This allows organizations to advertise available data collections without exposing technical connectivity details.
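
The visibility rule can be expressed as a short check. The sketch below is an assumption-laden illustration: the class names, the `is_public` flags, and the station names are hypothetical, and MLADU may well reject a public Data Set at a private Data Station outright rather than hide it as this sketch does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Station:
    name: str
    is_public: bool


@dataclass(frozen=True)
class DataSet:
    data_set_id: str
    name: str
    description: str
    station: Station
    is_public: bool = False  # Data Sets are private by default


def discoverable_externally(ds: DataSet) -> bool:
    """Public only when both the Data Set and its Data Station are public."""
    return ds.is_public and ds.station.is_public


def external_listing(data_sets: list) -> list:
    """Expose only ID, name, and description; no connectivity details."""
    return [
        {"id": ds.data_set_id, "name": ds.name, "description": ds.description}
        for ds in data_sets
        if discoverable_externally(ds)
    ]


public_station = Station("Consortium Repository", True)   # hypothetical
private_station = Station("Internal Lab Storage", False)  # hypothetical

data_sets = [
    DataSet("ds-10", "Rare Disease Genome Study 2026",
            "Approved genomic sequencing dataset available for "
            "participating institutions.", public_station, is_public=True),
    DataSet("ds-11", "Internal QC Runs", "Placeholder description",
            private_station, is_public=True),  # station is private
    DataSet("ds-12", "Draft Collection", "Placeholder description",
            public_station),                   # left private by default
]

print([item["name"] for item in external_listing(data_sets)])
```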

Ideal for Data Vendors and Research Organizations

Public Data Sets are particularly valuable for:

  • Contract Research Organizations (CROs)
  • Research consortiums
  • Data marketplaces
  • Scientific collaborations
  • Public research initiatives
  • Commercial data providers

For example, a research consortium could publish:

Data Set Name

  • Rare Disease Genome Study 2026

Description

  • Approved genomic sequencing dataset available for participating institutions.

Researchers can easily discover and request the appropriate data without requiring manual coordination.

Available Data Sets

The number of Data Sets available to your organization is determined by two factors.

1. Your MLADU Subscription Plan

Every MLADU subscription includes a predefined number of Data Sets.

This allows organizations to start with a cost-effective solution that meets their initial requirements.

2. Additional Data Set Capacity

Organizations with larger data catalogs can purchase additional Data Sets as needed.

This flexible approach allows customers to expand their data catalog without overpaying for unused capacity.
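
Put as arithmetic, the available capacity is the plan allowance plus any purchased additions. The function names and the numbers below are hypothetical; actual plan sizes are not stated here.

```python
def data_set_capacity(plan_included: int, purchased_extra: int = 0) -> int:
    """Total Data Sets available: plan allowance plus purchased additions."""
    if plan_included < 0 or purchased_extra < 0:
        raise ValueError("capacity values must be non-negative")
    return plan_included + purchased_extra


def remaining_slots(capacity: int, in_use: int) -> int:
    """Data Sets that can still be defined, never below zero."""
    return max(capacity - in_use, 0)


# Illustrative numbers only.
capacity = data_set_capacity(plan_included=25, purchased_extra=10)
print(capacity, remaining_slots(capacity, in_use=30))
```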

Whether you manage dozens of datasets or tens of thousands of research collections, MLADU provides a scalable framework for organizing and transferring data efficiently.

Build a Better Data Catalog with MLADU

MLADU Data Sets provide a smarter way to organize, identify, and transfer data.

By separating data collections from technical connectivity, organizations can improve user experiences, reduce confusion, accelerate data transfers, and strengthen data governance.

Whether you are managing clinical trial data, genomic research, AI training datasets, financial information, or operational records, MLADU Data Sets make it easier for users to find the right data and move it where it needs to go.

Start Your Free Trial Today

Ready to simplify how your organization manages and transfers data?

Start a free MLADU trial and experience how Data Sets and Data Stations work together to create a faster, more secure, and more scalable data transfer platform.

Prefer a guided walkthrough?

Schedule a personalized MLADU demonstration and see how organizations are using MLADU to organize, discover, and transfer terabytes and petabytes of data with confidence.
