Pressed for time? We’ve got you covered with a 2-minute summary of the highlights of this article:
What is a data dictionary? A data dictionary can be defined as a collection of metadata such as object name, data type, size, classification, and relationships with other data assets. A data dictionary acts as a reference guide on a dataset.
80% of a data scientist’s valuable time is spent simply finding, cleaning, and organizing data, leaving only 20% to perform analysis, according to HBR.
That’s where a repository of all data assets — column descriptions, metrics, measurement units, and more — can help. That is the purpose of the data dictionary.
Here, we’ll explore the fundamentals of a data dictionary, its examples, templates, best practices, and an action plan to build it; plus an understanding of tools that can help.
A data dictionary is a collection of metadata such as object name, data type, size, classification, and relationships with other data assets. Think of it as a list along with a description of tables, fields, and columns. The primary goal of a data dictionary is to help data teams understand data assets.
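To make the definition concrete, here is a minimal sketch of what a single data dictionary entry could look like in Python. The field names follow the metadata listed above; the `DictionaryEntry` class, the `classification` default, and the insurance-style sample rows are assumptions made for illustration, not a standard schema.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class DictionaryEntry:
    """One data dictionary row: metadata about a single column (hypothetical schema)."""
    object_name: str                      # table or dataset the column belongs to
    column_name: str
    data_type: str
    size: Optional[int] = None            # maximum length, where it applies
    classification: str = "internal"      # assumed sensitivity label
    description: str = ""
    related_to: List[str] = field(default_factory=list)  # links to other data assets


# Sample entries; the tables and descriptions are made up for the example.
entries = [
    DictionaryEntry("policies", "policy_id", "VARCHAR", 36,
                    description="Unique identifier of a policy"),
    DictionaryEntry("claims", "policy_id", "VARCHAR", 36,
                    description="Policy the claim was filed against",
                    related_to=["policies.policy_id"]),
]

for entry in entries:
    print(asdict(entry))
```

Keeping the relationship as a plain `table.column` string is a deliberate simplification; a real tool would likely resolve it to an actual asset.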
According to IBM’s Computer Terminology Dictionary, a data dictionary is a “centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format. It assists management, database administrators, system analysts, and application programmers in planning, controlling, and evaluating the collection, storage, and use of data.”
Another definition describes a data dictionary as “software in which metadata is stored, manipulated, and defined.”
A data dictionary is used by data administrators, analysts, and engineers to understand and trust data assets. It helps in the creation of authentic, transparent, and consistent data throughout the organization.
According to data governance coach Nicola Askham, an organization can have multiple data dictionaries because each one describes the system that hosts or holds a set of data assets. So, each data source (a warehouse, lake, or lakehouse) will have its own data dictionary.
An enterprise data dictionary is a compilation of metadata such as object name, data type, size, classification, and relationships with other data assets. It can also include business metadata such as the definition, associated business terminology, and metrics. The goal of an enterprise data dictionary is to help business teams understand and use a data set easily.
According to a veteran technical business analyst, the enterprise data dictionary is “the key for any company looking to connect the dots for all users.”
While exploring the concept of a data dictionary, you’ll come across other terms such as data catalog, data glossary, and business glossary. So, let’s look into the differences between these terms before delving into the components of a data dictionary.
A business glossary (also known as a data glossary) covers the business terminology or concepts for an entire organization. The goal is to define a common vocabulary of terms for an enterprise.
The glossary includes a more descriptive name and detailed description of each term, along with possible aliases. In some cases, it also covers specific business rules for defining a term.
Unlike the data dictionary, there can only be one business glossary for an entire organization. Think of it as a common language or a way to talk about the data consistently in an organization.
The business glossary is considered to be a prerequisite for any data governance program and should be available before you start building a data dictionary.
A business glossary is a centralized repository of business terms, KPIs, metrics, and definitions. Image by Atlan.
A data catalog handles the indexing, inventorying, and classification of data assets across multiple data sources in an organization. Modern data catalogs offer rich context on data by crawling data dictionaries and the business glossary for technical, business, and operational metadata.
Crawling all kinds of metadata also helps data catalogs visualize data flow and its lifecycle — the origins, transformations, and upstream and downstream dependencies. Think of it as a platform that tells you the story of each data set.
Additionally, data catalogs also serve as the workspace for collaboration on data.
Both data dictionaries and the business glossary are considered to be integral parts of the modern data catalog.
A data catalog helps users search, discover, understand, and trust data assets in an organization. Image by Atlan.
According to the USGS (US Geological Survey), a data dictionary can include:
Additionally, the data dictionary should also include information on:
Now, let’s look at some examples of data dictionaries.
A data dictionary can be as simple as a table maintained in a spreadsheet or PDF, or as elaborate as a full-fledged web application.
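For the spreadsheet end of that spectrum, a rough sketch using Python's standard `csv` module shows how little is needed to get started. The column headers in `FIELDS`, the helper names `to_csv` and `from_csv`, and the sample rows are assumptions chosen for the example.

```python
import csv
import io

# Assumed column headers for a spreadsheet-style data dictionary.
FIELDS = ["table", "column", "data_type", "unit", "description"]

# Illustrative rows only.
rows = [
    {"table": "policies", "column": "policy_id", "data_type": "string",
     "unit": "", "description": "Unique identifier of a policy"},
    {"table": "policies", "column": "policy_expiration_date", "data_type": "date",
     "unit": "", "description": "Last day the policy is in force"},
]


def to_csv(entries):
    """Serialize dictionary rows to CSV text that any spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def from_csv(text):
    """Read the CSV text back into a list of plain dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


csv_text = to_csv(rows)
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of file paths; swapping `io.StringIO` for an opened file is the only change needed to save it to disk.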
A good example of a data dictionary is the one used by ORNL (Oak Ridge National Laboratory).
ORNL maintains this dictionary as a PDF and it resembles a detailed index at the end of a book. The document provides basic information (entry type and description) on each entry, called a variable.
What does a data dictionary include? Example from The Human Health Risk Assessment Data Dictionary. Source: ORNL.
The next example is from NASA’s PDS (Planetary Data System). The PDS data dictionary is a web page with a search bar and a listing of all the entries, called attributes. The website lets you filter the search results to speed up your research.
You can click on each attribute to understand it further. The details include technical metadata such as name, data type, the owner (i.e., Registered By), and identifiers for version, registration, authority, etc.
It also contains metrics and data quality indicators such as minimum and maximum values and the unit of measure. Any researcher can look up the terms they need using these dictionaries to make sense of their planetary data.
Example of what constitutes a data dictionary. The PDS Data Dictionary. Source: NASA’s Planetary Data System.
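Recording the minimum, maximum, and unit for an attribute also means values can be checked against the dictionary. The sketch below assumes a simplified `Attribute` record and a hypothetical `exposure_duration` attribute; it is not the PDS schema or an entry copied from it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    """Simplified attribute definition (assumed fields, not the PDS format)."""
    name: str
    data_type: type
    unit: str
    minimum: float
    maximum: float


def check(attribute, value):
    """Return a list of problems; an empty list means the value passes."""
    if not isinstance(value, attribute.data_type):
        return [f"{attribute.name}: expected {attribute.data_type.__name__}, "
                f"got {type(value).__name__}"]
    problems = []
    if value < attribute.minimum:
        problems.append(f"{attribute.name}: {value} {attribute.unit} is below "
                        f"the minimum {attribute.minimum}")
    if value > attribute.maximum:
        problems.append(f"{attribute.name}: {value} {attribute.unit} is above "
                        f"the maximum {attribute.maximum}")
    return problems


# Hypothetical attribute, made up for the example.
exposure = Attribute("exposure_duration", float, "s", 0.0, 3600.0)

print(check(exposure, 12.5))    # within range
print(check(exposure, -1.0))    # below the minimum
print(check(exposure, "12.5"))  # wrong type
```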
A data dictionary documents data assets with relevant context, making it easy to use, analyze, and discuss data across teams. The biggest benefits of using a data dictionary include:
Modern data platforms automatically generate data quality metrics and statistics so that you can understand the quality of your data at a glance.
Since the data dictionary displays descriptive statistics — minimum, maximum, count, frequency, mean, and median — spotting anomalies in data becomes easy. This helps you avoid inaccuracies or inconsistencies in data.
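As a small illustration of how such statistics surface anomalies, the following sketch profiles a single numeric column with Python's standard library. The `profile` function and the sample `ages` values are assumptions for the example, and the function expects at least one non-missing value.

```python
from collections import Counter
from statistics import mean, median


def profile(values):
    """Summarize a numeric column, ignoring missing (None) values.

    Assumes at least one value is present.
    """
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
        # (value, frequency) of the most frequent value
        "most_common": Counter(present).most_common(1)[0],
    }


# Made-up column of ages with one missing value and one likely typo (380).
ages = [34, 41, 29, None, 41, 380]
summary = profile(ages)

for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the maximum of 380 and the gap between the mean (105) and the median (41) both point at the suspicious entry, which is the kind of signal a reader gets from the statistics at a glance.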
As mentioned earlier, a data dictionary offers context by documenting metadata as well as data sources/origins, owners, creation dates, and so on. This helps you validate each data set and make sure the information you have is reliable, which makes your decision-making more accurate.
Additionally, modern data platforms such as data catalogs let you visualize the overall data flow, making it easier to see how your transformations affect upstream or downstream applications.
Whenever you can’t verify the credibility of a data set, modern data dictionaries let you discuss that data and share it (with just a link or via Slack) with the right people.
Visualize the complete journey of your data asset from source to BI tools. Source: Atlan
If done right, a data dictionary can help you establish certain ground rules for collecting, documenting, and using data. This, in turn, simplifies regulatory compliance.
Since the data dictionary contains all the technical metadata, you can spot which teams or business units aren’t managing their data assets properly and fix those bad data practices.
As mentioned earlier, the data dictionary equips everyone in your organization with a common repository for data definitions, standards, metrics, and more.
So, everyone understands what any element within a data set means without having to consult an expert. This reduces dependencies, helps everyone use the data in the same way, and makes onboarding a breeze.
The purpose of a dictionary is to help you avoid asking questions such as “what does this variable name mean?” or “what is the ideal value for this field?”
That’s why the OSF (Open Science Framework, run by the Center for Open Science) recommends that your data dictionary should contain:
To ensure that each variable contains the above information, you can ask your teams the following questions:
The researchers at Smithsonian adopt the following best practices to define and describe attributes within the data dictionary:
Their process is a useful example of how strong data teams document their data. Let’s explore this in detail.
Each data dictionary should offer basic information about the dataset. This should include:
The Smithsonian recommends that you follow “the conventions of your discipline when choosing standardized terms or when structuring your data.” This practice also comes in handy with compliance audits.
You should provide a complete definition for each component of the dataset. Next, offer a description that contains the following information:
Versioning the file lets you keep track of changes over time. Versioning is largely handled for you if you keep the file in a version control system such as Git, or in a wiki or a platform such as ArcGIS that tracks revisions.
Make sure that your versioning includes descriptions of the changes made — the details of the editors, date and time, elements changed, and so on.
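If your tooling does not capture these details for you, a change log can be sketched in a few lines. The `ChangeRecord` and `ChangeLog` classes, the editor names, and the timestamps below are all hypothetical, meant only to show which fields are worth recording.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChangeRecord:
    """One change to the data dictionary (assumed fields)."""
    version: int
    editor: str
    changed_at: datetime
    element: str      # the entry that was changed
    summary: str      # what changed and why


class ChangeLog:
    """Append-only list of changes with an incrementing version number."""

    def __init__(self):
        self._records = []

    def record(self, editor, element, summary, changed_at=None):
        entry = ChangeRecord(
            version=len(self._records) + 1,
            editor=editor,
            changed_at=changed_at or datetime.now(timezone.utc),
            element=element,
            summary=summary,
        )
        self._records.append(entry)
        return entry

    def history(self, element):
        """All changes made to one element, oldest first."""
        return [r for r in self._records if r.element == element]


# Hypothetical usage with fixed timestamps.
log = ChangeLog()
log.record("asha", "policy_id", "Added description",
           datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc))
log.record("ben", "policy_expiration_date", "Changed type from string to date",
           datetime(2024, 2, 2, 14, 0, tzinfo=timezone.utc))
log.record("asha", "policy_id", "Marked as primary key",
           datetime(2024, 3, 1, 11, 15, tzinfo=timezone.utc))

for change in log.history("policy_id"):
    print(change.version, change.editor, change.changed_at.isoformat(), change.summary)
```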
Since several data dictionary tools (both enterprise and open-source) are available, we’ll focus on the capabilities you should look for in a tool.
To begin with, the data dictionary should define all technical terms from data tables or data models — for example, policy_expiration_date, and policy_id. Each of these terms should be linked to tables/dashboards so that data teams can find the information they need faster.
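One way to picture this linking is as an index from each term to the assets that use it. The sketch below is only an illustration: the `USAGES` records, the table and dashboard names, and the `build_index` helper are assumptions, while `policy_id` and `policy_expiration_date` come from the example above.

```python
from collections import defaultdict

# Hypothetical usage records: (term, asset kind, asset name).
USAGES = [
    ("policy_id", "table", "policies"),
    ("policy_id", "table", "claims"),
    ("policy_id", "dashboard", "Renewals Overview"),
    ("policy_expiration_date", "table", "policies"),
    ("policy_expiration_date", "dashboard", "Renewals Overview"),
]


def build_index(usages):
    """Group the assets that use each term by asset kind."""
    index = defaultdict(lambda: {"table": [], "dashboard": []})
    for term, kind, asset in usages:
        index[term][kind].append(asset)
    return dict(index)


index = build_index(USAGES)

for term, assets in index.items():
    print(term)
    print("  tables:    ", ", ".join(assets["table"]))
    print("  dashboards:", ", ".join(assets["dashboard"]))
```

With an index like this, answering "where is `policy_id` used?" becomes a single lookup rather than a search through every table and dashboard.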
The data dictionary tool should also allow you to set up data definitions and descriptions as mentioned under the best practices listed above.
Additionally, a solid data dictionary for modern data teams should have the ability to:
Collecting vast amounts of data is only useful if you can interpret or analyze it. A data dictionary is like a README that documents everything you need to know about a dataset in order to use it for further analysis.
If you are looking to build a modern data dictionary, take Atlan for a spin. Atlan is more than a standard data dictionary. It’s a third-generation modern data catalog built on the framework of embedded collaboration, borrowing principles from GitHub, Figma, Slack, Notion, Superhuman, and other modern tools that are commonplace today.