The data warehouse is an integral component for your business intelligence team. It lives at the heart of the BI process and serves as the entity that enables the handshake between your company’s data and your company’s reporting and analytics. Its importance cannot be understated, because it is the tool that will most quickly accelerate your BI initiative and ensure your team’s operational success.
A modern data warehouse offers a number of optimizations for analysis and addresses the following considerations:
- Data Warehouse Architecture
- Data Centralization
- Data Standardization
- Data Quality
- Data Refresh
- Dimensional Modeling
- Data Warehouse Design
Data Warehouse Architecture
The primary way in which a data warehouse is different from the relational database that your engineering team is traditionally used to using is that the data warehouse is optimized for data analysis through aggregations rather than data transactions. This becomes immediately obvious when you access the tables at the most user facing end of the data warehouse. You will quickly see that data is duplicated and denormalized, tables are wide and columnar, and the data has been transformed for human readability, rather than for database storage optimization. This type of analytical architecture makes business analytics easier to conduct whereas analysis in the traditional engineering transactional database is more difficult to achieve.
Data Centralization
Another major difference between the engineering database and the data warehouse is that the data warehouse collects and stores the data for all your business processes. Your business likely has data stored across many different platforms and it should all be ingested and processed to live within your data warehouse. This data is likely far more encompassing than just the data you collect within your engineering database. It can include, but is not limited to, sources like:
- On-premise or cloud engineering databases
- Local or cloud files (.csv, .txt, .pdf, etc.)
- Third-party vendor data (Google Analytics, QuickBooks, HubSpot, etc.)
The centralization of all your company’s data within the data warehouse is really the first step to unlocking the ability to conduct effective business intelligence. With all this data collected, the BI team can start to ask and answer questions about relationships spanning across your entire business domain.
Data Standardization
The data standardization process enables your BI team to make data transformations from within your data warehouse. This is the best place within your BI process to perform data transformations because it can easily be built into the process, and it is easy to make changes to that process. When the data lands within your data warehouse through your data pipelines, it is best to make as near a copy of the data source as possible and standardize the transformation process from within your data warehouse. Standardizations here depend on the business need and the business function, but the BI team will likely work closely with the business stakeholders to standardize the readability of all the data flowing through the warehouse.
By the time the data then reaches a user interface through a report or query, it is in a business-friendly format. This standardization can come in many different forms, but it is all about making the data as understandable and insightful to the business user as possible.
Data Quality
Once your BI team starts to work with your company data, it is inevitable that the team will quickly find the data flowing into the data warehouse is not perfect. There will be bad data and that data should be tracked for its quality. The process for tracking data quality involves some form of tagging the data for its consistency, accuracy, and integrity. This will differ depending on the dataset, but it becomes necessary because data standardization is not always possible when dealing with data. Sometimes data is going to flow through that is unexpected, missing, or just plain dirty. Profiling the quality of the data that flows through the data warehouse will help to prevent a situation where dirty data is being examined and acted upon. Business decisions should be made by examining and analyzing high quality data that has gone through a rigorous data quality process.
Data Refresh
The frequency for how often the data is refreshed in your data warehouse is a managed process determined by the business need. If business users want to be able to view the data in near real time, there are options to support this via data streaming. Most data sources only need to be updated daily, weekly, monthly, or at various other frequencies. This refresh process can be managed as a collective effort between your data pipelines, data warehouse, and reporting methods.
Another component of the data refresh that can help you determine how effectively and efficiently data is flowing into and through your data warehouse is some sort of log tracking. This log tracking can be managed from within your data warehouse and your BI team can monitor your data pipeline for efficiency and success rate. This logging can be reported, and it can be used to implement alerting, so the BI team is aware of slow or failing processes. Logging your systems becomes important as the data warehouse grows. It helps the BI team stay aware of failed data refreshes and processes growing slower over time. This system becomes imperative as the data warehouse becomes more automated and knowing the success rate and bottlenecks of your data movement is business critical.
Dimensional Modeling
One of the most defining aspects of business intelligence that is enabled by the data warehouse is dimensional modeling. An in-depth understanding of all aspects of dimensional modeling is complicated – but at its core dimensional modeling is a process that relates your data together for analysis. This modeling process works to create what are known in the BI world as facts and dimensions.
Facts are fields of data that can be aggregated like sales price or sales quantity. Dimensions are fields of data that you can slice aggregations by such as salesclerk or sales date. Facts and dimensions are the building blocks for data marts which are self-contained business processes that are relationships built up out of many facts and dimensions. They describe one business process through the collection of data relating to that process. The data mart is made up of all the different data sources, describing one specific business process, that have been centralized, standardized, and processed through your data warehouse.
Data Warehouse Design
There are many different approaches to an effective data warehouse implementation. Your businesses’ data warehouse can be designed as a single database or a collection of databases. The layering and design of the warehouse, though, is typically at least three-fold consisting of layers for:
- data integration
- data lake
- data mart
This design can be more expansive and differ depending on your approach, but these layers are standards for an effective implementation.
The data integration layer allows the BI team to host a landing target for copies of all your different business datasets. This data should be copied over in its original form so that the data is preserved and is easily modified in the future. This copy also enables the data warehouse to serve as a historical source of truth for the business.
A data lake is a collection of cleaned data that sources itself from the data integration layer through pipes of transformative processes. This data lake layer will have seen the data cleaned into fully human readable form where business familiar terminology is used to make the data more business usable than it may have previously been.
The dart mart layer of your data warehouse is a collection of department or process driven operations that are formed from the data lake layer of the warehouse. These data mart collections are schemas that relate to a single department or business process. For example, you might have a data mart for your finance team that has data relating to store sales. These store sales have various measurements and dimensions associated with them such as quantity of sale, date of sale, store location, salesclerk, and various other descriptors. All this information may be modeled for consumption within a single data mart.
The modern data warehouse is a powerful tool for your business intelligence team. It enables them to collect all the business data they need so that they can process, clean, and model it for analysis and reporting. With a strong BI team and a great data warehouse, your company will reap the reward of seeing the full interoperability of their business!