Small Data Warehouse Options in Azure

Monce Picard
Jan 7, 2023

With rapid technological development and intense competition, most businesses need data analytics to gain the insights required to compete and grow. Data analytics is no longer a distant dream but an urgent must-have. A data analytics platform should be set up as soon as possible for any existing transaction system, and even better, planned from the start when we are building systems that will generate data. Architecting it early is a wise move.

One of the common components of a data analytics platform is a data warehouse. It stores vast amounts of current and historical data for business intelligence tools to analyze and use. But what if the data is not that big yet, as is usually the case in newly developed systems? We often call this a small data warehouse. What are the options for building it?

Azure, as a cloud platform provider, has a wide range of technologies to support data analytics needs, including data warehousing. Many services can be used to implement data warehousing requirements. Among the things to consider when choosing the service that fits a small data warehouse solution are cost and performance, which are the topic of this article.

To compare performance, we use a simplified set of data: one dimension table (Member) and one fact table (Trade). The structures are as follows:

-- Dimension table: one row per member
CREATE TABLE [DimMember](
    [Code]     [varchar](20)    NOT NULL,
    [MemberId] [numeric](18, 0) NOT NULL,
    [Name]     [varchar](50)    NULL
);


-- Fact table: one row per trade
CREATE TABLE [FactTrade] (
    [MarketId]        [smallint]       NOT NULL,
    [TradeIdentifier] [int]            NOT NULL,
    [BusinessDateId]  [smallint]       NOT NULL,
    [ProductId]       [int]            NOT NULL,
    [Price]           [numeric](18, 2) NOT NULL,
    [Quantity]        [numeric](20, 5) NULL,
    [SellerMemberId]  [int]            NOT NULL,
    [BuyerMemberId]   [int]            NOT NULL
);
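
As an aside we did not test here: fact tables in data warehouses are often stored as a clustered columnstore index, which compresses well and speeds up large aggregation scans. On Azure SQL Database vCore tiers (and by default for Synapse SQL Pool tables), that would be a one-line change:

-- Alternative storage layout, not part of the tests below
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactTrade ON [FactTrade];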

We have 3 samples of data:
- table FactTrade contains almost 10 million rows (Small Fact)
- table FactTrade contains more than 64 million rows (Medium Fact)
- table FactTrade contains more than 128 million rows (Large Fact)
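
For anyone who wants to reproduce a similar setup, here is a minimal sketch of one way to generate such synthetic samples. All the value ranges below (5,000 members, the modulo-based prices and quantities) are arbitrary illustrations, not the actual test data; the Medium and Large samples follow by raising the TOP clause.

-- Populate DimMember with 5,000 synthetic members (arbitrary values)
;WITH m AS (
    SELECT TOP (5000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
INSERT INTO [DimMember] ([Code], [MemberId], [Name])
SELECT CONCAT('M', rn), rn, CONCAT('Member ', rn)
FROM m;

-- Generate the Small Fact sample (~10 million synthetic rows)
;WITH n AS (
    SELECT TOP (10000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b
    CROSS JOIN sys.all_objects c
)
INSERT INTO [FactTrade] ([MarketId], [TradeIdentifier], [BusinessDateId],
    [ProductId], [Price], [Quantity], [SellerMemberId], [BuyerMemberId])
SELECT 1, rn, rn % 365 + 1, rn % 1000 + 1,
       (rn % 500 + 1) * 1.25, (rn % 100 + 1) * 0.5,
       rn % 5000 + 1, rn % 5000 + 1
FROM n;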

We use a simple query as follows:

SELECT dm.Code
     , SUM(ft.Quantity * ft.Price) AS totalAmount
FROM [dbo].[DimMember] dm
INNER JOIN [dbo].[FactTrade] ft
    ON dm.MemberId = ft.BuyerMemberId
GROUP BY dm.Code;

We try a simple index to reduce page reads on Azure SQL Database as follows:

CREATE NONCLUSTERED INDEX IX_FactTrade ON FactTrade 
(BuyerMemberId) INCLUDE (Quantity, Price);
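
To check whether the index actually reduces page reads, one standard approach is to enable I/O and time statistics before running the query; SQL Server then reports logical reads per table along with CPU and elapsed time in the Messages output:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Re-run the test query; the Messages tab now shows logical reads
-- per table plus CPU and elapsed time for each execution.
SELECT dm.Code
     , SUM(ft.Quantity * ft.Price) AS totalAmount
FROM [dbo].[DimMember] dm
INNER JOIN [dbo].[FactTrade] ft
    ON dm.MemberId = ft.BuyerMemberId
GROUP BY dm.Code;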

Here is the result:

Query comparison results in seconds (GP: General Purpose, HS: Hyperscale)

We run each condition more than once to get a performance range. For example, on Azure SQL General Purpose with 2 cores using the Medium Fact, the query completes in between 26 seconds (fastest) and 41 seconds (slowest). We highlight a cell green if either number is less than or equal to 10 seconds, and yellow if either number is less than or equal to 20 seconds.
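
As an illustration of how such ranges can be collected, here is a minimal sketch that times five repeated runs of the test query and reports the fastest and slowest; the run count and variable names are our own choices:

DECLARE @i int = 0, @t0 datetime2;
DECLARE @timings TABLE (run_no int, elapsed_ms int);

WHILE @i < 5
BEGIN
    SET @t0 = SYSUTCDATETIME();

    -- The query under test
    SELECT dm.Code, SUM(ft.Quantity * ft.Price) AS totalAmount
    FROM dbo.DimMember dm
    INNER JOIN dbo.FactTrade ft ON dm.MemberId = ft.BuyerMemberId
    GROUP BY dm.Code;

    INSERT INTO @timings VALUES (@i + 1, DATEDIFF(millisecond, @t0, SYSUTCDATETIME()));
    SET @i += 1;
END

SELECT MIN(elapsed_ms) AS fastest_ms, MAX(elapsed_ms) AS slowest_ms
FROM @timings;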

To compare cost, we use the Azure Pricing Calculator with components as follows:

Price list comparison of the Azure component options

As a note, we only use 50 GB (as small as we can, relatively) on Azure SQL Database, compared with 1 TB (the smallest portion) on Azure Synapse SQL Pool in the Calculator. In our tests, even with more than 128 million rows in a single fact table, storage consumption stayed under 20 GB on Azure SQL Database.
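
That storage figure is easy to verify with the standard system procedure:

-- Reports reserved, data, and index sizes
EXEC sp_spaceused N'dbo.FactTrade';  -- for a single table
EXEC sp_spaceused;                   -- for the whole database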

By looking at the query performance results and the Azure resource pricing table, we can pick out a few points as a basis for deciding how to build a small data warehouse.

  • If we want to query a table with under 10 million rows, Azure SQL Database General Purpose with 2 vCores will be enough. We can get results in under 10 seconds at the lowest monthly cost.
  • If we want to query a table with more than 10 million rows, we need to set a number for how long we are willing to wait as our acceptance criterion. Let's say 30 seconds is our threshold; we could then choose Azure SQL Database General Purpose with 4 vCores.
  • If our total data size reaches 1 TB and above, with many tables consisting of more than 128 million rows, we would select Synapse Analytics SQL Pool as our best option on both performance and cost.

Many scenarios can be added, for example: we do not need the data warehouse 24x7, so we can pause and resume it; we require backup storage for a disaster recovery strategy; data growth surpasses 10 million rows in a short time; and so on. These considerations may lead us to adjust some of the points above when deciding what to use for a small data warehouse.
