We recently tackled the migration of a data warehouse at work, one of the largest in the industry, at least in our country. It turned out that even the planning stage was not as simple as it seems. In this article I share the problems we ran into, how we solved them, and the roadmap we ended up with.
Introduction
It became clear in 2024 that the warehouse had to be migrated to a new platform. The migration could no longer be postponed: we needed to scale the solution further and grow the volume of data, and that was impossible on the existing Western proprietary platform because of sanctions.
Therefore, we decided to migrate to one of the Russian platforms. I deliberately do not name which one, since it is not of fundamental importance for our task. The approaches outlined can be used to plan migration from any data platform to any other.
The team that worked on this task with me:
- Novruz Mamedov, data architect;
- Alexey Kulagin, systems analyst;
- Evgeny Larin, systems architect;
- Sergey Strelnikov, developer.
To move everything without losing anything along the way, we started with a survey of the warehouse and built a detailed migration plan. Let’s look at this task in more detail.
To do this, we collected detailed information about each significant object, including corporate sandboxes. In total there were more than 9,000 warehouse objects, sandboxes included.
To preserve the functionality familiar to business users during the migration, we identified the links and dependencies between objects. There were even more links than objects: a real web that also included source-system objects and the links to them.
The dependency tree can be conveniently represented as a table. I will give an example of such a table in the section “Surveying the data warehouse”.
Initial data
We set out to build a very detailed plan, down to individual data warehouse objects. Problems began to emerge already at this stage: it turned out that some of the functionality of our huge warehouse, about 1 PB in size, was not documented. There is nothing surprising about this: any data warehouse that has been evolving for decades accumulates problems, and ours is no exception. Besides the incomplete documentation:
- attribute-level data lineage was not available in full;
- there was no matrix of the work required for the migration.
It was hard to decide how to approach all this, so we started by restoring the description of object dependencies inside the warehouse.
Surveying the data warehouse
First, we identified inaccuracies in the existing documentation. The necessary information was scattered across different business units, and it was impossible to gather all the employees who had it and get everything we needed from them at once. So we created questionnaires. Below is an example of the information we collected for each consumer system and report.
| Information block | Fields collected |
| Provided information about the report | Report ID; report title; purpose of the report; business area of the report; link to the BI portal; report owner; department; contact information for the report owner; data readiness time requirements; data storage depth; data refresh rate; data update depth; data quality requirements; requirements for separation of access rights; additional requirements |
| Information about the data sources for the report, detailed down to attributes | Report ID; report title; report indicator; the storage object corresponding to the report indicator and the attribute of this object |
| Questionnaire about the consumer system | System name; subsystem name; business area of the system; purpose of the system (in what processes it participates); the unit that owns the system; contact information for the system owner |
| Information about data sources for the consumer system, detailed down to storage objects | System name; subsystem name; storage object; data readiness time requirements; data storage depth; data refresh rate; data update depth |
We had to make several templates, since the data requested from different business units also differed. But with the help of the survey we managed to find out which data warehouse objects are used as sources for corporate reporting, in integrations with consumer systems, and for other purposes. As a result, we got a list of the objects on which each piece of business functionality is based. That was one problem fewer.
After that, we could deal with the second problem: the existing data lineage was incomplete, and its gaps had to be closed. Based on the existing metadata, we developed a universal algorithm that builds a dependency tree (data lineage) from any warehouse object back to its data sources.
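As an illustration, here is a minimal sketch of such a traversal over a metadata table of object-to-object links. The table, column and object names (migration.object_links, BUSINESSMART.SALES_REPORT and so on) are assumptions made for the example, not our actual metadata model:

```sql
-- Sketch: walk the dependency metadata from a given object back to the sources,
-- assuming the dependency graph is acyclic.
-- migration.object_links(object_id, depends_on_id): "object depends on object" links.
WITH RECURSIVE lineage (object_id, source_id, dep_path, lvl) AS (
    SELECT l.object_id,
           l.depends_on_id,
           CAST(l.object_id || ' <- ' || l.depends_on_id AS VARCHAR(4000)),
           1
    FROM migration.object_links l
    WHERE l.object_id = 'BUSINESSMART.SALES_REPORT'   -- the object we start from
    UNION ALL
    SELECT ln.object_id,                              -- keep the starting object
           l.depends_on_id,
           ln.dep_path || ' <- ' || l.depends_on_id,
           ln.lvl + 1
    FROM lineage ln
    JOIN migration.object_links l
      ON l.object_id = ln.source_id                   -- go one level closer to the source
)
SELECT object_id, source_id, lvl, dep_path
FROM lineage
ORDER BY lvl, source_id;
```

The recursion stops at source-system objects, which do not appear on the left-hand side of any link; rows like these can then be pivoted into a layer-by-layer table such as the one shown below.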
We presented the dependency tree as a table. Below is a schematic example of such a table without the names of specific layers, since each company will have its own set of them:
| Source system | Storage layer 1 | … | Storage layer N | Consumer system / Report |
| Source object | Layer 1 object | … | Layer N object | Report |
| etc. | … | … | … | … |
Along the way, the algorithm helped us identify and eliminate “gaps” in the data lineage of some objects, and the improved technical quality of the metadata became a pleasant side effect of the work.
The algorithm also revealed a significant number of marts that were being built in a non-industrial way, that is, in “sandboxes”. This allowed us to refine the scope of work, and “industrializing” such marts became one of the tasks of the migration project.
Thus, “sandboxes” became one more layer of our warehouse.
The data lineage information made it possible to assign every warehouse object to one of the layers:
- raw data layer (RAW);
- detail layer (DDS);
- base mart layer (BaseMart);
- business mart layer (BusinessMart);
- “sandboxes” (SandBoxes).
As a result, the survey allowed us to describe all the existing dependencies in the data warehouse, and we could move on.
The directory of work types, labor drivers, and Halstead metrics
We had one more important problem to solve: preparing the matrix of work required for the migration. The list of work types itself came out of numerous interviews with experts from the data management department and business units. The result was a detailed list covering all stages of building a data warehouse, layer by layer.
A fragment of the list of works (as an example) is given below:
| Work type |
| Development of a logical data model |
| Architectural review of the logical data model |
| Approval of the logical data model |
| Creation of a data transformation prototype |
| Filling in the source-to-target (S2T) mapping |
| Development of data loading and transformation flows |
| Acceptance testing |
| Deployment of a release to production |
| etc. |
For each type of work in the list we defined a driver: a countable entity on which the labor costs depend.
An example of the mapping between work types and drivers:
| Work type | Driver |
| Create an LDM in PowerDesigner | Number of DDS table attributes |
| Fill in the source-to-target (S2T) mapping | Number of DDS tables |
| Formulate requirements/methodology for building marts | Number of business marts |
| etc. | |
Labor cost metrics were defined:
- for each type of work;
- for each role (systems analyst, developer, architect, business analyst, Data Governance analyst);
- for each driver.
Now, with an accurate list of warehouse objects, work types and drivers, as well as metrics by role, we could calculate the labor costs.
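As an illustration, here is a minimal sketch of such a calculation in SQL. All table and column names, as well as the 168 working hours per month, are assumptions made for the example:

```sql
-- Sketch: total labor per work type and role = driver count * per-unit metric.
-- Assumed tables:
--   migration.object_drivers(object_id, driver_name, driver_count)       -- e.g. number of DDS table attributes
--   migration.work_metrics(work_type, role, driver_name, hours_per_unit) -- expert estimates per driver unit
SELECT w.work_type,
       w.role,
       SUM(d.driver_count * w.hours_per_unit)          AS total_hours,
       SUM(d.driver_count * w.hours_per_unit) / 168.0  AS approx_fte_months  -- ~168 working hours per month (assumption)
FROM migration.object_drivers d
JOIN migration.work_metrics w
  ON w.driver_name = d.driver_name
GROUP BY w.work_type, w.role
ORDER BY w.role, total_hours DESC;
```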
It is worth noting that for most work types the labor cost metrics were set by experts as constants. But there is one work type, creating a prototype in the form of SQL code based on analysis of the stored procedure code in the current warehouse, for which labor costs depend on the “complexity” of the code (the number of entities, attributes and operations) and cannot be set as a constant.
To assess code complexity and calculate the labor costs of migrating it, Alexey Kulagin and Sergey Strelnikov proposed and implemented an algorithm based on Halstead metrics, adapted for SQL.
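For reference, the query below computes the standard Halstead quantities for each code object:
- n1 and n2 are the numbers of distinct operators and distinct operands; the vocabulary is n = n1 + n2;
- N1 and N2 are the total numbers of operators and operands; the program length is N = N1 + N2;
- volume V = N × log2(n), difficulty D = (n1 / 2) × (N2 / n2), effort E = D × V;
- estimated time T = E / 18 seconds (18 is the Stroud number), which is then converted into man-hours.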
```sql
-- Teradata SQL: named expressions (nu_op, n_dict, ...) are reused within the same SELECT list.
SELECT
    cnt_method AS calc_name,   -- the code object (stored procedure / method) being assessed
    tablekind  AS calc_type,
    -- n1 and n2: numbers of distinct operators and operands
    Count(DISTINCT CASE WHEN token_type = 'Operation' THEN metric_name END)            AS nu_op,
    Count(DISTINCT CASE WHEN token_type IN ('Operand', 'Common') THEN metric_name END) AS nu_opd,
    -- N1 and N2: total numbers of operators and operands
    Sum(CASE WHEN token_type = 'Operation' THEN metric_value END)                      AS n_sm_op,
    Sum(CASE WHEN token_type IN ('Operand', 'Common') THEN metric_value END)           AS n_sm_opd,
    nu_op + nu_opd                      AS n_dict,    -- vocabulary n = n1 + n2
    n_sm_op + n_sm_opd                  AS n_length,  -- length N = N1 + N2
    n_length * (Log(n_dict) / Log(2))   AS v_length,  -- volume V = N * log2(n)
    (nu_op * n_sm_opd) / (2 * nu_opd)   AS d_length,  -- difficulty D = (n1/2) * (N2/n2)
    d_length * v_length                 AS e_length,  -- effort E = D * V
    18                                  AS k_psy,     -- Stroud number
    e_length / k_psy                    AS t_length,  -- estimated time, seconds
    t_length / 3600                     AS cost_mh,   -- estimated time, man-hours
    NULL                                AS ctg_complex_name
FROM migration.t_obj_complexity         -- pre-parsed tokens of the assessed code
GROUP BY cnt_method, tablekind;
```
Using well-known industry metrics made it easier for all stakeholders to agree on the migration roadmap.
Now we had all the relationships within the warehouse and an estimate of all the work for each object, broken down by role. All that was left was to split the warehouse into blocks so that we could migrate in stages and put the functionality into operation step by step.
Splitting the warehouse into migration blocks
We had split the warehouse into layers, but that was not enough for detailed planning: the warehouse is very large, and different specialists at the business unit level are responsible for its different parts. So we went on to split all warehouse objects into functional migration blocks:
- We identified a list of major subject areas into which the current data warehouse is divided.
- Each large subject area was then divided by functional purpose into smaller sets of objects, the so-called functional migration blocks. For example, we had functional blocks for assortment, receipts, operations, and so on.
We assigned every object of the RAW, DDS and BaseMart layers to a specific functional migration block.
The objects of the BusinessMart layer and the SandBoxes, however, could not be split into functional blocks. The reason lies in how our warehouse is built: any business mart is assembled from data belonging to different functional blocks. Also, unfortunately, during the preliminary survey we were unable to identify clear business priorities for migrating the marts.
Then we went the other way and split the mart layer objects into what we called migration pools (a minimal sketch of the procedure is given after the list). Pooling worked like this:
- For each object of the mart layer and the sandbox layer, a complete dependency tree was built down to the objects of the source systems.
- All objects of the mart and sandbox layers were sorted so that every object in the list is preceded by all the mart layer objects it refers to, directly or indirectly (that is, through other objects).
- The labor costs of migrating each object were already known, so we divided the sorted list into groups containing approximately equal amounts of labor.
- We called these groups mart migration pools.
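A minimal sketch of this pooling in SQL, assuming a metadata table of mart-to-mart dependencies, pre-calculated labor costs per object and a target pool size of 5,000 man-hours (all names and numbers here are assumptions made for the example):

```sql
-- Order marts by dependency depth, so that every mart comes after everything it
-- depends on, then cut the ordered list into pools of roughly equal labor.
-- Assumed tables:
--   migration.mart_objects(object_id)
--   migration.mart_links(object_id, depends_on_id)   -- mart-to-mart dependencies
--   migration.object_costs(object_id, cost_mh)       -- labor costs per object, man-hours
WITH RECURSIVE depth (object_id, lvl) AS (
    -- marts that depend on no other mart start at level 1
    SELECT m.object_id, 1
    FROM migration.mart_objects m
    WHERE NOT EXISTS (SELECT 1 FROM migration.mart_links l WHERE l.object_id = m.object_id)
    UNION ALL
    -- a mart sits one level above each mart it depends on
    SELECT l.object_id, d.lvl + 1
    FROM migration.mart_links l
    JOIN depth d ON d.object_id = l.depends_on_id
),
ordered AS (
    SELECT object_id, MAX(lvl) AS dep_depth    -- longest dependency chain below the mart
    FROM depth
    GROUP BY object_id
)
SELECT o.object_id,
       o.dep_depth,
       -- running total of labor in dependency order, cut every ~5,000 man-hours
       CEILING(SUM(c.cost_mh) OVER (ORDER BY o.dep_depth, o.object_id
                                    ROWS UNBOUNDED PRECEDING) / 5000.0) AS pool_no
FROM ordered o
JOIN migration.object_costs c
  ON c.object_id = o.object_id
ORDER BY pool_no, o.dep_depth, o.object_id;
```

Because the list is cut in dependency order, every mart an object refers to lands in the same pool or an earlier one.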
The survey had shown which reports and consumer systems were “fed” by each mart, so we could say exactly which reports and consumer-system flows could be switched to the new platform after each specific pool was migrated.
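Continuing the sketch with the same assumed names: the pool after which a report can be switched over is simply the maximum pool number over all the marts it consumes.

```sql
-- Sketch: a report can be moved to the new platform once all the marts it
-- consumes have been migrated, i.e. after its "latest" pool.
-- Assumed tables:
--   migration.report_sources(report_id, mart_object_id)  -- from the survey questionnaires
--   migration.mart_pools(object_id, pool_no)              -- result of the pooling above
SELECT r.report_id,
       MAX(p.pool_no) AS ready_after_pool
FROM migration.report_sources r
JOIN migration.mart_pools p
  ON p.object_id = r.mart_object_id
GROUP BY r.report_id
ORDER BY ready_after_pool, r.report_id;
```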
This division of the marts into pools is very important: it is the criterion for stage-by-stage migration when drawing up the roadmap.
Oddly enough, this mechanical division of the marts ended up grouping them by related functionality.
With all object dependencies identified, labor costs estimated for each type of migration work, and the objects divided into the groups in which we would migrate them, we could finally begin to form the migration roadmap.
The final migration roadmap
The final roadmap has two large consecutive stages:
- migration of the warehouse core: the RAW, DDS and BaseMart layers;
- migration of the mart layer: BusinessMart and SandBoxes.
These stages run essentially one after the other, with a small one-month “overlap”: the migration of the warehouse core is completed first, and then the migration of the mart layer begins.
The migration of the warehouse core is divided into functional blocks, and each block can be migrated independently. In our case, some functional blocks were therefore worked on in parallel to reduce the overall duration of the migration project.
Within each functional block, the core migration work was grouped into four stages:
- systems analysis (putting the documentation in order and fixing inaccuracies);
- RAW layer migration;
- DDS layer migration;
- BaseMart layer migration.
The sequence of execution is as follows: systems analysis and the RAW layer migration run in parallel, after which the DDS layer and then the BaseMart layer are migrated sequentially.
The roadmap for the warehouse core migration looks like this (the actual FTE values are masked with ***):
| Scenario | Conservative |
| Role | Systems Analyst |
| Resource type | Contractor |
| Name of the indicator | FTE |

| | January 2025 | February 2025 | March 2025 | April 2025 | May 2025 | June 2025 | July 2025 | etc. |
| Total | *** | *** | *** | *** | *** | *** | *** | |
| Functional block 1 | | | | | | | | |
| System analysis | *** | *** | | | | | | |
| RAW Migration | | | | | | | | |
| DDS Migration | | | | *** | *** | *** | *** | |
| BaseMart Migration | | | | | | | | |
| Functional block 2 | | | | | | | | |
| System analysis | *** | *** | *** | | | | | |
| RAW Migration | | | | | | | | |
| DDS Migration | | | | | *** | *** | *** | |
| BaseMart Migration | | | | | | | | |
| Functional block 3 | | | | | | | | |
| System analysis | | | | | | | | |
| RAW Migration | | | | | | | | |
| DDS Migration | | | | | *** | *** | *** | |
| BaseMart Migration | | | | | | | | |
| etc. | | | | | | | | |
The migration of the mart layer begins after the core migration ends and is based on the following principles:
- Each subsequent pool depends on all previous ones and does not include the labor costs of objects from previous pools.
- The pools are migrated strictly one after another, with a small overlap of one month.
- This overlap comes from starting the systems analysis of the next pool before the work on the previous pool of marts is finished.
- The mart functionality is put into operation in stages, pool by pool, rather than all at once at the end.
In theory, all objects could be put into production at once at the very end. But common sense matters here: on one side of the scale are the deadlines, on the other the need to bring in a large number of specialists at once and the difficulty of managing an oversized team. Step-by-step migration lets us bring parts of a huge data warehouse into operation, and switch the old ones off, with the least risk, and it reduces the load on technical support at any given moment.
The mart layer migration roadmap (the actual FTE values are masked with ***):
| Scenario | Conservative |
| Name of the indicator | FTE |

| | December 2025 | January 2026 | February 2026 | March 2026 | April 2026 | May 2026 | June 2026 | etc. |
| Pool 1 | *** | *** | *** | | | | | |
| Pool 2 | | *** | *** | *** | | | | |
| Pool 3 | | | *** | *** | *** | | | |
| Pool 4 | | | | *** | *** | *** | | |
| Pool 5 | | | | | *** | *** | *** | |
| Pool 6 | | | | | | *** | *** | |
| Pool 7 | | | | | | | *** | |
| Pool 8 | | | | | | | | |
| etc. | | | | | | | | |
At all its stages, our roadmap is built for several scenarios, assessing different types of resources and roles.
Distinctive features of our plan:
- The functionality of the old warehouse is phased out in step with the introduction of the functionality of the new, import-substituting warehouse, which spreads the team’s workload over time.
- Planning is detailed down to each role for all types of work at each layer of the data warehouse.
- Our plan can be easily recalculated when the volume and composition of work changes.
The benefits of our roadmap
This approach allowed us to formulate criteria for the step-by-step migration of functionality and for starting its operation on the new platform. For each pool we know the list of reports and consumer-system flows that are ready to move to the target platform once that pool has been migrated.
The phased migration allowed us to decommission some of the functionality on the old platform earlier and save money.
Our plan is flexible enough and can be easily recalculated when a number of parameters change:
- the start date of migration of a core functional block;
- the duration of migration of a core functional block;
- the date when sources are ready to be connected to the RAW layer.
Several scenarios of the plan were calculated depending on when the sources would be ready to be connected to RAW. The scenarios varied by:
- the earliest date to start migrating the RAW layer;
- the latest date to start migrating the RAW layer (the latest date by which we can begin connecting sources).
So, by changing a few parameters, you can tailor this roadmap to your own warehouse.
Results
Building the roadmap allowed us to identify and fix problems in the old warehouse rather than carry them over to the new one, and it gave us the following benefits:
- Manage the migration as flexibly as possible, easily recalculating deadlines and key project milestones when the set of objects or their migration priorities change.
- Build optimistic, realistic and pessimistic plans.
- See in advance the staffing bottlenecks of the migration.
- Plan in detail the involvement of specialists for each role.
- Formulate criteria for migrating each reporting block to the new import-independent stack.
- And, finally, carry out the migration itself in the shortest possible time with optimal use of resources.