Big Data Warehouse Migration Roadmap

We recently tackled a data warehouse migration at work, one of the largest in the industry, at least in our country. It turned out that even at the planning stage things are not as simple as they seem. Here I share the problems we ran into, how we solved them, and the roadmap that came out of it.

Introduction

It became clear in 2024 that the warehouse had to be migrated to a new platform. The migration could no longer be postponed: we needed to scale the solution further and grow the data volume, and this was impossible on the existing Western proprietary platform because of sanctions.

Therefore, we decided to migrate to one of the Russian platforms. I deliberately do not name which one, since it is not fundamentally important here: the approaches described can be used to plan a migration from any data platform to any other.

A whole team worked on this task with me.

To move everything without losing anything along the way, we started with a survey of the data warehouse and built a detailed migration plan. Let's look at this task in more detail.

To achieve this, we collected detailed information about every significant object, including corporate sandboxes. In total there were a lot of storage objects, sandboxes included: more than 9,000.

To preserve the functionality familiar to business users during the migration, we identified the links between objects and their dependencies on each other. There turned out to be even more links than objects. We ended up with a real web that also included source-system objects and the links to them.

The dependency tree can be represented as a table, which is often the most convenient form. I give an example of such a table in the "Survey" section.

Initial data

We started building a very detailed plan, down to the level of individual data warehouse objects. And already at this stage problems began to emerge. It turned out that part of the functionality of our huge data warehouse, about 1 PB in size, was simply not documented. There is nothing surprising about this: any data warehouse, especially one that has been evolving for decades, accumulates problems, and we are no exception. The insufficient documentation was not the only issue.

It was difficult to decide how to approach all this, so we started by restoring the description of object dependencies inside the warehouse.

Survey

First, we identified inaccuracies in the existing description. The problem was that the necessary data was scattered across different business units. It was impossible to gather all the employees who had the necessary information and get everything we needed from them at once, so we created questionnaires. Below is an example of the information we collected for each report and consumer system.

Provided information about the report:
- Report ID
- Report title
- Purpose of the report
- Business area of the report
- Link to the BI portal
- Report owner department
- Contact information for the report owner
- Data readiness time requirements
- Data storage depth
- Data refresh rate
- Data update depth
- Data quality requirements
- Requirements for separation of access rights
- Additional requirements

Information about the data sources for the report, detailed down to attributes:
- Report ID
- Report title
- Report indicator
- The storage object corresponding to the report indicator, and the attribute of this object

Questionnaire about the consumer system:
- System name
- Subsystem name
- Business area of the system
- Purpose of the system (in what processes it participates)
- The unit that owns the system
- Contact information for the system owner

Information about data sources for the consumer system, detailed down to storage objects:
- System name
- Subsystem name
- Storage object
- Data readiness time requirements
- Data storage depth
- Data refresh rate
- Data update depth

We had to make several templates, since the data requested from different business units also differed. But with the help of the survey we found out which data warehouse objects are used as sources for corporate reporting, in integrations with consumer systems, and for other purposes. As a result, we got a list of the objects that each piece of business functionality relies on. That was one problem down.

After that we could deal with the second problem. The existing data lineage was incomplete, and the gaps in it had to be closed. To do this, based on the existing metadata, we developed a universal algorithm that builds a dependency tree (data lineage) for any storage object all the way back to its data source.
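I cannot share the actual implementation, but the idea can be illustrated with a minimal sketch. Assume the metadata is available as a hypothetical table migration.t_obj_dependency (obj_name, obj_layer, src_obj_name, src_layer), one row per "object is loaded from object" link; a recursive query then walks these links from any object back to the source systems:

-- Minimal sketch only: migration.t_obj_dependency and its columns are assumed
-- names, not our actual metadata model.
WITH RECURSIVE lineage (final_obj, obj_name, obj_layer, lvl, lineage_path) AS (
    -- Anchor: start from the objects whose lineage we want to trace
    SELECT d.obj_name,
           d.obj_name,
           d.obj_layer,
           0,
           CAST(d.obj_name AS VARCHAR(10000))
      FROM migration.t_obj_dependency d
     WHERE d.obj_layer = 'BusinessMart'
    UNION ALL
    -- Step: follow the "loaded from" link one level towards the sources
    SELECT l.final_obj,
           d.src_obj_name,
           d.src_layer,
           l.lvl + 1,
           l.lineage_path || ' <- ' || d.src_obj_name
      FROM lineage l
      JOIN migration.t_obj_dependency d
        ON d.obj_name = l.obj_name
     WHERE l.lvl < 20  -- depth guard against cycles in dirty metadata
)
SELECT final_obj, obj_name, obj_layer, lvl, lineage_path
  FROM lineage
 ORDER BY final_obj, lvl;

Any chain that never reaches a source system immediately shows up as a gap that needs to be closed.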

We represented the dependency tree as a table. Below is a schematic example of such a table without the names of specific layers; each company will have its own set of them:

| Source system | Storage layer 1 | … | Storage layer N | Consumer system / Report |
| --- | --- | --- | --- | --- |
| Source object | Layer 1 object | … | Layer N object | Report |
| etc. | | | | |

In the process, the algorithm helped us find and close gaps in the data lineage of some objects, and the improved technical quality of the metadata became a pleasant side effect of our work.

The algorithm also identified a significant number of data marts that were being built in a non-industrial way, i.e. in sandboxes. This allowed us to clarify the scope of the work, and industrializing such marts became one of the tasks of the migration project.

Thus, sandboxes became one more layer of our warehouse.

The information from the data lineage made it possible to assign every storage object to one of the layers: RAW, DDS, BaseMart, BusinessMart or the sandboxes.
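As a simple self-check, counting objects per layer makes it easy to spot anything left unclassified; the table and column names here are again hypothetical:

-- migration.t_obj_registry(obj_name, layer_name) is an assumed object inventory
SELECT COALESCE(layer_name, 'UNCLASSIFIED') AS layer_name,
       COUNT(*)                             AS object_cnt
  FROM migration.t_obj_registry
 GROUP BY 1
 ORDER BY object_cnt DESC;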

As a result, the survey allowed us to describe all the existing dependencies in the data warehouse, and we could move on.

Job Directory, Labor Drivers, and Halstead Metrics

We had to solve one more important problem: preparing a matrix of the work required for the migration. The list of work itself came out of numerous interviews with experts from the data management department and the business units. It turned out to be a detailed list covering all stages of building a data warehouse, layer by layer.

A fragment of the list of works (as an example) is given below:

Job:
- Development of a logical data model
- Architectural review of the logical data model
- Coordination of the logical data model
- Creating a data transformation prototype
- Filling in the source-to-target (S2T) mapping
- Developing data loading and transformation flows
- Conducting acceptance tests
- Deploying a release to production
- etc.

For each type of work in the list, we defined a driver: a countable entity on which the labor costs depend and by which they can be measured.

An example of the mapping between works and drivers:

| Job | Driver |
| --- | --- |
| Create an LDM (logical data model) in PowerDesigner | Number of DDS table attributes |
| Form S2T | Number of DDS tables |
| Form requirements/methodology for building data marts | Number of business marts |
| etc. | |

 

We then defined labor cost metrics for each role.

Now, with an accurate list of storage objects, types of work and drivers, plus per-role metrics, we could calculate the labor costs.
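In essence, the calculation boils down to multiplying the driver count by the role metric for every object and type of work. A simplified sketch, with all table and column names invented for illustration (the real model also carries scenarios and resource types):

-- Assumed illustrative tables:
--   migration.t_work_driver(work_name, driver_name)               -- job directory -> driver
--   migration.t_obj_driver_cnt(obj_name, driver_name, driver_cnt) -- driver counts per storage object
--   migration.t_role_metric(work_name, role_name, hours_per_unit) -- expert constants, man-hours per driver unit
SELECT o.obj_name,
       w.work_name,
       m.role_name,
       SUM(o.driver_cnt * m.hours_per_unit) AS labor_mh
  FROM migration.t_work_driver    w
  JOIN migration.t_obj_driver_cnt o
    ON o.driver_name = w.driver_name
  JOIN migration.t_role_metric    m
    ON m.work_name = w.work_name
 GROUP BY o.obj_name, w.work_name, m.role_name
 ORDER BY o.obj_name, labor_mh DESC;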

It is worth noting that for most types of work the labor cost metrics are set by experts as constants. But there is one type of work, creating a prototype in the form of SQL code based on analysis of the stored procedure code in the current data warehouse, where the labor cost depends on the "complexity" of the code (the number of entities, attributes and operations) and cannot be set as a constant.

To assess the complexity of the code and estimate the labor costs of migrating it, Alexey Kulagin and Sergey Strelnikov proposed and implemented an algorithm based on Halstead metrics, adapted for SQL:

-- Halstead metrics over parsed tokens of the current warehouse's SQL code.
-- migration.t_obj_complexity presumably holds one row per token per analyzed
-- object (cnt_method), with token_type ('Operation' / 'Operand' / 'Common'),
-- metric_name (the token) and metric_value (its count).
-- Note: referencing select-list aliases (nu_op, n_dict, ...) in later
-- expressions is a Teradata extension; elsewhere wrap the query in a derived table.
SELECT
    cnt_method AS calc_name,
    tablekind  AS calc_type,
    COUNT(DISTINCT CASE WHEN token_type = 'Operation'           THEN metric_name END) AS nu_op,   -- n1: distinct operators
    COUNT(DISTINCT CASE WHEN token_type IN ('Operand','Common') THEN metric_name END) AS nu_opd,  -- n2: distinct operands
    SUM(CASE WHEN token_type = 'Operation'           THEN metric_value END) AS n_sm_op,           -- N1: total operators
    SUM(CASE WHEN token_type IN ('Operand','Common') THEN metric_value END) AS n_sm_opd,          -- N2: total operands
    nu_op + nu_opd                    AS n_dict,    -- vocabulary n = n1 + n2
    n_sm_op + n_sm_opd                AS n_length,  -- length N = N1 + N2
    n_length * (LOG(n_dict) / LOG(2)) AS v_length,  -- volume V = N * log2(n)
    (nu_op * n_sm_opd) / (2 * nu_opd) AS d_length,  -- difficulty D = (n1 / 2) * (N2 / n2)
    d_length * v_length               AS e_length,  -- effort E = D * V
    18                                AS k_psy,     -- Stroud number: mental discriminations per second
    e_length / k_psy                  AS t_length,  -- estimated time, seconds
    t_length / 3600                   AS cost_mh,   -- estimated time, man-hours
    NULL                              AS ctg_complex_name
FROM migration.t_obj_complexity
GROUP BY cnt_method, tablekind;  -- aggregate token counts per analyzed object

Using well-known industry metrics made it easier for all stakeholders to agree on the migration roadmap.

Now we had all the relationships within the warehouse and an estimate of all the work for each object, broken down by role. All that was left was to split the warehouse into blocks so that we could migrate in stages and bring the functionality into operation step by step.

Splitting storage into migration blocks

We had split the storage into layers, but that was not enough for detailed planning: the warehouse is very large, and different specialists at the business unit level are responsible for its different parts. So we went on to split all storage objects into functional migration blocks.

We assigned every object of the RAW, DDS and BaseMart layers to a specific functional migration block.

It turned out to be impossible to split the BusinessMart layer objects and the sandboxes into functional blocks. The reason lies in how our warehouse is built: any business mart is assembled from data belonging to different functional blocks. Unfortunately, the preliminary survey also failed to reveal clear business priorities for migrating the marts.

So we took a different route and split the mart layer objects into what we called migration pools.

Our survey showed which reports and consumer systems each mart "feeds", so we could say exactly which reports and consumer-system flows could be moved to the new platform after each specific pool had been migrated.
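For example, assuming the pools are migrated in order of their numbers and using two hypothetical mapping tables, the reports that become fully available after pool 3 are those whose every source mart sits in pools 1 to 3:

-- Assumed tables: migration.t_mart_pool(mart_name, pool_id),
--                 migration.t_report_source(report_id, mart_name)
SELECT r.report_id
  FROM migration.t_report_source r
  JOIN migration.t_mart_pool     p
    ON p.mart_name = r.mart_name
 GROUP BY r.report_id
HAVING MAX(p.pool_id) <= 3;  -- every source mart of the report is already migrated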

This division of the marts into pools is very important: it is the criterion for staging the migration when drawing up the roadmap.

Oddly enough, this mechanical division of the marts ended up grouping them by related functionality.

Once we had identified all the object dependencies, estimated the labor costs for every type of migration work, and divided the objects into the groups in which we would migrate them, we could finally start putting together the migration roadmap.

The final migration roadmap

The final roadmap has two large consecutive stages: migration of the warehouse core and migration of the mart layer.

These stages proceed essentially sequentially, with a small overlap of one month: first the migration of the warehouse core is fully completed, and then the migration of the mart layer begins.

The migration of the warehouse core is divided into functional blocks. Each block can be migrated independently, so in our case some functional blocks were worked on in parallel to reduce the overall duration of the migration project.

All core-migration work within a functional block was grouped into four stages: system analysis, RAW layer migration, DDS layer migration and BaseMart layer migration.

The sequence of execution is as follows: system analysis and the RAW layer migration are carried out in parallel, and then the DDS layer and the BaseMart layer are migrated one after the other.
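For planning, the per-block and per-stage labor estimates can be turned into rough durations for an assumed team size. In the sketch below the table name, the 160 hours per FTE-month and the two FTE per role are all assumptions:

-- migration.t_block_labor(func_block, stage_name, role_name, labor_mh) is an
-- assumed roll-up of the labor-cost calculation shown earlier.
SELECT func_block,
       stage_name,
       role_name,
       labor_mh,
       CAST(labor_mh / (160.0 * 2) AS DECIMAL(8,1)) AS duration_months  -- 2 FTE per role, ~160 h per FTE-month
  FROM migration.t_block_labor
 ORDER BY func_block, stage_name, role_name;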

The roadmap for the warehouse core migration looks like this:

Scenario: Conservative
Role: Systems Analyst
Resource type: Contractor
Indicator: FTE

| | January 2025 | February 2025 | March 2025 | April 2025 | May 2025 | June 2025 | July 2025 | etc. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Total | *** | *** | *** | *** | *** | *** | *** | |

[The table then lists each functional block (Functional block 1, 2, 3, and so on) with monthly FTE figures for its four stages: System analysis, RAW Migration, DDS Migration and BaseMart Migration. The figures themselves are shown as "***".]
The migration of the mart layer begins after the core migration is complete.

In theory, all objects could be put into production at once at the very end. But it is important to apply common sense: on one side of the scale are the deadlines, on the other the need to bring in a large number of specialists at once and the difficulty of managing an oversized team. A step-by-step migration lets you bring parts of a huge data warehouse into operation, and switch the old ones off, with the least risk, and it reduces the load on technical support at any given moment.

The mart layer migration roadmap:

Scenario: Conservative
Indicator: FTE

| | December 2025 | January 2026 | February 2026 | March 2026 | April 2026 | May 2026 | June 2026 | etc. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |

[The rows list Pool 1 through Pool 8 and further. The pools are migrated one after another, each in its own window of months, with the FTE figures shown as "***".]

At every stage, our roadmap is built for several scenarios and assesses different resource types and roles.

Distinctive features and benefits of our roadmap

This approach allowed us to formulate criteria for migrating functionality step by step and putting it into operation on the new platform. For each pool, we know the list of reports and consumer-system flows that are ready to be moved to the target platform once that pool has been migrated.

The phased migration allowed us to decommission some of the functionality on the old platform earlier and save money.

Our plan is flexible enough to be easily recalculated when a number of parameters change.

We calculated several scenarios of the plan, depending on the readiness of the sources to be connected to the RAW layer.

So, by changing a few parameters, you can tailor this roadmap to your own warehouse.

Results

Putting the roadmap together allowed us to identify and fix problems in the old warehouse instead of carrying them over to the new one, and it brought us a number of additional benefits.