A Data Lake for monitoring credit risk factors and data analysis using Data Mining tools based on Hadoop technologies
Details about the Client

VTB Bank PJSC is a core organization of VTB Banking Group and one of the largest banks in Russia.

Project Summary

As part of its digital transformation strategy, VTB Bank is developing data analysis tools for risk management and for any other processes where machine learning algorithms can bring efficiencies and big data processing is required. Using the Hadoop technology stack, a storage based on the Data Lake concept has been deployed to aggregate and process unstructured and semi-structured data ingested from the corporate DWH and other internal sources, as well as data from an external source — Pravo.ru.

Automation of credit risk factor monitoring for corporate customers became the first practical task addressed with the Data Lake. A total of 20 credit risk factor (CRF) indicators are calculated every day, and customer ratings based on the calculated CRFs are visualized. In addition, the bank's credit analysts can look through detailed customer information using BI tools. Importantly, business users work with the data without involving the IT department.
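
As an illustration only, the sketch below shows what one such daily indicator calculation could look like in PySpark. The table layout, column names, and the overdue-exposure indicator are assumptions; the case study does not disclose the actual CRF definitions.

```python
# Minimal sketch of a daily credit risk factor (CRF) computation job in
# PySpark. Paths, columns, and the sample indicator are hypothetical; the
# source only states that ~20 such indicators are computed every day.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-crf-indicators").getOrCreate()

# Raw corporate exposures previously ingested into the Data Lake as Parquet.
exposures = spark.read.parquet("/datalake/raw/corporate_exposures")

# Example indicator: share of overdue exposure per customer (one of many).
crf = (
    exposures
    .groupBy("customer_id")
    .agg(
        F.sum("exposure_amount").alias("total_exposure"),
        F.sum(F.when(F.col("days_past_due") > 0, F.col("exposure_amount"))
               .otherwise(F.lit(0.0))).alias("overdue_exposure"),
    )
    .withColumn("overdue_ratio",
                F.col("overdue_exposure") / F.col("total_exposure"))
    .withColumn("calc_date", F.current_date())
)

# Persist as a partitioned data mart that BI tools can query downstream.
crf.write.mode("overwrite").partitionBy("calc_date") \
   .parquet("/datalake/marts/crf_indicators")
```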

Moreover, the Data Lake and the analytical sandboxes created on top of it give Data Scientists the opportunity to make productive mistakes, quickly test hypotheses, and change models using Agile methods.

Solution

The Data Lake is built on the Cloudera CDH platform; data loading is implemented using Apache Oozie, Apache Spark, and Apache Sqoop. Data from backend banking systems and external sources is kept in self-descriptive formats such as JSON and Apache Parquet. Unified SQL access to the raw data and data marts is implemented by means of Apache Spark SQL and Apache Impala.
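
A minimal sketch of this unified SQL access, assuming illustrative paths and table names: raw JSON from an external feed and Parquet landed from backend systems are registered as views and joined with plain Spark SQL.

```python
# Hedged sketch of unified SQL access over the raw layer. All paths, views,
# and columns are illustrative assumptions, not the bank's actual schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datalake-sql-access").getOrCreate()

# Semi-structured records from an external source, kept in self-descriptive
# JSON; the schema is inferred on read.
spark.read.json("/datalake/raw/legal_cases").createOrReplaceTempView("legal_cases")

# Structured data from backend systems, landed as Parquet by the ingest jobs.
spark.read.parquet("/datalake/raw/customers").createOrReplaceTempView("customers")

# Analysts can join the raw layers directly with SQL, without a separate ETL step.
result = spark.sql("""
    SELECT c.customer_id, c.name, COUNT(l.case_id) AS open_cases
    FROM customers c
    LEFT JOIN legal_cases l ON l.inn = c.inn AND l.status = 'open'
    GROUP BY c.customer_id, c.name
""")
result.show()
```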
Machine and deep learning tools such as scikit-learn, Apache Spark MLlib, H2O, TensorFlow, and Keras have been deployed to address data exploration tasks. The Apache Zeppelin and JupyterHub research and visualization tools have also been introduced, giving data scientists all the necessary data from the Data Lake along with data analytics libraries. Users now have access to a unified space with larger volumes of more diverse data for deeper and more precise customer analysis on the high-performance, scalable Apache Hadoop platform.
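
For flavor, here is a minimal sandbox-style experiment of the kind such tools enable, using scikit-learn. The training extract, feature set, and default-flag target are hypothetical; the source does not describe the actual models.

```python
# Hypothetical sandbox experiment: fit a simple classifier on CRF features
# exported from the Data Lake. File, columns, and target are assumptions.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical training extract: CRF indicators plus a default flag.
df = pd.read_parquet("/datalake/marts/crf_training_sample.parquet")
X = df.drop(columns=["customer_id", "defaulted"])
y = df["defaulted"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = GradientBoostingClassifier()
model.fit(X_train, y_train)

# Evaluate discrimination on the held-out sample.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Validation ROC AUC: {auc:.3f}")
```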

In order to deliver the project within 3 months, the team adopted a process based on the Scrum methodology, using DevOps tools and principles. VTB Bank and Neoflex experts worked in a single Scrum team under the supervision of a Scrum coach. Thanks to this close cooperation, the engineers were able to create the entire infrastructure needed for software development, deployment, and operation, and to build continuous pipeline-based delivery of updates (CI/CD).


Outcomes

The creation of the Data Lake made it possible to collect, aggregate, and process data from diverse sources in a unified space. The infrastructure ensures high-performance, high-quality monitoring of credit risk factors for the bank's corporate customers and provides tools for comprehensive data analysis and visualization, as well as for forecasting and developing new models.

Thanks to Hadoop, developing and scaling the solution does not require the capital investments that building a data warehouse with traditional technologies would.

Interview

Maxim Kondratenko
We have gained proven experience in integrating siloed external and internal data sources into a unified environment to improve the quality and performance of risk assessment by combining approaches to analysis and information processing — from classical statistical analysis to machine learning methods — while taking advantage of open-source technologies and developing project management competences. This pilot project required not so much financial investment as a readiness to change established processes, mindsets, and culture. A lot of things point to the fact that we have created the agility necessary to enable ongoing improvement, experimentation, and innovation.
