As part of its digital transformation strategy, VTB Bank develops data analysis tools for risk management and for processes where machine learning algorithms can bring efficiencies and big data processing is required. Using the Hadoop technology stack, a storage based on the Data Lake concept has been deployed to aggregate and process unstructured and semi-structured data ingested from the corporate DWH and other internal sources, as well as from an external source, Pravo.ru.
Automated monitoring of credit risk factors for corporate customers became the first practical task addressed with the Data Lake. A total of 20 credit risk factor (CRF) indicators are calculated every day, and customer ratings based on the calculated CRF are visualized. Additionally, the bank's credit analysts can look through detailed information on a customer using BI tools. Importantly, business users work with the data without involving the IT department.
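To make the daily CRF pass concrete, here is a minimal sketch of how indicators might be evaluated per customer and rolled up into a rating. The indicator names, thresholds, and rating scale are purely illustrative assumptions, not the bank's actual methodology, and the data is synthetic.

```python
# Hedged sketch of a daily credit-risk-factor (CRF) pass over customer
# records. All indicator names and thresholds are hypothetical.

CUSTOMERS = [
    {"id": "C001", "overdue_days": 45, "debt_to_revenue": 0.8, "lawsuits": 3},
    {"id": "C002", "overdue_days": 0,  "debt_to_revenue": 0.2, "lawsuits": 0},
]

# Each indicator is a named predicate over a customer record.
INDICATORS = {
    "overdue_over_30d": lambda c: c["overdue_days"] > 30,
    "high_leverage":    lambda c: c["debt_to_revenue"] > 0.5,
    "active_lawsuits":  lambda c: c["lawsuits"] > 0,
}

def evaluate_crf(customer):
    """Return the set of triggered CRF indicators for one customer."""
    return {name for name, check in INDICATORS.items() if check(customer)}

def rate(customer):
    """Map the triggered-indicator count to an illustrative rating."""
    n = len(evaluate_crf(customer))
    return "red" if n >= 2 else "amber" if n == 1 else "green"

for c in CUSTOMERS:
    print(c["id"], rate(c), sorted(evaluate_crf(c)))
```

In the real solution this logic would run as a scheduled batch over the Data Lake, with the results feeding the rating visualizations the analysts see.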
Moreover, the Data Lake and the analytical sandboxes built on top of it let Data Scientists fail productively, quickly test hypotheses, and iterate on models using Agile methods.
Solution

The Data Lake is built on the Cloudera CDH platform; data loading is implemented with Apache Oozie, Apache Spark, and Apache Sqoop. Data from backend banking systems and external sources is kept in self-describing formats such as JSON and Apache Parquet. Unified SQL access to the raw data and data marts is provided by Apache Spark SQL and Apache Impala.
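The split between row-oriented ingest (JSON) and columnar analytical storage (Parquet) is the key design choice here. The sketch below models it in plain Python: records arrive as JSON lines and are pivoted into a column-oriented layout, so an analytical query reads only the columns it needs. The field names and figures are invented for illustration; in the real stack the columnar conversion and SQL access are handled by Spark and Impala, not hand-rolled code.

```python
import json

# Hedged illustration: raw records land in a self-describing row format
# (JSON) and are pivoted to a columnar layout (Parquet in the real stack,
# modelled here as a dict of columns) for analytical scans.

raw = [
    '{"customer_id": "C001", "balance": 1200.5, "segment": "corporate"}',
    '{"customer_id": "C002", "balance": 300.0,  "segment": "sme"}',
]

rows = [json.loads(line) for line in raw]  # row-oriented ingest

def to_columnar(rows):
    """Pivot row dicts into a column-oriented (Parquet-style) layout."""
    return {key: [r[key] for r in rows] for key in rows[0]}

cols = to_columnar(rows)

# An analytical query touches only the columns it needs, which is the
# core I/O saving that makes columnar formats attractive for data marts.
corporate_balance = sum(
    b for b, s in zip(cols["balance"], cols["segment"]) if s == "corporate"
)
print(corporate_balance)  # 1200.5
```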
Machine and deep learning tools such as scikit-learn, Apache Spark MLlib, H2O, TensorFlow, and Keras have been introduced to address data exploration tasks. The Apache Zeppelin and JupyterHub research and visualization tools have also been rolled out, giving data scientists all the necessary Data Lake data and analytics libraries in one place. Users now have access to a unified space holding larger amounts of more diverse data for deeper, more precise customer analysis on the high-performance, scalable Apache Hadoop platform.
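As a flavour of the sandbox work this enables, here is a minimal scikit-learn sketch of the kind of model a data scientist might prototype in a notebook. The features, training data, and threshold are synthetic assumptions for illustration only, not the bank's model.

```python
# Hedged sketch of sandbox model prototyping with scikit-learn.
# Features and labels are synthetic: [overdue_days, debt_to_revenue]
# mapped to a hypothetical default flag.
from sklearn.linear_model import LogisticRegression

X = [[0, 0.1], [5, 0.2], [40, 0.7], [90, 0.9], [2, 0.15], [60, 0.8]]
y = [0, 0, 1, 1, 0, 1]

model = LogisticRegression().fit(X, y)

# Score a new corporate customer: probability of the risky class.
risk = model.predict_proba([[50, 0.75]])[0][1]
print(round(risk, 2))
```

In practice a hypothesis like this would be fitted against Data Lake features in Zeppelin or JupyterHub, then reworked in Spark MLlib or H2O if it needs to scale.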
To deliver the project in three months, the teams adopted a Scrum-based process supported by DevOps tools and principles. VTB Bank and Neoflex experts worked as a single Scrum team under the guidance of a Scrum coach. Close cooperation between the engineers made it possible to create the entire infrastructure needed for software development, deployment, and operation, and to set up pipeline-based continuous delivery of updates (CI/CD).
The creation of the Data Lake made it possible to collect, aggregate, and process data from diverse sources in a unified space. The infrastructure ensures high-performance, high-quality monitoring of credit risk factors for the bank's corporate customers and provides tools for comprehensive data analysis and visualization, as well as for forecasting and developing new models.
Thanks to Hadoop, developing and scaling the solution does not require the capital investment of building a data warehouse with traditional technologies.