Introduction

Our client from Singapore was building a Stock Prediction Platform that collects data from various third-party services, such as stock data from the Quandl Stocks Service and the latest news about multiple companies from the IBM Watson Discovery Service.

These Data Collection Services ran on their on-premises VMs and wrote data to Ceph Object Storage. Airflow, deployed on one of the VMs, scheduled Spark jobs that picked up new data files from Ceph, ran the transformations, and stored the data in Hive. Hadoop was also set up on their on-premises VMs.

A REST API and SDKs were then built to give Data Scientists access to the Hive Data Warehouse. Their prediction algorithms ran on TensorFlow and persisted the results to MySQL.

A Stock Prediction Dashboard was built on top of MySQL by consuming the REST APIs. The client wanted to migrate this entire technology stack to Google Cloud, so we started working on it in collaboration with the client's team.

Technology Stack

  • Node.js-based Data Collection Services (on Google Compute Engine)
  • Google Cloud Storage-based Data Lake (storing the raw data coming from the Data Collection Services)
  • Apache Airflow (configuration and scheduling of the Data Pipeline that runs the Spark transformation jobs)
  • Apache Spark on Cloud DataProc (transforming raw data into structured data)
  • Hive Data Warehouse on Cloud DataProc
  • Play Framework in Scala (REST API)
  • Python-based SDKs

Solution Offered

Steps used to build this Platform

In collaboration with the client's team, we worked through their requirements, such as the various data sources and data pipelines, for migrating the Platform from on-premises to Google Cloud Platform.

Data Collection Services on Google Compute Engines

We migrated all of their Data Collection Services, the REST API, and other background services to Google Compute Engine (VMs).

Updating the Data Collection Jobs to write data to Google Cloud Storage buckets

The Data Collection Jobs were developed in Node.js and wrote data to Ceph Object Storage, which served as the Data Lake. Our Node.js developers updated the existing code to write the data to Google Cloud Storage buckets instead, so Google Cloud Storage became the new Data Lake.
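For illustration, the snippet below is a minimal sketch of that upload pattern using the google-cloud-storage Python client; the actual collection jobs are written in Node.js, and the bucket and object names here are hypothetical.

```python
from google.cloud import storage

def upload_raw_file(bucket_name: str, local_path: str, destination_blob: str) -> None:
    """Upload one raw data file from a collection job into the GCS data lake."""
    client = storage.Client()              # uses application default credentials
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(destination_blob)
    blob.upload_from_filename(local_path)  # single-request upload of the local file

# Example: push a freshly collected Quandl extract into a (hypothetical) raw zone
upload_raw_file("stock-platform-raw", "/tmp/quandl_2024-01-01.json",
                "quandl/2024-01-01/quandl_2024-01-01.json")
```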

Using Apache Airflow to build Data Pipelines and building the Data Warehouse with Hive and Spark

The client had already developed a set of Spark jobs that run every 3 hours, check for new files in the Data Lake (Google Cloud Storage buckets), run the transformations, and store the data in the Hive Data Warehouse. We migrated their Airflow Data Pipelines to Google Compute Engine, migrated Hive on HDFS, and used a Cloud DataProc cluster for Spark and Hadoop.
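As a rough sketch of what such a pipeline looks like, the DAG below submits a Spark transformation job to a Dataproc cluster on a 3-hour schedule using the Airflow Google provider's DataprocSubmitJobOperator; the project, region, cluster, jar, and class names are placeholders, not the client's actual configuration.

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# Placeholder identifiers -- the real project, cluster, and job artifacts differ.
PROJECT_ID = "stock-platform-project"
REGION = "asia-southeast1"
CLUSTER_NAME = "stock-dataproc-cluster"

SPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "spark_job": {
        "main_class": "com.example.TransformRawToHive",  # hypothetical transformation entry point
        "jar_file_uris": ["gs://stock-platform-artifacts/transformations.jar"],
    },
}

with DAG(
    dag_id="raw_to_hive_every_3_hours",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 */3 * * *",  # every 3 hours, matching the original pipeline cadence
    catchup=False,
) as dag:
    # Picks up new files in the GCS data lake and loads structured data into Hive.
    transform = DataprocSubmitJobOperator(
        task_id="submit_spark_transformation",
        project_id=PROJECT_ID,
        region=REGION,
        job=SPARK_JOB,
    )
```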

Migrating REST APIs to Google Compute Engine instances

The REST API, which served prediction results to the Dashboard and also acted as the Data Access Layer for Data Scientists, was likewise migrated to Google Compute Engine instances (VMs).
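To give a sense of how Data Scientists consume this layer, the fragment below sketches a thin Python SDK wrapper around the REST API; the base URL, endpoint path, parameters, and response shape are assumptions for illustration, not the client's actual API.

```python
import requests

class WarehouseClient:
    """Minimal sketch of a Python SDK wrapper over the data-access REST API.

    The base URL, endpoint paths, and query parameters are hypothetical.
    """

    def __init__(self, base_url: str, api_token: str):
        self.base_url = base_url.rstrip("/")
        self.session = requests.Session()
        self.session.headers.update({"Authorization": f"Bearer {api_token}"})

    def query_prices(self, symbol: str, start_date: str, end_date: str) -> list:
        # Delegates to the REST API, which in turn reads from the Hive Data Warehouse.
        resp = self.session.get(
            f"{self.base_url}/v1/prices",
            params={"symbol": symbol, "from": start_date, "to": end_date},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

# Example usage with a placeholder host and token
client = WarehouseClient("http://internal-api.example.com", "TOKEN")
rows = client.query_prices("GOOG", "2024-01-01", "2024-01-31")
```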
