Job Description
Responsibilities:
Design and develop ETL integration patterns using Python on Spark.
Develop a framework for converting existing PowerCenter mappings to PySpark (Python on Spark) jobs.
Create a PySpark framework to bring data from databases such as DB2, DynamoDB, Cosmos DB, and SQL Server to Amazon S3 (see the ingestion sketch after this list).
Translate business requirements into maintainable software components and understand their technical and business impact.
Provide guidance to the development team working on PySpark as the ETL platform.
Ensure that quality standards are defined and met.
Optimize PySpark jobs to run on a Kubernetes cluster for faster data processing (see the Kubernetes tuning sketch after this list).
Provide workload estimates to the client.
Develop a framework for Behaviour-Driven Development (BDD); see the BDD step sketch after this list.
Migrate on-premises Informatica ETL processes to the AWS cloud and Snowflake.
Implement a CI/CD (Continuous Integration and Continuous Delivery) pipeline for code deployment.
Acquire data from internal and external data sources.
Create and maintain optimal data pipeline architecture (see the Airflow orchestration sketch after this list).
Identify, design, and implement internal process improvements: automating manual processes, optimizing data delivery, and re-designing infrastructure for greater scalability.
Build the infrastructure required for optimal extraction, transformation, and loading (ETL) of data from a wide variety of sources such as Salesforce, SQL Server, Oracle, and SAP using Azure, Spark, Python, Hive, Kafka, and other big data technologies (see the Kafka streaming sketch after this list).
Perform data QA/QC on data transfers into the data lake or data warehouse (see the QA checks sketch after this list).
Build analytics tools that utilize the data pipeline to provide actionable insights into
customer acquisition, operational efficiency and other key business performance
metrics.
Review components developed by team members.
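The sketches below are illustrative only; hosts, credentials, table names, bucket paths, and sizing values are placeholders rather than details taken from this posting. First, for the database-to-S3 ingestion responsibility, a minimal PySpark job that reads one DB2 table over JDBC and lands it in S3 as Parquet (the DB2 JDBC driver is assumed to be on the Spark classpath):

    from pyspark.sql import SparkSession

    # Minimal sketch: pull one table from DB2 over JDBC and land it in S3 as Parquet.
    # Host, credentials, table, and bucket names are placeholders.
    spark = SparkSession.builder.appName("db2-to-s3-ingest").getOrCreate()

    source_df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:db2://db2-host:50000/SAMPLEDB")
        .option("dbtable", "SCHEMA.ORDERS")
        .option("user", "etl_user")
        .option("password", "change-me")
        .option("driver", "com.ibm.db2.jcc.DB2Driver")
        .load()
    )

    # Partition by a date column so downstream jobs can prune what they read.
    (
        source_df.write.mode("overwrite")
        .partitionBy("ORDER_DATE")
        .parquet("s3a://example-raw-bucket/db2/orders/")
    )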
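For the Kubernetes optimization responsibility, a sketch of the kind of tuning options a PySpark job might set when targeting a Kubernetes cluster; the master URL, container image, and sizing values are assumptions that depend on the actual cluster and workload:

    from pyspark.sql import SparkSession

    # Illustrative tuning knobs for a PySpark job submitted to a Kubernetes cluster.
    spark = (
        SparkSession.builder.appName("etl-on-k8s")
        .master("k8s://https://kubernetes.example.com:6443")
        .config("spark.kubernetes.container.image", "registry.example.com/etl/pyspark:latest")
        .config("spark.executor.instances", "8")
        .config("spark.executor.cores", "4")
        .config("spark.executor.memory", "8g")
        .config("spark.sql.shuffle.partitions", "256")  # size to the data volume
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
        .getOrCreate()
    )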
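For the BDD framework responsibility, one common Python approach (an assumption, as the posting does not name a tool) is Gherkin scenarios executed by behave. A minimal step-definition sketch:

    # steps/row_count_steps.py -- behave step definitions (illustrative).
    # Assumes a feature file along the lines of:
    #   Scenario: Loaded table matches the source row count
    #     Given the source table has 100 rows
    #     When the ETL job loads the table
    #     Then the target table has 100 rows
    from behave import given, when, then

    @given("the source table has {count:d} rows")
    def step_source_rows(context, count):
        context.source_count = count

    @when("the ETL job loads the table")
    def step_run_etl(context):
        # Placeholder: a real suite would trigger the PySpark job under test here.
        context.target_count = context.source_count

    @then("the target table has {count:d} rows")
    def step_check_rows(context, count):
        assert context.target_count == count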
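For the pipeline-architecture responsibility, and since Airflow appears in the technology list, a minimal DAG sketch that chains an ingest task and a validation task (task bodies, DAG id, and schedule are placeholders; Airflow 2.4+ syntax assumed):

    # Minimal Airflow DAG chaining an ingest task and a validation task.
    # The callables are placeholders for the real PySpark job triggers.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def ingest():
        print("trigger the PySpark ingestion job here")

    def validate():
        print("run the data QA checks here")

    with DAG(
        dag_id="example_etl_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
        validate_task = PythonOperator(task_id="validate", python_callable=validate)
        ingest_task >> validate_task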
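For the multi-source ETL responsibility, a minimal Structured Streaming sketch that reads a Kafka topic and appends it to the data lake; broker addresses, topic, paths, and the availability of the spark-sql-kafka package are assumptions:

    from pyspark.sql import SparkSession

    # Illustrative Structured Streaming job: read a Kafka topic and append it to
    # the data lake as Parquet. Brokers, topic, and paths are placeholders.
    spark = SparkSession.builder.appName("kafka-to-lake").getOrCreate()

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
        .option("subscribe", "sales-events")
        .option("startingOffsets", "latest")
        .load()
        .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
    )

    query = (
        events.writeStream.format("parquet")
        .option("path", "s3a://example-raw-bucket/kafka/sales-events/")
        .option("checkpointLocation", "s3a://example-raw-bucket/_checkpoints/sales-events/")
        .trigger(processingTime="1 minute")
        .start()
    )
    query.awaitTermination()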
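For the data QA/QC responsibility, a simple starting point is row-count reconciliation plus a null check on the business key after a load; the paths and the ORDER_ID column below are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Illustrative post-load checks: row-count reconciliation plus a null check
    # on the business key. Paths and column names are placeholders.
    spark = SparkSession.builder.appName("load-qa-checks").getOrCreate()

    source_df = spark.read.parquet("s3a://example-raw-bucket/db2/orders/")
    target_df = spark.read.parquet("s3a://example-curated-bucket/orders/")

    source_count = source_df.count()
    target_count = target_df.count()
    assert source_count == target_count, (
        f"row count mismatch: source={source_count}, target={target_count}"
    )

    null_keys = target_df.filter(F.col("ORDER_ID").isNull()).count()
    assert null_keys == 0, f"{null_keys} rows loaded with a null ORDER_ID"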
Technologies:
AWS Cloud, S3, EC2, PostgreSQL, Spark, Python 3.6, Big Data, Snowflake, Hadoop,
Kubernetes, Docker, Airflow, Splunk, DB2, CI/CD, HDFS, MapReduce,
Hive, Kafka, ETL, Oozie