Reddit Cloud Batch ELT

Tech Stack: Python, AWS (S3, Redshift), Docker, dbt, Google Data Studio, Apache Airflow, Reddit API
GitHub URL: Project Link

  • Built an end-to-end ELT pipeline that extracts over 10,000 posts from the past 24 hours via Reddit's API, transforms the data, and loads it into AWS S3 and Redshift.
  • Orchestrated the pipeline with Apache Airflow and deployed it with Docker so that it runs automatically, processing an average of 500 posts/minute and reducing manual intervention by 80%.
  • Used dbt to connect to the data warehouse and run SQL transformations and data models, producing optimized, enriched data sets ready for analysis and visualization.
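
The extract-and-stage step described in the bullets could look roughly like the following. This is a minimal sketch using only the Python standard library, not the project's actual code: the field list, the function names (recent_posts, to_csv, s3_key), and the object-key layout are all assumptions made for illustration, and the Reddit API call and S3 upload are left out.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

# Hypothetical subset of Reddit post fields kept for the warehouse.
FIELDS = ["id", "title", "score", "num_comments", "created_utc"]


def recent_posts(posts, now, window_hours=24):
    """Keep posts created within the last `window_hours` before `now`."""
    cutoff = (now - timedelta(hours=window_hours)).timestamp()
    return [p for p in posts if p["created_utc"] >= cutoff]


def to_csv(posts):
    """Serialize posts to CSV text, converting epoch seconds to ISO 8601 UTC."""
    buf = io.StringIO()
    # extrasaction="ignore" drops any post fields not listed in FIELDS.
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for post in posts:
        row = dict(post)
        row["created_utc"] = datetime.fromtimestamp(
            post["created_utc"], tz=timezone.utc
        ).isoformat()
        writer.writerow(row)
    return buf.getvalue()


def s3_key(run_date):
    """Hypothetical date-partitioned object key for the staged file."""
    return f"reddit/posts/{run_date:%Y-%m-%d}.csv"


if __name__ == "__main__":
    # Made-up sample data standing in for an API response.
    now = datetime(2024, 1, 2, tzinfo=timezone.utc)
    sample = [
        {"id": "a1", "title": "New post", "score": 10, "num_comments": 2,
         "created_utc": now.timestamp() - 3600, "author": "someone"},
        {"id": "b2", "title": "Old post", "score": 5, "num_comments": 0,
         "created_utc": now.timestamp() - 48 * 3600, "author": "someone"},
    ]
    print(s3_key(now))
    print(to_csv(recent_posts(sample, now)))
```

In a setup like the one described, an Airflow task would typically call functions of this kind once per scheduled run, upload the resulting file to S3, and leave the heavier modeling to dbt inside the warehouse.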