
Week 3 Status Updates

Monday: Let’s Get Started! 😎

Morning Session ⚡

Afternoon Deep Dive 💡

Evening Exploration 🔍


Overall, it was a productive day, heavy on research; I can't wait to start coding 🌟

Tuesday: Building Knowledge 📚

Another productive day full of learning and hands-on experience! 🚀


Wednesday: Midweek Development

Diving deeper into big data and distributed computing technologies.

Morning Knowledge-Sharing Session 📚

Our team gathered for a comprehensive discussion about distributed computing, with valuable insights from everyone:

Big Data Fundamentals

Distributed Computing History 🕰️

Sandeep Sir and Amit Sir guided us through:

HDFS Deep Dive 🗄️

Learned about Hadoop Distributed File System (HDFS):

MapReduce & Resource Management 🔄

Explored the fundamentals of distributed processing:
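
To make the map-and-reduce idea concrete, here is the classic word-count pattern sketched with PySpark's RDD API. This is purely illustrative: Hadoop MapReduce itself is usually written in Java, and `input.txt` is a placeholder path, not a file from the session.

```python
# Word count as a map -> shuffle -> reduce pipeline (illustrative sketch).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

lines = spark.sparkContext.textFile("input.txt")      # placeholder path

counts = (
    lines.flatMap(lambda line: line.split())          # map: emit words
         .map(lambda word: (word, 1))                 # map: (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)             # reduce: sum per word
)
print(counts.take(10))
```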

Data Storage Evolution 📊

Discussed different data storage paradigms:

AWS Billing Resolution Call 📞

Great News: Received a call from AWS Support regarding the unexpected billing issue!

Call Details 💬

Key Points Discussed 🔍

Support Representative’s Response 🤝

Resolution ✨

Evening Project Kickoff 🚀

Python Package Assignment

Started warming up for Amit Sir’s assignment:

Ended the day by laying the groundwork for tomorrow’s development 📦 ✨


Thursday: Implementation Day

Python Package Development Marathon 🏃‍♂️

Project Completion 🎯

Dedicated the entire day to completing the Python package assignment. It was an intensive learning experience!
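
For context on what the packaging side of such an assignment typically involves, here is a minimal, hypothetical `setup.py` sketch; the package name and metadata below are placeholders, not the actual assignment (those details live in the GitHub repo).

```python
# Minimal, hypothetical packaging sketch; "example_pkg" is a placeholder
# name, not the actual assignment package.
from setuptools import setup, find_packages

setup(
    name="example_pkg",
    version="0.1.0",
    description="A small example package",
    packages=find_packages(),      # discovers example_pkg/ and subpackages
    python_requires=">=3.8",
    install_requires=[],           # runtime dependencies would go here
)
```

With an `example_pkg/__init__.py` next to this file, `pip install -e .` installs the package in editable mode for local development and testing.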

Technical Implementation 🛠️

Documentation & Resources 📚

Key Learnings 💡

All the challenges, solutions, and implementation details are thoroughly documented in the GitHub repository. Check out:

A full day of coding, learning, and documentation!


Weekend: PySpark Flight Data Analysis Project

This weekend, I worked on an exciting PySpark project analyzing flight data! Here’s what I learned and accomplished:

Project Overview

Technical Skills Learned

  1. Repartitioning in PySpark
    • Learned how to better organize data across partitions
    • Used repartition() to control how data is split up
    • This helps my queries run faster (see the repartition sketch after this list)
  2. User Defined Functions (UDF)
    • Created my own custom functions in PySpark
    • Learned the proper syntax for UDFs
    • Used them to create new columns and transform data
    • Compared them with native PySpark functions and learned that UDFs are generally slower than the built-ins (see the UDF sketch after this list)
  3. Broadcast Joins
    • Discovered how to make joins more efficient
    • Used broadcast joins when joining large and small tables
    • This really helps speed up my data processing (see the broadcast-join sketch after this list)
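
Here is a minimal sketch of the repartitioning step from point 1. The DataFrame and column names are placeholders, not the actual flight dataset schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-sketch").getOrCreate()

# Placeholder flight data; the real project reads the actual dataset.
flights = spark.createDataFrame(
    [("AA", "JFK", 120), ("DL", "ATL", 95), ("AA", "LAX", 300)],
    ["carrier", "origin", "distance"],
)

print(flights.rdd.getNumPartitions())        # partition count before

# Redistribute into 8 partitions hashed on "carrier", so work grouped
# by carrier is spread more evenly across executors.
flights = flights.repartition(8, "carrier")
print(flights.rdd.getNumPartitions())        # partition count after
```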
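
Next, a sketch of point 2: defining a UDF and contrasting it with the equivalent built-in expression (again with placeholder data). The UDF ships rows out to Python workers and back, which is why it is slower than native functions that stay inside the JVM.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

flights = spark.createDataFrame(
    [("AA", 120), ("DL", 95), ("AA", 300)],
    ["carrier", "distance"],
)

# A Python UDF: rows are serialized out to Python workers (slower).
@F.udf(returnType=StringType())
def distance_bucket(distance):
    return "long" if distance > 200 else "short"

with_udf = flights.withColumn("bucket", distance_bucket("distance"))

# The same logic with built-in functions stays inside the JVM (faster).
with_native = flights.withColumn(
    "bucket", F.when(F.col("distance") > 200, "long").otherwise("short")
)
with_udf.show()
with_native.show()
```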
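
And a sketch of point 3, a broadcast join: the small lookup table is copied to every executor so the large table never needs to be shuffled (placeholder data once more).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-sketch").getOrCreate()

# Large fact table and small dimension table (placeholder data).
flights = spark.createDataFrame(
    [("AA", "JFK", 120), ("DL", "ATL", 95), ("AA", "LAX", 300)],
    ["carrier", "origin", "distance"],
)
carriers = spark.createDataFrame(
    [("AA", "American Airlines"), ("DL", "Delta Air Lines")],
    ["carrier", "carrier_name"],
)

# broadcast() hints Spark to copy the small table to every executor,
# avoiding a shuffle of the large flights table.
joined = flights.join(broadcast(carriers), on="carrier", how="left")
joined.show()
```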

Data Analysis

Project Structure

You can check out my project here: PySpark Flight Data Analysis

This weekend project really helped me understand PySpark better and improved my data analysis skills!


👋 Sayonara! See you next week for more exciting learning adventures! ✨