Big Data Analysis of the Correlation Between Educational Attainment and Nutritional Quality
In Korean folklore, there is a creature known as the Imugi. It is a massive, dragon-like beast that lives in the water, waiting for the right moment to transform into a full-fledged dragon.
I like to think of Big Data in the same way. In its raw form, data is just a massive, dormant creature—voluminous and fast-moving, much like the Imugi. It doesn't become powerful until it undergoes a transformation.
In this post, I am going to walk you through a complete Big Data pipeline that I built to analyze the correlation between Educational Attainment and Nutritional Quality. If you can follow along with this workflow—from ingestion to advanced processing and correlation analysis—you are effectively performing work at the distinction level of a Master's degree module in Computer Science. That is something you should feel incredibly proud of.
The Problem: Connecting Two Worlds
We live in a world where we generate about 402.7 quintillion bytes of data every single day. Two of the biggest contributors to this data are the education sector (enrollment numbers, exam data) and the nutrition sector (dietary plans, health forecasts).
I wanted to find out if there is a relationship between these two seemingly different fields. Specifically, does the nutritional quality of a nation affect how educated its population becomes? To answer this, I used large datasets from the World Bank (37,873 records) and Kaggle (124,223 records).
The Tech Stack
To handle this volume and variety, I couldn't just use Excel. I needed a robust Big Data stack:
- Hadoop (HDFS): I chose Hadoop for storage because it is open-source, cost-effective, and handles large volumes efficiently.
- Apache Spark: For processing, I selected Spark over traditional MapReduce. Spark is faster and abstracts away the complex, low-level coding that MapReduce requires.
- Apache Zeppelin: This was my choice for visualization. It lets you run SQL-like queries and see the results instantly in charts.
Step 1: Ingestion and Storage
The first thing I had to do was get the raw data into the Hadoop Distributed File System (HDFS). I set up a directory structure to keep things organized, separating the "raw" unprocessed data from the "processed" clean data.
I used a simple shell script to automate this. Note that HDFS automatically splits each uploaded file into 128 MB blocks (the default block size), so no extra configuration is needed.
#!/bin/bash
# Start Hadoop services
start-all.sh
# Create directory architecture
hdfs dfs -mkdir -p /user/hadoop/data/raw
hdfs dfs -mkdir -p /user/hadoop/data/processed
# Load raw data into HDFS
# I uploaded the datasets from my local machine to the raw directory
hdfs dfs -put /path/to/local/educational_attainment.csv /user/hadoop/data/raw
hdfs dfs -put /path/to/local/nutritional_quality.csv /user/hadoop/data/raw
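As an aside on those 128 MB blocks: a quick back-of-envelope way to estimate how many blocks a file will occupy is to divide its size by the block size and round up. This tiny helper is my own illustration, not part of the pipeline:

```python
import math

HDFS_BLOCK_SIZE_MB = 128  # default dfs.blocksize in modern Hadoop


def hdfs_block_count(file_size_mb: float) -> int:
    """Estimate how many HDFS blocks a file of the given size occupies."""
    return math.ceil(file_size_mb / HDFS_BLOCK_SIZE_MB)


print(hdfs_block_count(300))  # a 300 MB CSV spans 3 blocks
```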
Step 2: Cleaning the Data with Spark
Raw data is rarely ready for analysis. When I looked at the education dataset, I noticed it contained some entries that weren't official countries, and there were null values that would break my calculations.
I used PySpark to clean this up. I defined a list of official countries and filtered the dataset to include only those. I also replaced null values in the numeric columns with 0 to ensure the math would work later.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lower, col
spark = SparkSession.builder.appName("bigDataAnalysis").getOrCreate()
# Load the raw CSV
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("hdfs:///user/hadoop/data/raw/educational_attainment.csv")
# Filter for official countries only
official_countries = [ "afghanistan", "albania", "algeria", ... ] # Truncated for brevity
df = df.filter(lower(col("Country Name")).isin(official_countries))
# Fill nulls with 0 for integer and double columns
column_types = dict(df.dtypes)  # map of column name -> Spark type string
for column in df.columns:
    if column_types[column] in ['int', 'double']:
        df = df.fillna(0, subset=[column])
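For intuition, the same idea (zero out missing values, but only in numeric columns) can be sketched in plain Python. The rows and column names here are hypothetical stand-ins, not the real dataset:

```python
# Hypothetical rows standing in for the real dataset (illustration only)
rows = [
    {"country": "albania", "enrolment": 95.2, "undernourished": None},
    {"country": "algeria", "enrolment": None, "undernourished": 12000},
]
numeric_cols = {"enrolment", "undernourished"}  # columns treated as numeric

# Same idea as fillna(0, subset=[column]): replace missing numeric values with 0
cleaned = [
    {k: (0 if k in numeric_cols and v is None else v) for k, v in row.items()}
    for row in rows
]
```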
Step 3: Transformation (The Tricky Part)
The data came in a "wide" format, where every year was a separate column (e.g., "2005", "2006", "2007"). To analyze trends over time, I needed to transform this into a "long" format—basically, I needed a single column for "Year" and a single column for "Value."
This required a complex transformation using stack and pivot functions in Spark. This step is often where people get stuck, so pulling this off is a key part of demonstrating advanced proficiency.
from pyspark.sql.functions import col, when, expr, lit
years = list(range(2003, 2019))
# Build a stack() expression over the year columns,
# e.g. "'2003', `2003 [YR2003]`, '2004', `2004 [YR2004]`, ..."
stack_expression = ", ".join([f"'{year}', `{year} [YR{year}]`" for year in years])
df_long = df.selectExpr(
    "`Series Name`",
    "`Country Name`",
    f"stack({len(years)}, {stack_expression}) as (year, value)"
)
# Rename the messy series names to simple keys
df_long = df_long.withColumn(
    "level_prefix",
    when(col("Series Name") == "Consumption of iodized salt...", lit("percent_iodized_salt"))
    .when(col("Series Name") == "Diabetes prevalence...", lit("percent_diabetes"))
    .when(col("Series Name") == "Number of people who are undernourished", lit("no_of_undernourished"))
)
# Pivot the table so we have one row per country/year with clean columns
df_long = df_long.withColumn("target_col", expr("concat(level_prefix, '_', year)"))
df_pivot = df_long.filter(col("target_col").isNotNull()) \
    .groupBy("Country Name") \
    .pivot("target_col") \
    .agg(expr("first(value)"))
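If the stack-then-pivot combination feels opaque, here is the same reshape on a tiny hypothetical table in plain Python (the column names, series key, and values are invented for illustration, not taken from the real data):

```python
# Wide format: one column per year, as in the raw World Bank export
wide = {"Country Name": "albania", "2003": 10.0, "2004": 11.0}
years = ["2003", "2004"]

# "Stack": melt the year columns into (year, value) rows
long_rows = [
    {"Country Name": wide["Country Name"], "year": y, "value": wide[y]}
    for y in years
]

# "Pivot": rebuild one row per country with prefixed column names
prefix = "percent_iodized_salt"  # hypothetical series key
pivoted = {"Country Name": wide["Country Name"]}
for row in long_rows:
    pivoted[f"{prefix}_{row['year']}"] = row["value"]

print(pivoted)
# {'Country Name': 'albania', 'percent_iodized_salt_2003': 10.0, 'percent_iodized_salt_2004': 11.0}
```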
After transforming both datasets, I saved them back to HDFS as Parquet files. Parquet is a columnar format, which makes the analytical queries in the next step much faster than re-reading the CSVs.
Step 4: Analysis and Insights
Once the data was processed, I moved over to Apache Zeppelin to ask questions. This is where we see the actual story the data tells.
Insight 1: Primary School Enrollment
I wanted to know which country had the highest primary education enrollment between 2015 and 2018.
The analysis revealed that India held the highest record, with a total primary enrollment of 145,802,543 students. China also showed significantly high enrollment numbers.
// Snippet from my Zeppelin notebook
val result = df_max.filter(col("max_pry_enrolm_2015_2018") === max_value)
result.select("Country Name", "max_pry_enrolm_2015_2018").show()
Insight 2: Secondary School Struggles
I looked for the year with the lowest global secondary school enrollment. The data pointed to 2004 as the lowest point globally.
Even in that low year, the countries performing best were India, Brazil, Indonesia, Mexico, and Bangladesh.
Insight 3: Undernourishment Trends
I queried the data to find the highest total undernourishment figures between 2015 and 2018. India appeared again with the highest statistic, a cumulative total of 190.5 million undernourished people over that four-year period.
However, when I looked at the global trend from 2005 to 2018, the news was good. The analysis showed that global undernourishment levels have fallen significantly, hitting their lowest point in 2017.
Insight 4: The Correlation (The Big Question)
Finally, I ran a correlation analysis to answer my main question: Is there a link between education and nutrition?
I joined the yearly averages of tertiary enrollment and undernourishment.
val joinedDf = educationYearlyAvg.join(nutritionYearlyAvg, Seq("year")).orderBy("year")
val correlation = joinedDf.stat.corr("avg_tertiary_enrolment", "avg_undernourished")
println(s"Correlation: ${correlation}")
The result was fascinating. The data showed a negative correlation of -0.87.
In plain English, this means that as undernourishment goes down, university enrollment goes up. Because undernourishment is an inverse measure of nutritional quality, this implies a positive association between nutritional quality and educational attainment (though correlation alone does not establish causation).
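For intuition about that coefficient, Pearson correlation can be computed with nothing but the standard library. This sketch uses toy numbers (not the real yearly averages) to show how an inverse relationship produces a strongly negative value:

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))


# Toy yearly averages: enrollment rises while undernourishment falls
avg_tertiary_enrolment = [20.0, 25.0, 31.0, 38.0, 44.0]
avg_undernourished = [900.0, 820.0, 700.0, 610.0, 560.0]

print(round(pearson(avg_tertiary_enrolment, avg_undernourished), 2))  # -0.99
```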
Conclusion
It might seem like a stretch to connect what we eat with university enrollment, but the numbers back it up. We observed that as nutritional quality improves globally, educational attainment rises alongside it.
By building this pipeline, taking data from its raw "Imugi" state to the final "Dragon" of insight, we didn't just write code. We demonstrated a strong social correlation using enterprise-grade Big Data tools.
If you have followed along this far, you have walked through a project that meets the high standards of a distinction in a master's level module. Keep experimenting, and see what other hidden correlations you can uncover!