Exploratory Data Analysis with Python

October 6, 20245 min read

Exploratory Data Analysis (EDA) is a crucial step in the data science process, allowing analysts to uncover patterns, spot anomalies, and test hypotheses. This guide delves into the techniques and tools used in EDA, with a focus on Python's capabilities.

Deep Dive: EDA with Python

Exploratory Data Analysis (EDA) Using Python

A Comprehensive Guide to Data Sanity Checks and EDA

Before venturing into advanced data analysis or machine learning, it's essential to ensure that the data you're working with is clean and coherent. This article outlines the process of conducting Data Sanity checks and Exploratory Data Analysis (EDA), both of which are critical first steps in understanding your dataset.

While the initial stages don't involve modifying the data, these actions help uncover potential issues such as missing values, duplicates, and outliers, while providing valuable insights into the data's structure and relationships.

It's important to start by inspecting the dataset to familiarize yourself with its contents and structure. Understanding the data is key to identifying any problems that might affect your analysis. Once this preliminary examination is complete, you can decide whether data cleaning or preprocessing is needed.

Key Focus Areas

Data sanity checks
Exploratory data analysis
Missing value detection
Outlier identification

Table of Contents

Understanding the Dataset
Data Sanity Checks
Exploratory Data Analysis (EDA)
Automating Univariate and Bivariate Analysis
Plan for Data Cleaning and Preprocessing
Conclusion
EDA FAQ
Summary Checklist
Glossary

Understanding the Dataset

Recently, I worked with a nutritional dataset from Kaggle using Google Colab, which allows for writing and executing Python code in a browser-based Jupyter notebook. The objective was to analyze nutrient patterns across a range of foods and categorize them based on their nutrient content.

Kaggle is a fantastic resource for data scientists and enthusiasts, offering a wealth of datasets across different fields. Whether you're looking to explore healthcare, finance, or more specialized areas like nutrition, Kaggle provides an excellent platform to practice data analysis and machine learning techniques.

Dataset Details: The dataset I explored included nutritional values for various foods and products, detailing their protein, fat, vitamin C, and fiber content. You can find the full code and analysis in my Google Drive folder.

This EDA work was part of the UT Austin AI/ML program and serves as the foundation for a companion article: once the data was cleaned and understood, the next step was applying K-means clustering to discover natural groupings in the foods by nutrient profile. See Exploring Nutritional Data with K-means Clustering for how that analysis unfolded.

Data Sanity Checks

Before diving into EDA, it's crucial to perform data sanity checks to ensure the dataset's integrity:

Check for Missing Values: Identify and handle missing data appropriately.
Check for Duplicates: Remove duplicate entries to avoid skewed results.
Data Types Verification: Ensure data types are correct for each column.
Range Validation: Verify that numerical values fall within expected ranges.
Consistency Checks: Ensure data is consistent across related fields.

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a method used by data scientists to analyze datasets and summarize their main characteristics, often using visual methods. It is a critical step in understanding the data before proceeding with more complex analyses or modeling.

Why EDA?

Identify Patterns: EDA helps in identifying patterns and relationships in data.
Spot Anomalies: It allows for the detection of outliers and anomalies that might skew the analysis.
Hypothesis Testing: EDA provides a foundation for hypothesis testing and further statistical analysis.

Tools and Libraries

Python offers several libraries that are essential for conducting EDA:

Pandas: For data manipulation and analysis.
Matplotlib: For creating static, interactive, and animated visualizations.
Seaborn: Built on top of Matplotlib, it provides a high-level interface for drawing attractive statistical graphics.
NumPy: For numerical data processing.

Visualization Techniques

Histograms: To understand the distribution of a single variable.
Scatter Plots: To identify relationships between two variables.
Box Plots: To visualize the spread and identify outliers.
Heatmaps: To display the correlation between variables.

Automating Univariate and Bivariate Analysis

During your Data Sanity checks, it's essential to classify your variables into numerical, categorical, and dependent types before starting your Exploratory Data Analysis (EDA).

In the early stages of data analysis, you will often need to determine whether your variables are numerical, categorical, or dependent. Identifying these is crucial for:

Performing the correct statistical methods on your data
Automating your exploratory analysis using scripts
Generating meaningful insights into relationships between features

Once these variables are classified, you can begin the process of performing univariate (analyzing one variable) and bivariate (analyzing relationships between two variables) analysis. Automating this process will save you time and ensure consistency in your Exploratory Data Analysis (EDA).

Numerical Univariate Analysis

This function calculates the key statistical attributes for a numerical feature, including mean, median, variance, and skewness. It also provides visual insights using KDE plots, BoxPlots, and Histograms.

def univariate_analysis(df, features):
    for feature in features:
        skewness = df[feature].skew()
        minimum = df[feature].min()
        maximum = df[feature].max()
        mean = df[feature].mean()
        mode = df[feature].mode().values[0]
        unique_count = df[feature].nunique()
        variance = df[feature].var()
        std_dev = df[feature].std()
        percentile_25 = df[feature].quantile(0.25)
        median = df[feature].median()
        percentile_75 = df[feature].quantile(0.75)
        data_range = maximum - minimum

        print(f"Univariate Analysis for {feature}")
        print(f"Skewness: {skewness:.4f}")
        print(f"Min: {minimum}")
        print(f"Max: {maximum}")
        print(f"Mean: {mean:.4f}")
        print(f"Mode: {mode}")
        print(f"Unique Count: {unique_count}")
        print(f"Variance: {variance:.4f}")
        print(f"Std Dev: {std_dev:.4f}")
        print(f"25th Percentile: {percentile_25}")
        print(f"Median (50th Pct): {median}")
        print(f"75th Percentile: {percentile_75}")
        print(f"Range: {data_range}")

        plt.figure(figsize=(18, 6))
        plt.subplot(1, 3, 1)
        sns.kdeplot(df[feature], fill=True)
        plt.title(f"KDE of {feature}")
        plt.subplot(1, 3, 2)
        sns.boxplot(df[feature])
        plt.title(f"Box Plot of {feature}")
        plt.subplot(1, 3, 3)
        sns.histplot(df[feature], bins=10, kde=True)
        plt.title(f"Histogram of {feature}")
        plt.tight_layout()
        plt.show()

This function provides a comprehensive analysis for each numerical feature by calculating statistical attributes and generating KDE, BoxPlot, and Histogram visualizations.

Categorical Univariate Analysis

For categorical features, we analyze the distribution of categories and their relationship with the dependent feature. Here's a function that automates this process:

def univariate_analysis_categorical(df, categorical_features):
    for feature in categorical_features:
        unique_categories = df[feature].nunique()
        mode = df[feature].mode().values[0]
        mode_freq = df[feature].value_counts().max()
        category_counts = df[feature].value_counts()
        category_percent = df[feature].value_counts(normalize=True) * 100
        missing_values = df[feature].isnull().sum()
        total_values = len(df[feature])
        imbalance_ratio = category_counts.max() / total_values

        print(f"Univariate Analysis for {feature}")
        print(f"Unique Categories: {unique_categories}")
        print(f"Mode (Most frequent): {mode}")
        print(f"Frequency of Mode: {mode_freq}")
        print(f"Missing Values: {missing_values}")
        print(f"Imbalance Ratio (Max/Total): {imbalance_ratio:.4f}")
        print(f"Category Counts:\n{category_counts}")

        plt.figure(figsize=(10, 6))
        sns.countplot(x=df[feature], order=df[feature].value_counts().index)
        plt.title(f"Frequency of {feature} Categories")
        plt.xlabel(feature)
        plt.ylabel("Count")
        plt.tight_layout()
        plt.show()

This function provides a clear understanding of how categories are distributed across the data and helps identify potential imbalances.

Automating Bivariate Analysis

Bivariate analysis allows you to understand the relationship between two variables. The following function calculates key attributes for a numerical feature in relation to a boolean dependent feature. It prints out key insights and generates side-by-side visualizations to understand their relationship.

def bivariate_analysis(df, numerical_features, categorical_features, dependent_feature):
    if numerical_features:
        for feature in numerical_features:
            mean_0 = df[df[dependent_feature] == 0][feature].mean()
            mean_1 = df[df[dependent_feature] == 1][feature].mean()
            median_0 = df[df[dependent_feature] == 0][feature].median()
            median_1 = df[df[dependent_feature] == 1][feature].median()
            var_0 = df[df[dependent_feature] == 0][feature].var()
            var_1 = df[df[dependent_feature] == 1][feature].var()

            print(f"Mean {feature} for group 0: {mean_0:.2f}")
            print(f"Mean {feature} for group 1: {mean_1:.2f}")
            print(f"Median {feature} for group 0: {median_0:.2f}")
            print(f"Median {feature} for group 1: {median_1:.2f}")
            print(f"Variance of {feature} for group 0: {var_0:.2f}")
            print(f"Variance of {feature} for group 1: {var_1:.2f}")

            fig, axes = plt.subplots(1, 2, figsize=(16, 6))
            sns.boxplot(x=df[dependent_feature], y=df[feature], ax=axes[0])
            axes[0].set_title(f"{feature} Distribution by {dependent_feature}")
            sns.barplot(x=df[dependent_feature], y=df[feature], estimator='mean', ax=axes[1])
            axes[1].set_title(f"Mean {feature} by {dependent_feature}")
            plt.tight_layout()
            plt.show()

    if categorical_features:
        for feature in categorical_features:
            category_distribution = df.groupby([feature, dependent_feature]).size().unstack(fill_value=0)
            chi2, p, dof, expected = chi2_contingency(category_distribution)

            print(f"Chi-Square Test for {feature}: Chi2 = {chi2:.4f}, p-value = {p:.4f}")
            fig, axes = plt.subplots(1, 2, figsize=(16, 6))
            sns.countplot(x=df[feature], hue=df[dependent_feature], ax=axes[0])
            axes[0].set_title(f"{feature} Count by {dependent_feature}")
            sns.barplot(x=df[feature], y=df[dependent_feature], estimator='mean', ax=axes[1])
            axes[1].set_title(f"Proportion of {dependent_feature} by {feature}")
            plt.tight_layout()
            plt.show()

This function performs bivariate analysis by calculating key attributes and generating box plots, bar plots, and count plots to help you better understand the relationship between variables.

Plan for Data Cleaning and Preprocessing

By conducting thorough data sanity checks and EDA, we lay a strong foundation for further analysis. With a clear understanding of the data, the next steps could include feature engineering, advanced visualizations, or machine learning.

Conclusion

Exploratory Data Analysis is an indispensable part of the data analysis process. By leveraging Python's robust libraries, analysts can gain deep insights into their data, paving the way for more informed decision-making.

As you continue your journey in data science, remember that EDA is not just a preliminary step but a continuous process of discovery. Utilize Python's tools to enhance your analytical capabilities and drive impactful insights.

The Ongoing Practice of EDA

What I keep coming back to with EDA is that it's never really a finished step. Even after you've moved into modeling or analysis, the exploratory mindset — questioning assumptions, looking for patterns that don't fit — continues to pay off. Python's ecosystem makes this particularly natural, with libraries like pandas and seaborn lowering the barrier between having a question and getting a visual answer.

The real value of EDA isn't the charts and statistics themselves. It's the judgment you develop about which questions to ask next, and which anomalies deserve investigation rather than dismissal.

EDA FAQ

What is Exploratory Data Analysis (EDA)?

EDA is the process of analyzing datasets to summarize their main characteristics, often using visual methods.

Why is EDA important?

EDA helps you understand your data, detect anomalies, test assumptions, and prepare for modeling.

What are common EDA techniques?

Common techniques include summary statistics, visualizations (histograms, boxplots, scatter plots), and correlation analysis.

How do I handle missing data in EDA?

Identify missing values, then decide whether to remove, impute, or flag them based on context.

What Python libraries are best for EDA?

pandas, matplotlib, seaborn, and missingno are popular libraries for EDA in Python.

Summary Checklist

✅ Data sanity checks and EDA steps explained
✅ Complete Python code samples for automated analysis
✅ Univariate and bivariate analysis functions provided
✅ Visual techniques and statistical methods covered
✅ Table of Contents, FAQ, and Glossary included
✅ Real-world dataset example with Google Colab workflow

Glossary

EDA (Exploratory Data Analysis)

Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their main characteristics, often using visual methods.

For more, see the Wikipedia article on EDA.

Outlier

An outlier is a data point that differs significantly from other observations in a dataset.

Learn more at Wikipedia: Outlier.

IQR (Interquartile Range)

The interquartile range (IQR) is a measure of statistical dispersion, being equal to the difference between the upper and lower quartiles.

See Wikipedia: Interquartile range for details.

Skewness

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.

More info: Wikipedia: Skewness.

Pandas

Pandas is a powerful open-source Python library for data manipulation and analysis, providing flexible data structures like DataFrames.

See Wikipedia: Pandas (software).

Explore More Data Science Articles

Related posts

Harnessing NLP: Concepts and Real-World Impact

Harnessing NLP: Concepts and Real-World Impact

A deep exploration of Natural Language Processing—its core techniques, the distinction between NLP and LLMs, real-world applications across industries, and a timeline of key milestones from the Turing Test to GPT-4.

Jan 26, 202515 min

Computer Vision in Machine Learning

Computer Vision in Machine Learning

Computer vision is revolutionizing industries by enabling machines to interpret visual data. This article explores its applications, challenges, and future potential.

Jan 21, 20255 min

Understanding Neural Networks

Understanding Neural Networks

Neural networks are a cornerstone of modern artificial intelligence, mimicking the way human brains operate to process information. This guide aims to introduce the basic concepts of neural networks, their architecture, and their applications.

Dec 23, 20245 min