Big Data for Development
Datasets keep growing as the world’s population grows and more devices become connected. Traditional data processing software and techniques cannot handle datasets at this scale, so specialized frameworks and tools such as Apache Spark are needed. This course teaches the essentials of processing large-scale datasets using Python.
In addition, the course teaches how to perform common computing tasks, such as managing data and building machine learning models, with Python. This course takes a hands-on approach to equip participants with the most essential tools in a timely manner.
This course emphasizes practice-related learning; as such, it includes many exercises to give participants enough time to practice.
Classes start with the fundamentals of Python, focusing primarily on data structures, then move quickly to the major Python libraries for data science.
Next, the course moves on to big data processing, first presenting brief theoretical concepts on the subject and then teaching Apache Spark, an advanced tool for processing large datasets. Afterwards, it offers introductory machine learning lectures before moving on to a detailed explanation of how to build machine learning models in Python. Throughout, the course promotes learning by doing.
- Understand the advanced concepts of the Python language: data structures, functions, classes, etc.
- Perform computing tasks on data using the Python language: data ingestion, processing, visualization, web retrieval, etc.
- Process large-scale (20 GB+) datasets on a personal computer using Apache Spark, and use cloud computing platforms.
- Familiarize yourself with the theoretical bases of common machine learning algorithms.
- Be able to build and evaluate machine learning models using the ‘scikit-learn’ library.
Day 1: Advanced Concepts in Python. On this first day, the course will focus on the Python programming language to build a solid foundation for the rest of the course materials. Participants will be introduced to practical techniques at an intermediate to advanced level, such as writing functions and classes, error handling, packaging Python code, and more.
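As an illustration of the kind of Day 1 material described above, here is a minimal sketch that combines a function, a class, and error handling. It is not taken from the course materials: the names `Reading`, `ReadingError`, `parse_reading`, and `parse_all`, and the `name=value` input format, are hypothetical and chosen only for this example.

```python
class ReadingError(ValueError):
    """Raised when a raw line cannot be parsed (hypothetical name)."""


class Reading:
    """A single named measurement (hypothetical class for illustration)."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def __repr__(self):
        return f"Reading({self.name!r}, {self.value})"


def parse_reading(text):
    """Parse a 'name=value' string into a Reading, or raise ReadingError."""
    try:
        # Unpacking fails with ValueError if there is not exactly one '='.
        name, raw_value = text.split("=")
        # float() fails with ValueError if the value is not numeric.
        return Reading(name.strip(), float(raw_value))
    except ValueError as exc:
        raise ReadingError(f"cannot parse {text!r}") from exc


def parse_all(lines):
    """Return (readings, rejected) so that bad lines are reported, not lost."""
    readings, rejected = [], []
    for line in lines:
        try:
            readings.append(parse_reading(line))
        except ReadingError:
            rejected.append(line)
    return readings, rejected


readings, rejected = parse_all(["temp=21.5", "bad line", "hum = 40"])
```

Collecting rejected lines instead of stopping at the first error is one possible design choice; for a small script, letting the exception propagate may be just as reasonable.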
Day 2: Python for Data Science: Day 2 focuses on performing common data science tasks using Python. We will explain how to load, process, analyze, and visualize data, perform web scraping, and more, while introducing the essential packages for these tasks (Pandas, GeoPandas, NumPy, Matplotlib, etc.).
Day 3: Big Data Handling: On the third day, the course covers handling large datasets using Python.
Topics include an introduction to Big Data, multiprocessing in Python, Apache Spark, use of common cloud platforms, etc.
Day 4: Machine Learning (ML) in Python. On the fourth day, the course will begin with an introductory lecture on Machine Learning. The remainder of the day will be spent completing various ML tasks (e.g., data preparation, model building, evaluation, and interpretation) using the scikit-learn package in Python.
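The ML tasks named for Day 4 (data preparation, model building, evaluation, and interpretation) can be sketched with scikit-learn roughly as follows. This is only an illustrative sketch under stated assumptions: the Iris dataset bundled with scikit-learn, the logistic regression model, and the 75/25 split are choices made for this example, not necessarily what the course uses.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data preparation: load a small bundled dataset (an assumption for this
# sketch) and hold out 25% of the rows for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Model building: scale the features, then fit a logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluation: accuracy on the held-out rows only.
accuracy = accuracy_score(y_test, model.predict(X_test))

# Interpretation: one row of coefficients per class, one column per feature.
coefficients = model.named_steps["logisticregression"].coef_
```

Wrapping the scaler and the model in one pipeline means the scaling parameters are learned from the training rows only, which keeps the held-out evaluation honest.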
Day 5: Putting it all together: On the last day, we will apply the skills learned in this course to real-world data science problems by examining case studies.
Potential case studies include: how to process nighttime satellite images (geospatial data), how to process large call records from cellphones (mobile data), and how to create ML models to impute missing sensor data (sensor data).
Programming: ability to write a simple program in Python (basic Python level).
Maths and Statistics: training in statistics, data science, or quantitative sciences.