Note from the Editor:
In the old days of 2013, the OSDSM was born. Then, there were “little to no Data Scientists with 5 years experience, because the job simply did not exist.” (David Hardtke, Nov 2012) Since then, history has witnessed many things, including:
• Data Scientists working across industries and the world
• social media manipulation disrupting many elections
• BLM and #metoo and Extinction Rebellion and many other social movements
• machine learning beginning to fall under the engineering domain
• a pandemic
• climate change disasters becoming very frequent while climate warms faster than predicted
• remote work becoming common
• multiple global recession shocks
In that decade, Data Science has seen growth of jobs, shortfall of goals, success in many industries, abject failure in others, and nefarious use cases. In particular, adverse consequences and complications of learning from data appear in too many examples: elections undermined by psychographics, dismal gender (Men=74%) and BIPOC diversity in the AI field, a revived eugenics, an explainability crisis, facial recognition used to identify people and systematically detain them, “aggression” detection microphones in schools, and many others. It has never been more clear that we need to talk about the real world impacts of our work, and consider how our creations are used. As you consider this, read a prescient novel that grapples with the consequences of birthing, of creation, of technology.
Like any tool, data-driven technologies are indifferent to the morality of their ends. Perhaps the greatest risk of all is leaving this tool in the hands of the few expensively-educated people who cannot possibly represent all of us. To balance this, open source movements seek to lower the barriers to education for everyone. Data science and data literacy must be widespread, accessible, and leveraged for building our collective future. More than ever, we need that future to be built by members of society who are diverse and focused on generative, sustainable, resilient, emergent solutions. After all, the things we build are mirrors of ourselves (seriously, read Shelley’s Frankenstein).
Computers reflect the biases and belief systems of the people programming them -@alicegoldfuss
The OSDSM is built with the belief that open source education makes a diverse, collective, generative future-building possible. I hope that you are one of the next people – whether you call yourself a Data Scientist or not – to help make better decisions with the scientific process, critical thinking, and everything else your unique perspective brings to the table. This rewritten curriculum focuses on what is needed to be successful in the entry-level role, but that is just a generic outline; truly, I hope where you take it extends far beyond that.
Start here 👇
The open-source curriculum for learning to be a Data Scientist. Curriculum resources from both universities and working Data Scientists focus on foundational theory and applied skills. The OSDSM is collectively maintained and open to PRs.
The goal of this curriculum is to prepare the student for an entry-level Data Scientist role using open source materials, at no cost but with the same caliber of materials found in the most reputable paid programs. Books not offered for free are often available through a public library; they are listed here with their current list price. The Masters is self-guided and self-accredited. To better support credibility, the structure now includes a Capstone project intended to demonstrate the student’s problem-solving approach, skills in execution, and communication. Upon completion, students can award themselves a Credential on LinkedIn from the Open Source Data Science Masters. As with all things, the OSDSM is best played as a team sport (try finding people on r/learndatascience).
This is called a “Masters” because it is primarily concerned with “upper-level” college course material in mathematics, programming, economics, or related disciplines. Come as you are!
This is a critical foundation for what is to come; don’t skip!
One could argue that “Data Science” is a recent term for an already existing information analysis discipline. Humans instinctually search for patterns, a purpose we also see in this more digitized discipline. Read different sources (and search beyond this list) about the uses of data science.
$18– Narrated cases of Data Science at play in the real world.
$17– From the inside of OKCupid, real examples of how data science can illustrate human behavior.
When there are no answers in the back of the book, how do you proceed? Breaking down problems is a skill, one that can and should be learned. Follow Pólya’s process, and for extra credit, seek out resources on computer science decomposition.
It is crucial as a Data Scientist that you show integrity in and transparency of scientific process. Even if you’ve been here before, review and draw out the process diagram for the scientific method.
Get familiar and comfortable with manipulating data in a database with a common relational querying language. There are diverse query languages, but SQL is a widely used foundation.
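To get a feel for relational querying before picking a course, here is a minimal sketch using Python's built-in `sqlite3` module. The `orders` table, its columns, and the `top_customers` helper are hypothetical and made up for illustration; they are not part of any resource in this curriculum.

```python
import sqlite3


def top_customers(rows, limit=2):
    """Rank customers by total spend using a GROUP BY query.

    `rows` is a list of (customer, amount) tuples. The table name and
    columns are illustrative assumptions.
    """
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        # Parameterized inserts keep data separate from the SQL text.
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT customer, SUM(amount) AS total "
            "FROM orders "
            "GROUP BY customer "
            "ORDER BY total DESC, customer ASC "
            "LIMIT ?",
            (limit,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


orders = [("ana", 30.0), ("ben", 12.5), ("ana", 20.0), ("cy", 45.0), ("ben", 2.5)]
print(top_customers(orders))  # [('ana', 50.0), ('cy', 45.0)]
```

The same `SELECT ... GROUP BY ... ORDER BY` pattern carries over to most relational databases, though dialects differ in the details.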
The foundational mathematics for working with large samples of data. Spend time in exercises until you feel highly confident in the key topics of Linear Algebra. It will serve you well.
How can we answer questions with data? Everywhere you look, you’ll see methods from statistics. Spend a lot of time here!
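As one small illustration of answering a question with data, the sketch below runs an exact permutation test for a difference in means, using only the standard library. It is one technique among many, chosen here because it needs no distributional assumptions; the function name and the sample numbers are invented for the example.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test for a difference in means.

    Enumerates every way to split the pooled values into groups of the
    original sizes, so it is only practical for small samples.
    """
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    n = len(pooled)
    extreme = 0
    total = 0
    for indices in combinations(range(n), len(group_a)):
        chosen = set(indices)
        relabeled_a = [pooled[i] for i in chosen]
        relabeled_b = [pooled[i] for i in range(n) if i not in chosen]
        difference = abs(mean(relabeled_a) - mean(relabeled_b))
        # Small tolerance guards against floating point ties.
        if difference >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical measurements from two small groups.
p_value = exact_permutation_p_value([1, 2, 3], [10, 11, 12])
print(p_value)  # 0.1
```

With three values per group there are only 20 possible relabelings, so the smallest achievable two-sided p-value is 2/20 = 0.1. That is a useful reminder that sample size limits what any test can show.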
If you’re starting from scratch with Python, start with this series.
Set up your computer to use tools locally.
Get familiar with using tools to do data analysis. Pro tip: Write out what you’re going to do before you do it! When you hit a snag, return to your plan and rechart as necessary.
How does a computer know what to do? Algorithms are instructions with a fancy name. Learn how instructions are encoded, how to think about structuring those instructions, and patterns for making it work in code.
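To make "instructions with a fancy name" concrete, here is a sketch of binary search, a classic algorithm that finds an item in a sorted list by repeatedly halving the search range. It is offered only as a representative example; the resources in this section cover many more.

```python
def binary_search(sorted_items, target):
    """Return the index of `target` in `sorted_items`, or -1 if absent.

    Assumes `sorted_items` is sorted in ascending order.
    """
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        middle = (low + high) // 2
        if sorted_items[middle] == target:
            return middle
        if sorted_items[middle] < target:
            low = middle + 1   # target can only be in the upper half
        else:
            high = middle - 1  # target can only be in the lower half
    return -1


print(binary_search([1, 3, 5, 7, 9], 7))  # 3
print(binary_search([1, 3, 5, 7, 9], 4))  # -1
```

Each pass discards half of the remaining items, so a list of a million entries needs about 20 comparisons rather than up to a million.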
Courses with many of the topics above included. Be sure you fill in any gaps!
Choose what is most interesting to you, or most relevant to the work you plan to do.
A branch of statistics that uses graphical models and specialized statistics to describe and model cause and effect.
The imperfect and immensely useful art (science?) of transforming human language into data.
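One of the simplest ways to turn language into data is a bag-of-words count. The sketch below is deliberately crude: the regular expression used for tokenizing is an assumption made for illustration, and real NLP libraries handle punctuation, casing, and non-English text far more carefully.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase word appears in `text`.

    Tokens are runs of ASCII letters and apostrophes, a simplifying
    assumption that ignores numbers and accented characters.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


counts = bag_of_words("The cat saw the other cat.")
print(counts["cat"])    # 2
print(counts["other"])  # 1
```

Word order is thrown away entirely, which is exactly the kind of imperfection this section is about.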
Human relationships can be modeled as a network or graph. Many other things suit this model, too. Working with graphs
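As a small illustration of modeling relationships as a graph, the sketch below stores friendships in a dictionary and uses breadth-first search to count the "degrees of separation" between two people. The names and the `degrees_of_separation` helper are hypothetical.

```python
from collections import deque


def degrees_of_separation(graph, start, goal):
    """Return the fewest hops from `start` to `goal`, or None if unreachable.

    `graph` maps each person to a list of the people they know.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        for neighbor in graph.get(person, ()):
            if neighbor == goal:
                return distance + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


# Hypothetical friendship network (undirected, so edges are listed both ways).
friends = {
    "ana": ["ben"],
    "ben": ["ana", "cy"],
    "cy": ["ben", "dee"],
    "dee": ["cy"],
    "eli": [],
}
print(degrees_of_separation(friends, "ana", "dee"))  # 3
print(degrees_of_separation(friends, "ana", "eli"))  # None
```

Breadth-first search visits people in order of distance, which is why the first time it reaches the goal is guaranteed to be along a shortest path.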
This is a huge space with infinite things to learn. For advanced statistical foundation, see The Elements of Statistical Learning.
The most persuasive data stories are ones you can see with your own eyes. Make it visual!
If you have interest in operations management, manufacturing, supply chains, or other real world queuing problems, dig in here.
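For a first taste of queuing, the sketch below computes the standard steady-state results for an M/M/1 queue (one server, random arrivals, exponentially distributed service times). The M/M/1 model is chosen here as the simplest textbook case, not because the resources in this section are limited to it; the function name and the example rates are made up.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Steady-state metrics for an M/M/1 queue.

    Both rates are in customers per unit time. The queue is only stable
    when arrivals are slower than service.
    """
    if arrival_rate <= 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival_rate must be below service_rate")
    utilization = arrival_rate / service_rate
    return {
        "utilization": utilization,
        # Average number of customers in the system: rho / (1 - rho)
        "avg_in_system": utilization / (1 - utilization),
        # Average time a customer spends in the system: 1 / (mu - lambda)
        "avg_time_in_system": 1 / (service_rate - arrival_rate),
    }


# Hypothetical help desk: 2 tickets arrive per hour, 4 can be served per hour.
print(mm1_metrics(2, 4))
# {'utilization': 0.5, 'avg_in_system': 1.0, 'avg_time_in_system': 0.5}
```

Note that the two averages agree with Little's law: the average number in the system equals the arrival rate times the average time in the system (2 × 0.5 = 1).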
Learn about how doing science with others and for businesses can work.
In ideal terms, a Data Scientist advises strategic decision-making using data-backed analysis and tested hypotheses. YMMV as this depends on the company needs and the team being supported.
For a Data Scientist’s work to be impactful, they must be effective at communicating their work and findings. In any setting, clear logic and effective business writing are crucial to reaching your audience. And of course, doing Data Science with a team over Zoom is different from being in person in an office. There is much more written communication and asynchronous consumption of content in the remote office environment. More than ever, writing and communication skills are crucial to being an effective Data Scientist for yourself and your team.
In the modern organization, it is very rare that a Data Scientist works in isolation. Communicating the value of the work being done is crucial to getting buy-in from partners whose decisions and operations depend on your work. Those partners might be:
Typically, the more clearly you are able to communicate the “why”, the value of what you are doing, the more these teams will be able to support you and your work in conversations you may not be a part of. Even if others don’t understand “how” you do your work (which is very important to you and your manager!), they will be able to understand and repeat a well-communicated “why”. This is why we write Specs, to get buy-in and allow for questions or input, before the work starts.
A document conveying the motives, direction, investment, and expected value of the work.
A slide deck or document with the goal of conveying the results of the work and how the findings support an important decision(s).
Best appended to the Spec, and summarized in a slide deck for easy consumption. Depending on the culture of the group, slides or a short document may be easier to look through to understand the results of the work. In the remote work era, think about how your work will be passed around and make sure your “above the fold” is easy to understand and clearly conveys the “why” and results in particular.
Example: A particularly polished presentation of map quality study results showing higher data quality in US maps on OSM than in commercially available alternatives. The impact of this work was a) increased confidence in service reliability and b) a decision by the company against buying a commercially available annual license costing ~$10mi/yr.
Choose a meaningful project or dataset to demonstrate what you’ve learned.
Show the process you used to disprove your hypothesis, preferably in a Jupyter notebook. See examples to get a taste of how you can showcase your work.
$90& Study Group
Take Two
Change Log
Please Contribute; this is Open Source!
Fearless Maintainer: @clarecorthell
RIP v1.0 commit