What are the key stages of a data science project?

The first time you take on a data science project yourself as a freelance data scientist, it’s tempting to think of the project breakdown like this:

25% data cleaning → 50% data science → 25% deployment

Pie chart of a data science project broken down into data cleaning, deployment, and actually doing data science.

Of course, if you take on a project with this expectation, it’s likely to over-run. Deploying a data science model can take much longer than expected, and cleaning data can also be surprisingly complex.

I’ve seen a lot of tongue-in-cheek blog posts saying that the breakdown is more like this:

Pie chart of a more realistic breakdown of a data science project: 25% data cleaning, 50% data science, 25% deployment

However, I think that there’s another major stage that people are missing, which is the stage of gaining a client’s trust, getting NDAs signed, and getting access to the data. I’ve seen projects where it took 6 months to get hold of the data from the client, for 1 month of data science work. I have also seen a fair number of projects fail for the same reason.

So my take on this is that the breakdown is more like this:

Pie chart of my breakdown of a data science project: 10% requesting NDAs and data, 20% data cleaning+data science, 20% deployment, 50% waiting for data
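
To see what this breakdown means in calendar terms, here is a minimal sketch that converts the percentages above into days. The function name and the 120-day project length are assumptions made up for illustration; only the percentage shares come from the breakdown above.

```python
# Percentage shares taken from the breakdown above.
BREAKDOWN = {
    "requesting NDAs and data": 10,
    "data cleaning + data science": 20,
    "deployment": 20,
    "waiting for data": 50,
}


def days_per_stage(total_days, breakdown=BREAKDOWN):
    """Split a project's total length (in days) according to percentage shares."""
    if sum(breakdown.values()) != 100:
        raise ValueError("shares must sum to 100")
    return {stage: total_days * share / 100 for stage, share in breakdown.items()}


if __name__ == "__main__":
    # 120 days is a hypothetical total project length, not a figure from the article.
    for stage, days in days_per_stage(120).items():
        print(f"{stage}: {days:.0f} days")
```

On these assumptions, half of the elapsed time goes on waiting for data, which is why it is worth planning for that stage explicitly.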

Why is getting hold of data such a problem?

As data scientists, we encounter two kinds of data:

public data, which is accessible to everyone. Examples include worldwide coronavirus statistics, weather data, and the Titanic dataset.
private data (or sensitive data), which is held inside organisations.

A paid data science project will usually involve private data. For many companies, the customer dataset, manufacturing logs, or user data are the crown jewels. The data is fiercely guarded, and you will need to sign an NDA before you’re allowed to look at it. The consequences of a data leak are severe (just ask Ashley Madison).

Companies don’t like to give data to outsiders, least of all to freelance data scientists. Even if the stakeholder in the business wants to share the data, several people usually need to sign off on it, and an objection from just one of them is enough to stall the project.

What can we do about it?

Data scientists have to accept that getting hold of data will always be an obstacle. The best way to mitigate the risk of data not showing up is to schedule a kick-off meeting, ideally a month before the project. Request the data from the client at this meeting, and follow up regularly. Make sure that the data is available before the first billable day of the project. This means that any blockers can be dealt with in a timely fashion.
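
As a rough illustration of the scheduling advice above, the sketch below works backwards from the first billable day to a kick-off date and a list of follow-up dates. The 30-day lead time stands in for "a month", and the weekly follow-up interval is an assumption, since the advice above only says to follow up regularly. The function name is hypothetical.

```python
from datetime import date, timedelta


def plan_data_request(first_billable_day, lead_days=30, follow_up_every_days=7):
    """Return a kick-off date and follow-up dates before the first billable day.

    lead_days=30 approximates "a month before the project";
    follow_up_every_days=7 is an assumed interval for "regular" follow-ups.
    """
    kickoff = first_billable_day - timedelta(days=lead_days)
    follow_ups = []
    next_follow_up = kickoff + timedelta(days=follow_up_every_days)
    # Only schedule follow-ups that fall before billable work starts.
    while next_follow_up < first_billable_day:
        follow_ups.append(next_follow_up)
        next_follow_up += timedelta(days=follow_up_every_days)
    return kickoff, follow_ups


if __name__ == "__main__":
    # Hypothetical start date, chosen only for the example.
    kickoff, follow_ups = plan_data_request(date(2024, 7, 1))
    print("Kick-off meeting:", kickoff)
    for d in follow_ups:
        print("Follow up on data request:", d)
```

If the data has still not arrived by the last follow-up date, that is a signal to move the first billable day rather than start the project without data.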

Resources for planning a data science project

I have prepared a number of free and open resources for planning data science projects, which are available on my site at https://fastdatascience.com/resources/.

I built these over several years, making a checklist every time something went wrong rather than relying on intuition… so here they are!

Data science project kickoff checklist (PDF format)
NLP project planner (in-browser tool)
Data science roadmap planner (PDF format)
NLP project risk tool (PDF format)

Conclusion

A large and often overlooked obstacle to getting a data science project off the ground is getting data from the client. The process can involve lots of meetings, paperwork and NDAs before you even sign a contract for the work.

The best way to mitigate this obstacle is to anticipate it and to plan for 1 month of ‘waiting for data’. This month involves kick-off meetings with a client, NDA signing, and regular follow-ups. If this step is taken, the probability of success is much higher.

This article inspired a post on fastdatascience.com.

Thomas Wood, freelance data scientist in London, UK. Speciality area: natural language processing (NLP)
