I recently attended a National Consortium for Data Science event at a local university (University of North Carolina at Chapel Hill). As you can imagine, many people were talking about Big Data and Data Scientists. But my opinion about the term “Data Scientist” seems to differ from everyone else’s.
I first heard the term “Data Scientist” about 5 years ago when I joined the Watson group at IBM. No one seemed to know what the term really meant then, but the idea was that the world of Big Data was going to require a lot of new skills, and that very few people had the skills needed to compete successfully. So we had to start building those skills right away.
At that time, the mental image that formed in my head was a “cloud of skills”: that is, a cloud of points where each point represented one skill. The cloud contained a rather large number (20-30) of points/skills. We could certainly identify some of the points/skills (data collection, data cataloging, model building, machine learning, etc.), but it was assumed that some of these necessary new skills were still unknown or at best ill-defined. I imagined the cloud was diffuse at the time, but that as everyone began to better understand just what was required, the cloud of skills would contract, becoming denser and better focused.
Now, some five years later, the situation seems to have only gotten worse. The term “Data Scientist” has become an all-inclusive catch-all, a kitchen sink. Whenever someone sees something that seems to be required, they toss it into the cloud of skills every Data Scientist “must have”. The cloud is getting bigger. It is getting broader. It is getting more diffuse.
I agree that many skills are needed to adequately work with big data. Some of the skills in this cloud, expressed as the roles that require them, are:
- business analyst who identifies what a business needs or would find valuable along with plausible ideas of what might be technically possible
- architect capable of taking the business analyst’s vision (which is likely partially right and partially wrong) and converting it into something that can actually be built
- data gatherer (both raw data and ground truth)
- digital rights manager
- data manipulator
- data organizer and cataloger
- model builder
- machine learning expert
- visualization builder
- security architect to ensure data is protected
- DevOps person to continuously fuse all these parts together over the course of many experiments
- statistician
- lawyer to oversee the contracts across the many different parties involved
- dynamic presenter who can persuasively demonstrate the solution
Sure, there may be a few brilliant, lone wolf geniuses out there who possess all these skills. But, realistically, it is inconceivable to me that there will ever be a large number of people who have all of these skills, and thus become true examples of the all-inclusive “Data Scientist”.
Instead of trying to jam all of these skills into a single person, what we really need are “Data Teams”. Every other kind of engineering uses teams, so why should the world of Big Data be different? I predict two possible evolutions for the term “Data Scientist”:
- [all-inclusive] The term Data Scientist continues to be all-inclusive, becoming so broad and so ill-defined that it becomes unstable and completely collapses, signifying nothing useful. It will be added to the trash heap of unused terms.
- [data-focused] The term Data Scientist will come to mean a person responsible for the large volumes of data. In the list above, this includes the data gatherer, digital rights manager, data manipulator, and data organizer and cataloger. Of course, people who play this role can have other, non-Data Scientist skills, too.
Let’s stop pretending we are searching for, or attempting to train, individuals who possess all of these skills. The all-inclusive version of the term “Data Scientist” is not stable.
I strongly encourage us to adopt the “data-focused” version, because I believe we absolutely need people to perform this new and critically important role. Data is a new piece of the puzzle, and it will require special tooling and expertise. And Data Scientist seems to be the perfect term for someone concerned with the technical, organizational and legal aspects of Big Data. But we need to see the Data Scientist as just one member of a larger Data Team.
What do you think? Is there a better term for a data-focused member of the team? What skills can we realistically expect from a “data-focused” person?