The term “Data Scientist” is unstable

I recently attended a National Consortium for Data Science event at a local university (University of North Carolina at Chapel Hill).  As you can imagine, many people were talking about Big Data and Data Scientists.  But my opinion about the term “Data Scientist” seems to differ from everyone else’s.

I first heard the term “Data Scientist” about 5 years ago when I joined the Watson group at IBM.  No one seemed to know what the term really meant then, but the idea was that the world of Big Data was going to require a lot of new skills, and that very few people had the skills needed to compete successfully.  So we had to get started building those skills right away.

At that time, the mental image that formed in my head was a “cloud of skills”:  that is, a cloud of points where each point represented one skill.   The cloud contained a rather large number (20-30) of points/skills.  We could certainly identify some of the points/skills (data collection, data cataloging, model building, machine learning, etc), but it was assumed that some of these necessary, new skills were currently unknown or at best ill-defined.  I imagined the cloud was currently diffuse, but over time, as everyone began to better understand just what was required, the cloud of skills would contract, becoming denser and better focused.

Now, some five years later, the situation seems to have only gotten worse.  The term “Data Scientist” has become an all-inclusive catch-all and kitchen sink.  Whenever someone sees something that seems to be required, they toss it into the cloud of skills every Data Scientist “must have”.  The cloud is getting bigger.  It is getting broader.  It is getting more diffuse.

I agree that many skills are needed to adequately work with big data.  Some of the skills in this cloud are:

  • business analyst who identifies what a business needs or would find valuable along with plausible ideas of what might be technically possible
  • architect capable of taking the business analyst’s vision (which is likely partially right and partially wrong) and converting it into something that can actually be built
  • data gatherer (both raw data and ground truth)
  • digital rights manager
  • data manipulator
  • data organizer and cataloger
  • model builder
  • machine learning expert
  • visualization builder
  • security architect to ensure data is protected
  • DevOps person to continuously fuse all these parts together over the course of many experiments
  • statistician
  • lawyer to oversee the contracts across the many different parties involved
  • dynamic presenter who can persuasively demonstrate the solution

Sure, there may be a few brilliant, lone wolf geniuses out there who possess all these skills.  But, realistically, it is inconceivable to me that there will ever be a large number of people who have all of these skills, and thus become true examples of the all-inclusive “Data Scientist”.

Instead of trying to jam all of these skills into a single person, what we really need are “Data Teams”.  Every other kind of engineering uses teams, so why should the world of Big Data be different?  I predict two possible evolutions for the term “Data Scientist”:

  1. [all-inclusive]  The term Data Scientist continues to be all-inclusive, becoming so broad and so ill-defined that it becomes unstable and completely collapses, signifying nothing useful.  It will be  added to the trash heap of unused terms.
  2. [data-focused]  The term Data Scientist will come to mean a person responsible for the large volumes of data.  In the list above, this includes data gatherer, digital rights manager, data manipulator, and data organizer.  Of course people who play this role can have other, non-Data Scientist skills, too.

Let’s stop pretending we are searching for, or attempting to train, individuals who possess all of these skills.  The all-inclusive version of the term “Data Scientist” is not stable.

I strongly encourage us to adopt the “data-focused” version, because I believe we absolutely need people to perform this new and critically important role.  Data is a new element of the puzzle, and it will require special tooling and expertise.  And “Data Scientist” seems to be the perfect term for someone concerned with the technical, organizational, and legal aspects of Big Data.  But we need to see the Data Scientist as just one member of a larger Data Team.

What do you think?  Is there a better term for a data-focused member of the team?  What skills can we realistically expect from a “data-focused” person?

Local Clouds

“Cloud” is one of the industry’s current buzzwords.  And while there are many good reasons to implement and use global clouds, we shouldn’t thoughtlessly push everything to such clouds.  Let me explain why I say this, and some of its implications.  In particular, I am going to argue that local clouds are a better solution for some problems.

A big reason for global clouds is the desire to aggregate all the data into a single cloud (OK, it may just be virtually aggregated into a single, distributed cloud infrastructure—but all the points below still apply).  This use case is the poster child for “big data”.  All the aggregated data can be analyzed in the global cloud, this way and that; correlations and patterns can be found, predictive models can be constructed and validated, and so on.

But a big reason against global clouds—the one that pushed me down this line of thinking—is privacy.  There is a saying: “If you don’t pay for a product, you ARE the product.”  How can so many large cloud infrastructures be free to users?  Because cloud providers harvest and sell their users and their users’ data.  I don’t like paying for services any more than the next guy, but I really don’t like being sold.  Me and my data are MINE!

Another huge industry buzzword is IoT (Internet of Things).  The IoT movement starts at the opposite end of the aggregation spectrum, with atomic nodes on each individual device.  Each atomic node has to support a minimal set of sensors and actuators, plus enough networking to enable other, more capable nodes to read and manipulate it.  Most atomic nodes do not need to also be a “compute node”, which contains a general-purpose processor and (relatively) large amounts of storage.  Most atomic nodes only need to communicate (directly or indirectly) with a compute node.
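
To make the split concrete, here is a minimal sketch (in Python) of what an atomic node’s job boils down to: read a sensor and hand the value to a nearby compute node.  The address, node id, and sensor function here are made-up assumptions for illustration, not anything prescribed by IoT itself.

    # Hypothetical atomic node: read one sensor, report it to a compute node.
    # The address, node id, and sensor function are illustrative assumptions.
    import json
    import time
    import urllib.request

    COMPUTE_NODE_URL = "http://192.168.1.10:8080/readings"  # assumed local compute node

    def read_temperature_sensor():
        # Stand-in for real hardware access (e.g., an I2C or GPIO read).
        return 21.5

    while True:
        reading = {"node_id": "porch-thermometer", "temp_c": read_temperature_sensor()}
        req = urllib.request.Request(
            COMPUTE_NODE_URL,
            data=json.dumps(reading).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)  # push the reading; no local analysis needed
        time.sleep(60)               # atomic nodes can afford to be simple and slow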

A common view of IoT is that the compute node should be in the global cloud.  This makes some sense, because ALL of the data from ALL of the atomic nodes are available for global analysis in the global cloud.  But it also gives a tremendous (unfair?) advantage to the cloud owner.  Typically the cloud owner claims ownership of all this data—and they certainly have legal claim to all derived data they construct from the individual data.  Plus, they can reliably infer a lot of personal information from it: your personal habits, when you are active, what you like, which topics interest you, etc, etc, etc.  I see this as a huge loss of personal privacy.

But do we really need to push massive oceans of data from all the atomic IoT nodes up to global clouds?  Are global clouds the only viable location for compute nodes?  I say, “No”.

I believe a better way—at least for some applications—is to build local compute nodes, near the atomic nodes.  The local compute node(s) act as a local cloud, but none of this data is ever pushed up to a global cloud.  This achieves many of our objectives (atomic nodes remain simple, a single compute node can service multiple atomic nodes) without violating privacy.
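
As a rough illustration, here is what the compute-node side of such a local cloud could look like.  It is a minimal sketch using only the Python standard library; the port, endpoint, and storage choice are assumptions.  Readings from the atomic nodes are accepted and kept on the local node, and nothing is forwarded anywhere else.

    # Hypothetical local compute node ("local cloud"): accept readings from
    # atomic nodes on the local network and store them locally -- nothing is
    # ever forwarded to a global cloud.
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    readings = []  # stays on this node; could just as well be a local SQLite file

    class LocalCloudHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            readings.append(json.loads(self.rfile.read(length)))
            self.send_response(204)  # acknowledge; no response body needed
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), LocalCloudHandler).serve_forever()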

I’ll have more to say about “local clouds” in upcoming posts.

2015 is the Mole-of-Bits Year

Take a look at this graphic from IDC.  It estimates the total number of bits in the world over time.  There are many things going on in this graphic.  It shows that enterprise data is certainly growing, but it does not comprise the majority of data; sensor data and social media data far outstrip it.

It also shows that a huge fraction of all data contains uncertainty.  This has dramatic implications for old-school programmers.  Programming absolutely must continue to adopt new approaches to handle uncertain input data, particularly for emerging cognitive applications.  The traditional excuse of classifying ANY input errors as “garbage in” just won’t cut it any more.
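
One small example of what such an approach can look like: instead of rejecting a suspect input outright, carry a confidence score along with each value and let downstream code weight inputs by how much they are trusted.  This is only a toy sketch; the class name and numbers are made up.

    # Toy sketch: keep uncertainty attached to the data instead of discarding it.
    from dataclasses import dataclass

    @dataclass
    class UncertainValue:
        value: float
        confidence: float  # 0.0 (no trust) .. 1.0 (full trust)

    def weighted_average(readings):
        # Combine noisy readings, weighting each by how much we trust it.
        total_weight = sum(r.confidence for r in readings)
        if total_weight == 0:
            return None
        return sum(r.value * r.confidence for r in readings) / total_weight

    readings = [UncertainValue(20.9, 0.9), UncertainValue(35.0, 0.1)]  # second looks suspect
    print(weighted_average(readings))  # ~22.3: the outlier is down-weighted, not thrown away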

But my favorite part of this graphic is the axes; forget the curves (how often does that happen?).  The x-axis shows time, with 2015 on the far right.  The y-axis shows the number of bits in the world.  For the chemists among you, 10 to the 23rd is essentially Avogadro’s number (6.02 × 10^23), which is the number of molecules in a “mole”.  What does this data mean?  Imagine you’re holding a tablespoon filled with water.  You’re holding a mole of water molecules.  The chart above implies that this year, 2015, there will be one bit of data for EVERY molecule of water in that tablespoon.  To me, that is nothing short of INCREDIBLE and AWESOME.  When I was growing up, I remember trying to imagine how we would ever have such a gi-nor-mous number of macroscopic things.  Well, here we are, and in my lifetime.  I’m moved.
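
For anyone who wants to check the back-of-the-envelope arithmetic, here is a quick sketch.  The 10^23-bit figure is just my reading of the IDC chart, so treat everything as order-of-magnitude only.

    # Rough arithmetic behind the "mole of bits" claim (order-of-magnitude only).
    AVOGADRO = 6.022e23          # molecules per mole
    MOLAR_MASS_WATER_G = 18.0    # grams per mole of H2O
    WATER_DENSITY_G_PER_ML = 1.0

    mole_of_water_ml = MOLAR_MASS_WATER_G / WATER_DENSITY_G_PER_ML
    print(f"A mole of water is about {mole_of_water_ml:.0f} mL -- roughly a tablespoon")

    bits_in_2015 = 1e23          # approximate value read off the IDC graphic
    print(f"Bits per water molecule: {bits_in_2015 / AVOGADRO:.2f}")
    # Roughly the same order of magnitude as one bit per molecule in the spoonful.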

So I hereby officially declare

2015 as the “mole of bits” year