Monthly Archives: March 2016

New Job

After spending almost 5 years working in IBM’s Watson and Watson Health groups, I am moving on.

I am extremely pumped to join IBM Research.  This is a dream come true!

We’re beginning work on a project called Cognitive Eldercare.  Our goal is to keep elders in their homes (or assisted living facilities) as long as possible and prudent.  With the planet’s aging population, there is an absolutely gi-nor-mous market for eldercare solutions, especially those that cognitively integrate many disparate aspects.

My task is to build the key foundational layer, called Knowledge Reactor:  a large Titan graph database, extended to react to graph changes by publishing those changes to a Kafka messaging infrastructure.  We’ll also work (play is more accurate) with lots of IoT devices:  sensors, effectors, robots, drones, and who knows what else.

All that would be cool enough.  But the best part is that we’re going to be writing a lot of agents that react to incoming events and graph change events.  So, I’ll get to build more multi-agent systems and extend my PhD research.
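To make the idea concrete, here is a minimal sketch of the pattern: a graph that publishes every change, and an agent that reacts to the changes it cares about.  It is not the Knowledge Reactor itself.  It assumes a tiny in-memory graph in place of Titan and an in-memory bus in place of Kafka, and every name in it (GraphChange, InMemoryBus, ReactiveGraph, EdgeLabelAgent, the "graph-changes" topic, the sensor names) is hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphChange:
    """One change event; the field names are illustrative assumptions."""
    kind: str       # "vertex_added" or "edge_added"
    payload: tuple  # (name,) for a vertex, (src, label, dst) for an edge


class InMemoryBus:
    """Stand-in for a message broker: maps a topic to subscriber callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Deliver synchronously, in subscription order.
        for handler in list(self._subscribers[topic]):
            handler(event)


class ReactiveGraph:
    """Toy graph that publishes a GraphChange for every real mutation."""

    def __init__(self, bus, topic="graph-changes"):
        self.vertices = set()
        self.edges = set()
        self._bus = bus
        self._topic = topic

    def add_vertex(self, name):
        if name not in self.vertices:  # only new vertices produce an event
            self.vertices.add(name)
            self._bus.publish(self._topic, GraphChange("vertex_added", (name,)))

    def add_edge(self, src, label, dst):
        self.add_vertex(src)
        self.add_vertex(dst)
        edge = (src, label, dst)
        if edge not in self.edges:  # only new edges produce an event
            self.edges.add(edge)
            self._bus.publish(self._topic, GraphChange("edge_added", edge))


class EdgeLabelAgent:
    """Hypothetical agent that records every new edge with a given label."""

    def __init__(self, label):
        self.label = label
        self.seen = []

    def __call__(self, change):
        if change.kind == "edge_added" and change.payload[1] == self.label:
            self.seen.append(change.payload)


bus = InMemoryBus()
graph = ReactiveGraph(bus)
agent = EdgeLabelAgent("triggered")
bus.subscribe("graph-changes", agent)

graph.add_edge("motion_sensor_1", "triggered", "hallway_light")
graph.add_edge("motion_sensor_1", "located_in", "hallway")
print(agent.seen)
```

The agent only records the "triggered" edge and ignores the rest.  In a real system the bus would be a message broker and delivery would be asynchronous, so agents would also have to cope with ordering and retries, which this sketch ignores.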

Can life get any better?  God is good!

The term “Data Scientist” is unstable

I recently attended a National Consortium for Data Science event at a local university (University of North Carolina at Chapel Hill).  As you can imagine, many people were talking about Big Data and Data Scientists.  But my opinion about the term “Data Scientist” seems to differ from everyone else’s.

I first heard the term “Data Scientist” about 5 years ago when I joined the Watson group at IBM. No one seemed to know what the term really meant then, but the idea was that the world of Big Data was going to require a lot of new skills, and that very few people had the skills needed to compete successfully. So we had to get started building those skills now.

At that time, the mental image that formed in my head was a “cloud of skills”:  that is, a cloud of points where each point represented one skill.   The cloud contained a rather large number (20-30) of points/skills.  We could certainly identify some of the points/skills (data collection, data cataloging, model building, machine learning, etc.), but it was assumed that some of these necessary, new skills were still unknown or at best ill-defined.  I imagined the cloud was diffuse for the moment, but that over time, as everyone began to better understand just what was required, the cloud of skills would contract, becoming denser and better focused.

Now, some five years later, the situation seems to have only gotten worse.  The term “Data Scientist” has become an all-inclusive catch-all, a kitchen sink.  Whenever someone sees something that seems to be required, they toss it into the cloud of skills every Data Scientist “must have”.  The cloud is getting bigger.  It is getting broader. It is getting more diffuse.

I agree that many skills are needed to adequately work with big data.  Some of the skills in this cloud are

  • business analyst who identifies what a business needs or would find valuable along with plausible ideas of what might be technically possible
  • architect capable of turning the business analyst’s vision (which is likely partially right and partially wrong) into something that can actually be built
  • data gatherer (both raw data and ground truth)
  • digital rights manager
  • data manipulator
  • data organizer and cataloger
  • model builder
  • machine learning expert
  • visualization builder
  • security architect to ensure data is protected
  • DevOps person to continuously fuse all these parts together over the course of many experiments
  • statistician
  • lawyer to oversee the contracts across the many different parties involved
  • dynamic presenter who can persuasively demonstrate the solution

Sure, there may be a few brilliant, lone wolf geniuses out there who possess all these skills.  But, realistically, it is inconceivable to me that there will ever be a large number of people who have all of these skills, and thus become true examples of the all-inclusive “Data Scientist”.

Instead of trying to jam all of these skills into a single person, what we really need are “Data Teams”.  Every other kind of engineering uses teams, so why should the world of Big Data be different?  I predict two possible evolutions for the term “Data Scientist”:

  1. [all-inclusive]  The term Data Scientist continues to be all-inclusive, becoming so broad and so ill-defined that it becomes unstable and completely collapses, signifying nothing useful.  It will be added to the trash heap of unused terms.
  2. [data-focused]  The term Data Scientist will come to mean a person responsible for the large volumes of data.  In the list above, this includes data gatherer, digital rights manager, data manipulator, and data organizer.  Of course people who play this role can have other, non-Data Scientist skills, too.

Let’s stop pretending we are searching for, or attempting to train, individuals who possess all of these skills.  The all-inclusive version of the term “Data Scientist” is not stable.

I strongly encourage us to adopt the “data-focused” version, because I believe we absolutely need people to perform this new and critically important role. Data is a new piece of the puzzle, and it will require special tooling and expertise.  And Data Scientist seems to be the perfect term for someone concerned with the technical, organizational and legal aspects of Big Data. But we need to see the Data Scientist as just one member of a larger Data Team.

What do you think?  Is there a better term for a data-focused member of the team?  What skills can we realistically expect from a “data-focused” person?