Blog: Data Science What?
What is it after all? What is it for?
A lot has been written about Data Science, but mostly on the technical side, the “geeky stuff” (which I personally love). Not much has been produced about the advantages and real impact, from a more business-driven perspective, of this blend of expertise fields called Data Science.
I want to present some arguments that support my opinion that it is one of the crucial sectors, terms, areas of expertise (you name it) of our time. It is so broad that almost everyone can be a data scientist (purists, “pythonists” and “dataists”, don’t be mad at me, yet).
Rather than going into details or being too scientific or dogmatic, I will focus on the practical application of data science: how Data Science can fit into our product development process and how we can extract value from the activities carried out by (the two main types of) data scientists (and related roles). Moreover, I would like to write about the concepts of data-driven and data-informed, because it is still not possible to translate the World, as we know it, into a database (yet), to the disappointment of the evangelists of the data-driven approach.
How did it start?
In 2008, Dr DJ Patil and Jeff Hammerbacher, then heads of analytics and data at LinkedIn and Facebook respectively, are often credited with coining the term ‘data scientist’, giving a name to the practitioners of an emerging field of study focused on teasing out the hidden value in the data being collected from touchpoints all over the retail and business sectors. Data Science is now the umbrella term for a discipline that spans Programming, Statistics, Data Mining, Machine Learning, Analytics, Business Intelligence, Data Visualisation and a host of other subject areas. The science is constantly changing and evolving as it moves to keep abreast of technology and business practices alike.
As with everything humans get to know and touch, there is the temptation to fit that thing into a square box, tidy it up, and narrow it down to the smallest part possible. We have a tendency, and a need, to accurately define and understand everything so that we can explain it beyond any reasonable doubt (besides the goal of preaching what we know to the World so that we can be seen as experts in the age of virtual self-exposure). However, in the case of Data Science, that is still not possible. It has been a rather hard topic to discuss and to reach consensus on, as the answer to the question “What is Data Science after all?” is not that straightforward. However, I would say… let it grow! Let it expand and prove that it is essential for our future in the information society. There is no problem whatsoever in having it as it is: a broad, impactful, inclusive, holistic and beautiful field embracing computer science, business and math.
Most knowledgeable people in this area would agree that Data Science is a scientific and truth-seeking discipline that uses data to extract knowledge and insights. Data Science is one of the fastest-growing functions and is already providing tremendous value across every industry and area of study. Nevertheless, it is still in its infancy, and as with any developing field, it is often tempting to put boundaries around its definition. Rather than categorising what does or does not count as data science, or arguing about why we should be data-informed but not data-driven, I believe it is most important to leave room for the discipline to evolve organically.
(Usually, people would place a Venn diagram here, but I won’t do it. You can choose one from the hundreds already produced.)
WHY DOES DATA SCIENCE MATTER?
Because of the increasing complexity of our digital World.
Digital. Website. Internet. Platforms. IoT. 5G. Ecosystems.
More and more, these words fly around us in our daily lives, as if we were talking about water, bread, or any other basic need. I bet you have heard at least one of them today. Adding to this, the explosive mix of more products being built, more internet-connected devices, and our appetite for being “always connected” has caused an unbelievable increase in user-generated data, much of which is related to our interaction with digital systems. For instance, as you read this article, I could see that you (or someone) spent x minutes reading it today. As a consequence, and because we are curious creatures, there has been tremendous interest in mining this data to extract critical insights, with the hope of building better, more innovative products. The smartwatch is a clear example of this, which leads me to my next section.
Why are Data Scientists in such demand?
Simply because of the expected value they (I mean, data, mostly) add to product innovation, customer understanding (including the purchase process) and the consequent revenue growth through product/sales increase. Is that always achieved? Well, it depends on what you are looking for, in the first place.
To gain, regain or sustain a competitive advantage in a tight, accelerating market, companies’ strategies are more and more based on how well they gather, analyse and draw insights from tons of structured and unstructured data (Big Data Analytics plays an important role here). A company’s ability in this matter can be a tremendous driver of product innovation.
That is why data scientists, data engineers, data analysts, and so forth are in high demand, and for the good of our economy, I would say. A team of professional data scientists can make a massive difference to a company’s future across almost every industry.
Beyond buzz words, here are some examples of what Data Science brings to the table of every business, as far as insights from data and product development are concerned:
Prediction and Description Data Models
Perhaps the most common role of data scientists is to build models (i.e. prototypes) by applying Machine Learning algorithms to the available data, once it has been pre-processed (an activity that occupies, by far, the majority of a data scientist’s working hours), or to implement data mining models that better categorise data in order to identify patterns in it. That data is used to train machine learning models (early-stage robots) of a particular phenomenon in order to forecast future trends. That is not voodoo; it is based on past data and well-known statistical models. It is not rocket science. Going deeper into descriptive analytics models, a data scientist can perform, for instance, a more in-depth exploration and analysis of the user journey to generate actionable insights that ultimately help set the roadmap and strategy for the product (for example, launching a brand new financial product targeting a specific new customer segment). As you can see, companies have a lot to gain from top-notch product analytics professionals.
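To make “forecasting from past data” a little more concrete, here is a minimal sketch. It assumes a made-up series of monthly sales and a plain least-squares trend line, which is only one of many possible models and not necessarily what a real team would pick. The names fit_trend, forecast and monthly_sales are illustrative, not taken from any library.

```python
def fit_trend(values):
    """Fit y = intercept + slope * t by ordinary least squares, with t = 0, 1, 2, ..."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    cov = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    var = sum((t - t_mean) ** 2 for t in range(n))
    slope = cov / var
    intercept = y_mean - slope * t_mean
    return intercept, slope


def forecast(values, steps_ahead):
    """Extend the fitted trend line steps_ahead periods past the last observation."""
    intercept, slope = fit_trend(values)
    return intercept + slope * (len(values) - 1 + steps_ahead)


# Hypothetical past data: six months of sales growing steadily.
monthly_sales = [100, 110, 120, 130, 140, 150]
next_month = forecast(monthly_sales, 1)
print(next_month)
```

Real models are usually richer than a straight line, but the principle is the same: learn a pattern from past observations and project it forward.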
Build the Right Product Version
Many companies run experiments and deliver products only after evaluating the results of the available options and trials. Typically, data scientists help to design the proper analysis for the feature being tested, identify data-informed hypotheses about the phenomena involved, and guide the product team through constant feedback based on the insights gathered. This is therefore an essential analytical role, ensuring that the right products go to market with the right features.
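As an illustration of the kind of analysis behind such an experiment, here is a small sketch of a two-proportion z-test comparing the conversion rates of two product versions. The visitor and conversion counts are invented, and the z-test is just one common choice for this comparison; a real experiment would also need a sample size and a significance threshold planned in advance.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates B - A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both versions convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: version A converts 200 of 2000 visitors, B converts 260 of 2000.
z, p_value = two_proportion_z_test(200, 2000, 260, 2000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between versions is unlikely to be pure chance, which is one input (not the only one) into deciding which version to ship.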
How is Your Business’s Health?
Data scientists have the ability to assess a business’s health. This activity is quite straightforward, as business outcomes are themselves data, so there is no need to find ways of getting data from complicated sources. Usually, the data is already in the company’s information systems, perhaps in data warehouses or in data lakes. Evaluating the health of a product or a business is quite “simple” when compared with the other two activities above, I would argue. It is typically done by first defining product success in terms of measurable metrics. Those metrics are monitored continuously to ensure that the company is on track to meet its objectives. However, as the World is not perfect (luckily), there will always be outliers (“strange” data points), and experienced analysts will focus on them to understand the drivers, causes and possible business consequences behind them, usually employing data visualisation dashboards and reports.
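Here is a rough sketch of what monitoring a metric for outliers might look like. It assumes a hypothetical list of daily sign-ups and flags any day that is more than two standard deviations from the mean. Both the data and the threshold are assumptions made for illustration, and real dashboards often use more robust rules.

```python
import statistics


def flag_outliers(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [
        (i, v)
        for i, v in enumerate(values)
        if abs(v - mean) / stdev > threshold
    ]


# Hypothetical metric: daily sign-ups, with one unusually bad day.
daily_signups = [120, 118, 125, 122, 119, 121, 60, 123, 124, 120]
outliers = flag_outliers(daily_signups)
print(outliers)
```

The code only points at the “strange” day. Understanding why it happened is still the analyst’s job.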
NOT ONE BUT TWO TYPES OF DATA SCIENTISTS
The selling point of Data Science is that it is helping to build the next generation of algorithms to improve decision making (mostly in a business context) and, most of all, to improve the way we understand the World around us (if we talk about society and all the data we can gather about ourselves). Generally speaking, Data Science professionals fall into two broad categories: product analysts and algorithm developers.
Some would argue that we would be better off with a distinction between specialists and generalists, but the two classifications are not mutually exclusive. However, we should keep in mind that the data scientist role (and related roles) can take multiple shapes throughout a career, and can entail functions that vary significantly, or slightly, across companies and industries. After all, data is data, but a real data scientist is expected to be able to quickly capture the true essence of the industry and business she/he is working in. In this way, the chances to excel and shine in her/his role are far greater.
On the one hand, product analysts are those whose role is to deliver the most data-informed analysis possible. Let us imagine that a data science team unveils the reasons behind the sudden growth in purchases of a particular financial sub-product. A data-informed decision would be to act according to that analysis and that specific reason, and to lead the change in the portfolio strategy, rather than adjusting the bank’s whole financial services portfolio just because the trend was identified. Data is a valuable input, but not the only one taken into consideration when making a decision. The data scientist’s duty is to go deeper into the analysis of the problem or objective, with the tools at hand and often beyond them, to be able to give the best information possible to the decision makers. Having said this, product analysts have the responsibility to set goals and help define the product roadmaps and strategies. The deliverable from a data scientist focused on product analysis is a report explaining the quantifiable issues and identifying opportunities, along with data-based recommendations and solutions.
On the other hand, algorithm developers have the responsibility of incorporating data-driven features into products. Two easily understandable examples are the face recognition algorithms in Google Photos, which detect who is with you in a picture, and the great Netflix recommendation system, which shows users the movies and series they are likely to enjoy with a high degree of certainty. The role of data scientists focused on developing and upgrading an algorithm is to leverage data to improve product performance in pursuit of a specific end goal, typically forecasting outcomes. Algorithm developers generally use machine learning and other sophisticated algorithmic techniques to make predictions based on inputs from vast quantities of data. In general, they prototype proposed solutions and work closely with engineering teams to implement them in production. The deliverable from algorithm developers is prototype code and documentation that is handed over to the engineering team.
Both types of data scientists require an analytical outlook, quantitative skills, and the ability to prioritise. While algorithm developers need more sophisticated technical knowledge and a level of software engineering skill closer to that of engineers, product analysts are primarily problem solvers who are differentiated by their business and product knowledge and their ability to communicate effectively with a wide variety of stakeholders. Algorithm developers might not be needed in every organisation, due to the specificity of their skills. However, all companies, especially those with a significant user base, can benefit a lot from having product analysts in their teams, helping across the organisation with the product strategy and other data-related business challenges.
WHAT LIES AHEAD ABOUT DATA-DRIVEN AND DATA-INFORMED
Imagine a World in which a machine knows everything about you, more than you know about yourself. It can even know what you need to have for dinner, according to your current vitamin and nutrient deficits. Combine this with the knowledge it has about your food preferences and it can actually cook for you. It can also shop for you in advance and organise your meal schedule for the week. It can even know what you need to learn to achieve your career goals and recommend the best courses in the world. Ultimately, it knows your choices and can make decisions for you, knows what is right for you and plans your life. Scary? Perfection? Perhaps. It will be possible if we become purely data-driven and if we allow Artificial Intelligence to take over many of our life decisions.
In a perfect World, with perfect information and a complete understanding of all the drivers of your systems and how they interact with each other, the data-driven and data-informed approaches would converge. To build a perfect model, a data scientist needs to understand the phenomenon rather well; the relationship between the data and the event can then be described by an ideal model (and an associated broad range of features). Until models evolve to this level of perfection, we need to continue augmenting our decision making with other, subjective measures that cannot easily be fully quantified yet; hence the need for data-informed decisions. Humans still have a say.
Wisdom is hard even to translate into words. The World cannot be turned into complete data yet. As humans start to have a more in-depth perception of our relationships with objects, more and more processes will be automated, and the future will be more purely data-driven. Nevertheless, data-informed decision making will continue to be extremely important for the next few decades, and data-driven decision making will only improve with advancements from people who are data-informed.
So… are you data-driven? Stop that. For now.
Here is where the human + machine comes along.
One day, a study revealed a strange coincidence between the increasing number of people drowning, in a specific country and in a particular month, and the rising consumption of ice cream in that same country and month. The researchers tried to identify the reasons for such a strong correlation. Several hypotheses were considered, ranging from “it is a mere coincidence” to the at-first-glance reasonable “people should not eat ice cream when they go to the beach”. It turns out that these hypotheses were quickly discarded. The researchers concluded that there was another factor actually causing the increase in both: it was the usual hot summer weather that was leading people to the beach and to eat more ice cream. The two events did not depend on each other. They were both, indeed, a result of the hot weather.
Causation and correlation are not the same thing, and confusing them can cause problems if we are totally data-driven. Correlation is when two variables (in this case, the number of drownings and ice-cream consumption) show the same statistical trend over the same period, for instance, whereas causation is when one event is the result of the occurrence of the other.
This makes sense to us all, doesn’t it? Of course it does, and that is because we use our wisdom, our knowledge about the World acquired over a lifetime, which plays an important role here. Otherwise, if we had to make a purely data-driven decision in this case (and supposing that weather data was undervalued in the analysis), we would see governmental institutions wrongly legislating against ice cream to prevent more deaths from drowning.
Our human wisdom is not yet easily converted into complete data.
This is a rather simple example, commonly used to teach the difference between causation and correlation, and it illustrates well the need for humans to rely on data for decision making without relying on it too much. If we are purely data-driven, we could make lots of mistakes.
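The ice cream story can be reproduced with a small simulation. The sketch below is built entirely on invented numbers: temperature drives both ice-cream sales and drownings, and the two never influence each other directly. The raw correlation between them comes out clearly positive, yet it almost vanishes once the effect of temperature is removed from both (a simple partial correlation). The coefficients and noise levels are arbitrary assumptions chosen for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def residuals(ys, xs):
    """Remove the linear effect of xs from ys using a least-squares fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Invented data: temperature is the hidden common cause of both series.
temperature = [rng.uniform(0, 35) for _ in range(2000)]
ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temperature]
drownings = [0.1 * t + rng.gauss(0, 1) for t in temperature]

raw_corr = pearson(ice_cream, drownings)
partial_corr = pearson(
    residuals(ice_cream, temperature), residuals(drownings, temperature)
)
print(f"raw correlation: {raw_corr:.2f}")
print(f"after controlling for temperature: {partial_corr:.2f}")
```

Of course, this only works because we knew to include temperature in the first place, which is exactly where human knowledge about the World comes in.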
An excellent example of how important it is to distinguish situations that call for a data-informed process from those where a simple data-driven one is enough can be found in production systems, in the context of enhancing productivity and operational efficiency. For companies like James Finance or Feedzai, which calculate credit risk or identify fraudulent activity in transactions, it is prohibitively expensive to do this manually for every transaction.
As a result, they mostly rely on data science and machine learning models to boost their operational systems and automate all the calculations necessary to reach their primary goals, whether that is to calculate the most accurate credit risk possible or to detect a fraudulent bank transaction.
Another clear example of this problem is in medicine, where diagnosis and prognosis methods powered by artificial intelligence are being used.
Although many of these tasks and decisions can, most likely, be automated, in situations where there is lower confidence in the statistical relevance of the outcome, the decision process is not purely data-driven but often data-informed.
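One way to picture this split is a simple routing rule: let the model decide on its own when it is very confident, and hand everything else to a person. The sketch below is hypothetical. The thresholds, the labels and the idea of a single fraud probability per transaction are assumptions made for illustration, not a description of how any particular company works.

```python
def route_decision(fraud_probability, low=0.05, high=0.95):
    """Decide automatically only when the model is confident; otherwise involve a human."""
    if fraud_probability >= high:
        return "auto_block"      # data-driven: model is confident it is fraud
    if fraud_probability <= low:
        return "auto_approve"    # data-driven: model is confident it is legitimate
    return "human_review"        # data-informed: the score is one input for an analyst


# Hypothetical model scores for five transactions.
scores = [0.01, 0.50, 0.97, 0.03, 0.80]
decisions = [route_decision(s) for s in scores]
print(decisions)
```

Where to set the thresholds is itself a business decision: tighter thresholds mean more manual work but fewer automated mistakes.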
The World will continue to become increasingly data-driven; that is why data-informed decision making will be even more relevant. So, embrace data and data science.
It does not hurt, it is fun, and you will help society to achieve its goals. I foresee that the more data-informed and well-organised around data processes you are in the digital information age, the more likely you are to gain, keep or leverage your competitive advantage. In the long run, all your stakeholders will be pleased, starting with your biggest asset: your employees.
I hope you are now more informed about the vital role of data science in business contexts and in society in general. In fact, Data Science has applications not only in business decisions but also across a wide range of verticals, including biostatistics, astronomy and molecular biology (not only the usual suspects like banking, insurance and retail). Wherever you find large amounts of information, you’ll find an application for data science. In my next post, I will focus on the tremendous potential impact data science can have in social impact organisations and in public institutions. That will lead me to discuss the critical role that some Data Science-driven initiatives, such as Data Science for Social Good and DataKind, are having in non-profit organisations.