Horizontal Innovation in Data Science

Pan Wu
14 min read · Sep 3, 2023
Photo credit: Unsplash

Innovation is a key driver of progress and can be found in every field, taking on various forms. In the context of Data Science, innovation can be broadly categorized into two types: vertical and horizontal. Vertical innovation is tailored to a specific field and involves developing new data science solutions, such as machine learning models, in areas that were previously less empowered by data science, such as sales. Horizontal innovation, on the other hand, advances existing data analysis techniques, algorithms, and tools, such as enhancing an experimentation platform for broader adoption. The former addresses domain-specific challenges creatively, while the latter improves rigor and efficiency across the whole Data Science function once adopted by the teams.

In this article, I would like to provide practical insights into how to drive horizontal innovation across the data science team. To illustrate this, I’ll draw upon my experience from a past project. I hope that my reflections can be beneficial by shedding light on potential challenges and opportunities others might face.

Environmental factors and general steps for horizontal innovation

First, several environmental factors determine whether horizontal innovation is likely to take root:

  • [Team size and tooling similarity] The bigger a data science team is, and the more tooling its members share, the more likely horizontal innovation is to happen, since any improvement is amplified across all the data scientists on the team.
  • [Data science maturity stage] The more mature a data science team is, the more likely horizontal innovation is to be needed and to happen. In its early stages, a data science team usually focuses on vertical innovation to prove its value rather than seeking higher rigor and efficiency.
  • [Executive support] Executive support for broad adoption across teams is a key incubating condition, and it is usually easier to obtain under a centralized data science structure (e.g. all DS teams report to a chief data officer) than a decentralized one (e.g. each DS team reports to its respective business group lead).

Once these environmental factors are understood, there are three general steps to make horizontal innovation a reality in your data science team:

  • [Problem identification] It’s essential that the problem is recognized as a common challenge among your colleagues, rather than something only you are experiencing.
  • [Designing the solution] Typically, there are multiple approaches to resolving a problem, and it’s important to ensure that your solution clearly outperforms other alternatives.
  • [Driving adoption] The value of horizontal innovation is realized when embraced by the team. Therefore, ensuring widespread adoption is critical to maximize its impact.

Now that we are familiar with the environmental factors and general steps, let's move into a case study in which I led the establishment of a new metrics foundation framework for LinkedIn's Data Science team in 2018.

Case study: on developing a new metrics foundation framework

My pain point: “I don’t like Pig”

In early 2018, Data Scientists at LinkedIn wore multiple hats to create business value: besides leveraging data analysis to drive business decision-making, much other work, such as building the data foundations powering production metrics and dashboards (work closer to Data Engineering), also fell onto the Data Science team's roadmap.

During quarterly planning, I found a need to surface some extra information in a dashboard for better business insights, but that information was not available in the existing datasets, so I included an item, "adding new columns to a dataset," in my quarterly plan.

This item was considered a somewhat mundane task by Data Scientist standards: necessary for the business, but not too exciting. One needs to read through the legacy code and modify the relevant component to implement the updated logic. That was fine with me, except for the one truly unacceptable part: the legacy code was written in Pig.

Speaking of Pig, it may not be a commonly known language at the moment of writing (Aug 2023), but it was the first Hadoop language (hopefully people still know what Hadoop is) to enable query-like data processing. It gained popularity around 2012–2014 and then lost ground to more SQL-like languages (Hive, Presto), so far fewer companies were still using it to build data pipelines. Since LinkedIn was one of the earlier adopters (link1, link2), many data foundation pipelines were built in Pig, and people developing new pipelines found copying existing code logic easier than rewriting it in alternative solutions (e.g. Hive). As a result, the Pig code base kept growing, and Pig was still a "must-know" language for new members joining the Data Science team in 2018.

I had largely been able to shield myself from writing Pig since joining the company in mid-2017, by creating new dataset pipelines (in Hive) to replace old ones. This time, unfortunately, the logic was simply too complicated to overhaul in a short time, and avoiding Pig was infeasible.

Apache Pig Logo. Source: https://en.wikipedia.org/wiki/Apache_Pig

After a few days of diving into the code, I started to find it unbearable and began to think seriously: could there be an alternative solution that I could push the organization to adopt so we could ditch Pig entirely, so that I (and others) would never need to write Pig again?

Finding the organization's pain point: Logic inconsistency in metrics

It was not easy to convince everyone to ditch Pig. Why? Most people had already learned Pig during their time at the company; some liked the syntax and considered it more powerful than SQL-like languages (e.g. Hive), and some commented that while it wasn't perfect, it wasn't worth the effort to overhaul the entire code base. Even after I demonstrated a better solution (Spark SQL) with far superior execution speed (2–3x faster), people still argued that continuing to use Pig might be an acceptable tradeoff. To my even bigger surprise, I later learned there was a "Pig on Spark" proposal underway, aiming to speed up execution while keeping the Pig syntax. It was clear that strong resistance existed.

Apache Spark Logo. Source: https://en.wikipedia.org/wiki/Apache_Spark

Why didn't people just move on and accept my proposal to use Spark, a clearly trending technology in the industry? Initially, I was confused by this strange phenomenon, but later I found the reason: my pain point ("I don't like Pig") was simply not the organization's pain point at the time. If I wanted to convince the organization to ditch Pig, I needed to find the organization's pain point and develop a solution that addressed it.

Luckily, one big organizational pain point was emerging on the horizon: metric inconsistency. Imagine this situation: a customer sees a supposedly identical number (e.g. annual sales) on one product interface and a different number on another; this seriously impairs customer trust and can lead to customer attrition. The problem occurred simply because the two data pipelines powering the (supposedly identical) numbers on different product surfaces used inconsistent metric calculation logic: one was powered by the newly updated logic, while the other still used the old one.

This surprising observation led to the formation of a horizontal program inside the Data Science team to audit all logic units across the production code base. It took a huge effort to identify the discrepancies and resolve them manually. The team resolved some key discrepancies, but no one could promise this would never happen again. This was considered a big risk for the organization, and a great lever (for me) to align the organization's interests with my own.

Understanding the root cause: scripting language

What's the root cause of the logic inconsistency? There are many contributing factors (e.g. human error, code governance), but in my opinion, one fundamental problem lies in how the data pipelines were designed.

A bit more background on the data pipeline foundation at LinkedIn in 2018 (disclaimer: this is all publicly available information): to democratize data pipeline creation, the company developed a powerful in-house platform, the Unified Metrics Platform (UMP), with which anyone could write Pig/Hive/Presto code and let the platform orchestrate the creation of a dataset or metric. It was a great innovation and made creating a new pipeline much easier (users didn't need to worry about infrastructure complexity; they just wrote the scripting code). UMP adoption was high across the Data Science team, and hundreds of metrics were built on it.

Since most languages used on the platform were scripting languages (e.g. Pig, Hive), anyone who wanted to re-use part of a metric's calculation logic usually had to copy a code block from place A to place B in order to replicate the same logic for another metric calculation in place B. However, if the code block in place A was later updated by its original owner, the code in place B would not be updated accordingly (the owner simply didn't know it had been copied!), leading to logic inconsistencies. This is the root cause.
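To make this concrete, here is a minimal, hypothetical sketch (the table, columns, and "active member" definition are invented for illustration, not LinkedIn's actual metrics): two pipelines each embed their own copy of the same business rule as a Spark SQL string, so when one copy changes, the other silently diverges.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("copy-paste-example").getOrCreate()

// Pipeline A: defines the "active member" rule inline in its query.
val dailyActive = spark.sql("""
  SELECT dt, COUNT(DISTINCT member_id) AS active_members
  FROM events
  WHERE action IN ('view', 'click')      -- "active" definition, copy #1
  GROUP BY dt
""")

// Pipeline B: the same rule, copy-pasted months ago into another script.
// If pipeline A's owner later adds 'share' to the action list, B is never updated.
val dailyActiveByCountry = spark.sql("""
  SELECT dt, country, COUNT(DISTINCT member_id) AS active_members
  FROM events
  WHERE action IN ('view', 'click')      -- "active" definition, copy #2
  GROUP BY dt, country
""")
```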

Visualization on “why logic inconsistency would happen”

Now contrast this with standard software design: a code module (e.g. A) with a specific piece of functionality is usually encapsulated in a single function (or class), has its own unit tests to validate logic correctness, and can be called by other modules (e.g. B) that want to re-use the logic. Whenever the logic in A is updated, it is unit-tested, and the change automatically propagates so that downstream modules (e.g. B) pick it up. No one needs to worry about inconsistent behavior.

Thinking deeply in this direction, I found it fascinating how different the data engineering approach was from software design, and the root cause became clearer. But how to solve the problem? Ditching Pig and replacing it with Hive/Presto, or even Spark SQL, could not fully address the inconsistency problem, because those are still scripting languages written in a "procedural paradigm." Then I looked deeper: Scala is the native language of Spark, and it lets users write code in an "object-oriented programming (OOP) paradigm." If we could encapsulate key logic components and make them directly referenceable, we could transform metric building from a script-writing exercise into an object-oriented design practice. Meanwhile, this paradigm shift would naturally bring along the speed advantage of Spark execution.
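Here is a minimal sketch of what that shift could look like, continuing the hypothetical "active member" example above (illustrative names only, not the actual framework): the shared definition is encapsulated once in a Scala object, can be unit-tested in isolation, and every downstream metric references it instead of copying it.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Single source of truth: the "active member" rule lives in exactly one place,
// so any update here propagates to every caller at the next build.
object ActiveMemberLogic {
  val activeActions: Seq[String] = Seq("view", "click")

  def activeEvents(events: DataFrame): DataFrame =
    events.filter(col("action").isin(activeActions: _*))
}

// Downstream metric modules reference the shared logic instead of copying it.
object DailyActiveMembers {
  def compute(events: DataFrame): DataFrame =
    ActiveMemberLogic.activeEvents(events)
      .groupBy("dt")
      .agg(countDistinct("member_id").as("active_members"))
}

object DailyActiveMembersByCountry {
  def compute(events: DataFrame): DataFrame =
    ActiveMemberLogic.activeEvents(events)
      .groupBy("dt", "country")
      .agg(countDistinct("member_id").as("active_members"))
}
```

A unit test can then feed `ActiveMemberLogic.activeEvents` a tiny in-memory DataFrame and assert on the output, so the rule is validated once, and callers that fall out of sync are caught at build time rather than discovered later as mismatched dashboard numbers.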

Visualization on “how logic inconsistency can be resolved”

This felt exciting and much more promising. My proposed solution was to shift from "writing procedural data pipelines" to "designing the data foundation under an OOP paradigm with logic modularization"; and, as a happy side effect, Pig would no longer be needed :)

Building the Minimum Viable Product (MVP)

I quickly wrote a proposal, boldly named it "the next generation of metrics foundation framework," and claimed benefits including resolving metric inconsistency and dramatically increasing execution speed. Thanks to the support of my managers, it went straight up to data science leadership and attracted attention. I was asked to build a minimum viable product to demonstrate how it would work and to prove the stated benefits.

I vividly remember that on April 21, 2018, during my 12-hour flight back to China, I coded on the airplane for eight hours straight. Without internet access, there was also no debugging support in IntelliJ (the IDE I was using). I wrote roughly a thousand lines of Scala/Spark code to materialize the framework I had conceptualized, regardless of whether it would execute, laying out the structure and logic components (e.g. unit testing) as if it would work. When the flight landed and I had internet access again, I spent another two days debugging all the issues and test-running the workflow. It worked: all the modules were developed in Scala under the proposed design paradigm, each module encoded one key piece of logic and could be referenced easily, and the whole codebase compiled error-free. Meanwhile, with the Spark execution engine, the speed was 3–5 times faster than Pig; I tested this multiple times.

This MVP demonstrated that a data pipeline could be developed in an "object-oriented fashion": one designs the needed data components, checks whether the logic already exists, and builds new components only if needed. This is very different from the previous "procedural approach," in which one copies and pastes existing logic and writes the code mechanically. The code was stored in a new multi-repo codebase (referred to as a multiproduct), and it served its purpose as a demonstration: the MVP went well and leadership acknowledged the advantages of the approach. However, this would be a huge change, and it was still uncertain how others would perceive it, so I was asked to assess its accessibility and adaptability for the organization.

Organizational alignment and feature enhancement

The new framework was embraced by my immediate team (~20 members), as the combination of high-speed performance and design thinking was viewed as valuable. However, we soon encountered additional challenges that hindered the framework's expansion. These challenges can be roughly categorized as follows:

  • Design Challenges: When I developed the MVP, I wasn't aware of several internal libraries that had already been built to standardize data processing (e.g. reading data).
  • Platform Challenges: While my proposal offered a quicker path to expand Spark adoption, our cluster stability wasn’t fully prepared to support this expansion yet.
  • Codebase Governance Challenges: A decision needed to be made regarding whether to establish a single repository (mono-repo) for the entire analytics organization or to permit sub-analytics teams to maintain their individual repositories for faster iteration.
  • Adoption Challenges: The high entry barrier of learning Scala, which was not a standard programming language for DS candidates, posed a challenge for broader adoption across the DS community.

To tackle these challenges, I spent a lot of time discussing, learning, and brainstorming with many senior technical leaders across the data organization; these efforts eventually translated into the following solutions:

  • [Design approach] I collaborated closely with the data infrastructure team to integrate (then) state-of-the-art components, including Dali, into the framework. This made the new framework compatible with, and able to benefit from, future infrastructure advances.
  • [Platform capability] I worked with the Spark infrastructure team, advocating for and aligning leads on a faster path to facilitate Spark adoption. Much effort went into improving cluster stability so that end users (e.g. DS) would feel less pain from platform limitations.
  • [Code governance solution] I sought input and feedback from technical leaders across the analytics organization and brought a proposal to the respective analytics leadership to seek alignment. Eventually, we settled on a middle ground: each major analytics org would have its own code repo so that its code base could be maximally shared internally. This was a reasonable tradeoff between potential code duplication and development efficiency.
  • [Technology adoption enhancement] I refined the architecture by incorporating more Spark SQL components. The framework now uses Scala only to establish the foundational structure, while Spark SQL is the key customizable component through which analytics team members express their specific business logic (a rough sketch of this split follows below). This adjustment aimed to lower the entry barrier for most data scientists, since SQL is a far more commonly used language in the DS community.
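As a rough illustration of that last point (a hypothetical sketch under my own assumptions, not the framework's real API): Scala supplies the skeleton and orchestration hooks, while the only thing a metric owner has to write is a Spark SQL statement.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical skeleton: Scala owns the structure; Spark SQL carries the business logic.
trait MetricDefinition {
  def name: String
  def sql: String                                   // the only part a metric owner writes
  def run(spark: SparkSession): DataFrame = spark.sql(sql)
}

// A data scientist contributes a metric by supplying SQL, not Scala plumbing.
object WeeklySignups extends MetricDefinition {
  val name = "weekly_signups"
  val sql =
    """
      |SELECT weekofyear(dt) AS week, COUNT(*) AS signups
      |FROM signup_events
      |GROUP BY weekofyear(dt)
      |""".stripMargin
}
```

The design intent is that the Scala layer stays stable and centrally maintained, while the SQL payload is the part individual data scientists customize, which keeps the learning curve close to what they already know.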

With these challenges successfully addressed, and with alignment from key technical and organizational leaders across the data organization, the framework was poised for broader expansion.

Roadshow, training, and broad adoption

While we expanded the Scala/Spark metrics framework, others were also finding Spark to be the better tool for more use cases (e.g. Spark SQL is generally faster for daily analytical work). So I was given DS leads' support to drive broader adoption of not only this Scala/Spark metrics-building framework but also Spark itself.

What came next was less technical, but great fun. We formed a Spark Cross-Team Forum, a technical committee of five senior DS from across the analytics teams, to drive Spark adoption inside the Data Science organization (~200 data scientists). To jump-start things, we sought funding from DS leadership to bring in external vendors, and it was swiftly granted. I remember that for the first Spark session we hosted, interest from the DS community was so overwhelming that RSVP registration filled up within 2 minutes of the invitation email being sent out (yes, we had told everyone ahead of time exactly when the RSVP would go out, and I watched the responses come right in), along with a long waiting list. Later, the committee designed more customized training materials, ran different training sessions, and hosted office hours to help address specific problems people encountered. The broad spectrum of these approaches brought Spark and the new metrics framework to many in the Data Science team.

Besides the fun of chairing the committee and working with talented peers, it was very rewarding simply to watch our Spark adoption numbers go up every month. We had an internal dashboard tracking Spark adoption (both for metrics foundation building and for ad-hoc usage in daily analytics), and seeing those numbers climb made us feel our efforts were well worth it. By the time I left the company (around 2021), using Spark to run ad-hoc jobs and build metrics was already common practice.
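The dashboard itself was internal, but the kind of query behind such a number is simple. Here is a hypothetical sketch (the job-history table, its columns, and the existing `spark` session are all assumed for illustration):

```scala
// Hypothetical monthly adoption metric: distinct users who submitted Spark jobs,
// computed from an imagined cluster job-history table. Assumes an existing
// SparkSession named `spark`.
val monthlySparkUsers = spark.sql("""
  SELECT date_format(submit_time, 'yyyy-MM') AS month,
         COUNT(DISTINCT user_name)           AS spark_users
  FROM job_history
  WHERE engine = 'spark'
  GROUP BY date_format(submit_time, 'yyyy-MM')
  ORDER BY month
""")
```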

Key learnings

This concludes my case-study journey of driving horizontal innovation. Upon reflection, the experience was exceptionally satisfying: it not only tested my technical design thinking but also sharpened my organizational skills by requiring broad alignment. Ultimately, it helped solve one of the organization's most important problems and created value for the company.

While my specific case may not directly apply to all, several key learnings could have broader relevance:

  • Thinking beyond the status quo: When facing a problem, it's tempting to stick with existing solutions, but it's crucial to periodically step back and question what truly needs solving. Think deeply about the problem, understand its root causes, and don't just accept the status quo.
  • Building alliances and seeking mentorship: Successfully navigating project intricacies requires strategic partnerships and mentors. These elements were pivotal in ensuring my project’s success. The unwavering support from my managerial chain was indispensable, and guidance from the organization’s technical leadership played a vital role in aligning the solution effectively.
  • Keep on learning: Approach learning with an open mind, absorbing knowledge like a sponge. You never know when insights garnered from past experiences might prove valuable. For instance, my familiarity with Spark dates back to 2014, coupled with a continuous interest in studying its evolution; these guided my intuitive thinking on the general design direction.

Beyond these key takeaways, finding joy in the journey is equally (or even more) important. As I reflect on the process of establishing this framework, numerous moments of delight emerge: troubleshooting Spark executor issues alongside UMP Engineers, engaging with various technical leaders to gain diverse design perspectives, and delivering roadshow presentations to many analytics teams. A genuine sense of joy drives me forward and is indispensable throughout the journey. I hope this article extends that joy to you as well.

— — —

If you enjoyed reading this article, feel free to spread the word by liking, sharing, and commenting. Pan is currently a Senior Data Science Manager at Meta Platform Inc., and you may also follow him on LinkedIn.
