The Three Pillars Of Real-World Data Engineering With Vicente Rubin Del Pino Ruiz, Director Data Engineering, UnitedHealth Group
As more industries strive to become AI-driven, gathering and analyzing vast amounts of data is crucial. As a result, the demand for data engineering continues to grow at a rapid pace, leading to increased interest in developing the best teams and strategies for companies.
In a recent Data for AI event, Vicente Rubin Del Pino Ruiz, Director, Data Engineering at UnitedHealth Group, spoke about the key components of implementing data engineering at the organizational level, as well as the greatest challenges that come with managing these processes. Drawing from his experience in real-world data engineering, he shared the most important aspects of data governance, lineage, and preparation.
Goals and Skills of Data Engineering
Industries are beginning to define the right tasks and teams that constitute this emerging field of data engineering but are often overwhelmed by the endless possibilities for tools and technologies. The key for any company, however, is understanding that data is the main asset. “Everything is about the data,” Vicente emphasizes, “technologies are just tools that you should use depending on your use case, depending on your organization, and depending on how the data is generated and consumed.” Keeping this in mind, businesses and teams can find success by recognizing that no matter the project, the ultimate goal of data engineering is to provide data in a timely manner, in a secure fashion, and with reliability and usability.
To achieve this goal, data engineers must possess a strong skill set spanning data modeling, data processing, and software engineering. To be a competitive candidate, individuals often feel that they need to be an expert at everything. However, finding these data engineering jacks-of-all-trades is not necessary. Instead, managers should focus on building a diverse team with varying skills and experiences rather than requiring impractical knowledge of every tool available.
Data Quality, Data Lineage, Metadata
Equipped with the foundational understanding of data engineering goals and skills, teams should center their work around three key components: data quality, data lineage, and metadata. By focusing on all three aspects, organizations will set their data engineers and projects up for success.
For any data project, you need to know the quality behind the data set. After all, the saying “garbage in, garbage out” holds true. When it comes to data, teams need to be proactive: detecting issues and addressing them before users do, creating an alert system, and building smart data processing overall. Additionally, projects should be informative and automatic, which means making data quality information readily available and eliminating manual intervention when onboarding any new data source. Teams must focus on gathering the right data, communicating clearly how the data was generated, and verifying its quality.
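As a rough sketch of what proactive, automated quality checks with alerting might look like (the rule names, thresholds, and sample rows here are illustrative assumptions, not details from the talk):

```python
# Hypothetical sketch of automated data-quality checks with alerting.
# The rules and sample data are illustrative assumptions.

def check_quality(rows, rules):
    """Run each named rule against every row; return the issues found."""
    issues = []
    for i, row in enumerate(rows):
        for name, rule in rules.items():
            if not rule(row):
                issues.append((i, name))
    return issues

def alert(issues):
    """Stand-in for a real alerting system (email, pager, dashboard)."""
    for row_index, rule_name in issues:
        print(f"ALERT: row {row_index} failed rule '{rule_name}'")

rows = [
    {"record_id": "A1", "age": 42},
    {"record_id": "", "age": -5},   # bad row: missing id, impossible age
]
rules = {
    "id_present": lambda r: bool(r["record_id"]),
    "age_valid": lambda r: 0 <= r["age"] <= 120,
}

issues = check_quality(rows, rules)
if issues:
    alert(issues)   # proactive: engineers hear about problems before users do
```

Because the rules live in a plain dictionary, adding a new data source means adding rules as configuration rather than manual, one-off inspection.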
Next, data lineage should have full coverage, starting from the source system generating the data and reaching the end point of the pipeline. The process should be open, transparent, and accessible to everybody through tools such as graph visualizations and search engines. Key points for data lineage include knowing the source of the data, the system ownership, the frequency of generation, and any dependencies. As with data quality, automating these processes is crucial for saving time in the long run.
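One simple way to picture full-coverage lineage is as a graph that can be walked backwards from any data set to its source systems. The sketch below assumes a toy dependency graph (the dataset names are hypothetical, not from the talk):

```python
# Hypothetical lineage graph: each data set maps to the sets it derives from.
# A real system would also record owners, refresh frequency, and dependencies.

lineage = {
    "claims_raw": [],                                # source system output
    "claims_clean": ["claims_raw"],
    "member_raw": [],                                # source system output
    "claims_report": ["claims_clean", "member_raw"], # end of the pipeline
}

def upstream_sources(dataset, graph):
    """Walk the graph backwards to find everything a data set depends on."""
    seen = set()
    stack = [dataset]
    while stack:
        node = stack.pop()
        for parent in graph.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Every data set upstream of the report, all the way to the source systems.
print(sorted(upstream_sources("claims_report", lineage)))
```

With the graph stored this way, the same structure can feed a visualization tool or a search index so lineage stays accessible to everybody.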
The third key component, metadata, provides complete information to the end users about what is in the data set and allows them to make decisions about the data. Making information available and easy to access is critical in the continued success of data engineering and business.
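A minimal sketch of what a discoverable metadata record might contain follows; the fields and catalog structure are assumptions for illustration, not a description of any particular tool:

```python
from dataclasses import dataclass, field

# Hypothetical metadata record for a data set; the fields are illustrative.
@dataclass
class DatasetMetadata:
    name: str
    owner: str
    refresh_frequency: str
    columns: dict = field(default_factory=dict)  # column name -> description

# A simple in-memory catalog standing in for a real metadata service.
catalog = {}

def register(meta):
    """Publish metadata so end users can discover and evaluate the data set."""
    catalog[meta.name] = meta

register(DatasetMetadata(
    name="claims_clean",
    owner="data-engineering",
    refresh_frequency="daily",
    columns={
        "claim_id": "Unique claim identifier",
        "amount": "Billed amount in USD",
    },
))

print(catalog["claims_clean"].refresh_frequency)
```

The point is not the storage mechanism but that users can answer "what is in this data set, who owns it, and how fresh is it?" without asking the engineering team.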
Challenges in Data Engineering
Even with a strong understanding of the key components outlined above, companies still face challenges across the data project lifecycle. One challenge lies in constructing the ideal data engineering team. In such a competitive market, it is especially difficult to put together a strong team, and it can often be tempting to prioritize speed of hiring out of necessity. However, Vicente emphasizes the importance of building a team with the right skill set, explaining that managers should focus on identifying the necessary skills and building a group of people, not an individual person, that covers all of them. Although this may take more time, facilitating the transfer of knowledge between different people and hiring the right mix of senior and junior engineers will set the team up for success.
Another challenge faced by organizations is in deciding the right architecture for the specific area and the right technology for the job. Vicente highlights the importance of centering the tools around the specific use case rather than starting with the hottest technologies of the moment. Teams should follow the structure of their organizations and employ the architectural design that maximizes the usage of team skills.
Requirements gathering is another significant challenge area, and the lack of documentation for data sources and information about data access can be a major hurdle. This is why teams should again strive to focus on data governance, quality, and metadata throughout each step of the process. When data engineering teams start achieving success in many projects, the consequent challenge becomes prioritization. This is again where the idea of data governance comes into play. Teams need to decide which data set is the right one to process first and then plan appropriately. To support prioritization, constructing appropriately sized teams for the number of projects and the necessary skills is crucial.
As data engineering teams continue to grow and take on more projects, avoiding technical debt becomes crucial. Technical debt is the cost of choosing temporary, short-term developments over more comprehensive, long-term approaches. Cutting corners may provide short-term gains but long-term pains. Even if it takes more time in the moment, teams must focus on building generic, reusable processes driven by configuration and on automating any repetitive task. Once processes are in production, they need support, as everything will eventually fail. This entails allocating support capacity, building an alert system, and implementing best practices for building easy-to-rerun processes. If a support system is put in place early on, then when something does not go as expected, teams can more easily identify the relevant issues and get the project back on track.
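The ideas above, configuration-driven processes that are safe to re-run and that alert when they fail, can be sketched roughly as follows (the config keys and retry policy are assumptions for illustration):

```python
import time

# Hypothetical sketch of a configuration-driven, re-runnable pipeline step.
# The config keys and retry policy are illustrative assumptions.

def run_step(step_fn, config):
    """Run a pipeline step with retries; alert if it ultimately fails.

    Because the step is safe to re-run, recovery is simply running it
    again rather than untangling partial state by hand."""
    for attempt in range(1, config["max_retries"] + 1):
        try:
            return step_fn()
        except Exception as exc:
            print(f"attempt {attempt} failed: {exc}")
            time.sleep(config.get("backoff_seconds", 0))
    print("ALERT: step failed after all retries")  # stand-in for real alerting
    return None

config = {"max_retries": 3, "backoff_seconds": 0}

# A step that fails once and then succeeds, simulating a transient error.
calls = {"n": 0}
def flaky_step():
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient failure")
    return "done"

print(run_step(flaky_step, config))  # recovers and prints "done"
```

Keeping retry counts and backoff in configuration rather than code means the same generic runner serves every step, which is exactly the reusable, automated approach that keeps technical debt down.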
Although the field of data engineering continues to grow rapidly and in unexpected ways, teams that focus on the three key data pillars will be resilient. Don’t lose sight of data quality, data lineage, and metadata. With this foundation, teams can focus on the long-term goals of making data engineering more efficient and valuable to the business.