As Big Data applications grow more complex, end-to-end processing involves more and more tasks. This talk presents a comprehensive walkthrough of data pipeline orchestration and how to approach it. We start with the lifecycle of analytics applications, including data sources, pipelines, wrangling, transformation, integration, feature engineering, model training, algorithm evaluation, productionization, deployment, and operations. Then we look at what a typical data pipeline entails and introduce the concept of data pipeline orchestration. We examine its key features, benefits, drawbacks, major challenges, and trends. Next, we identify the orchestration goals and related tools. We then drill down into leading products, illustrating their main capabilities and functions, followed by comparisons of selected products that highlight their strengths and weaknesses. To help sort out the options, we define a taxonomy to classify dozens of orchestrators, along with selection criteria and usage guidelines. Finally, we go over a case study to demonstrate how to apply pipeline orchestration in real-world projects. Best practices and lessons learned will be discussed during the session as well.
Tony Shan is a renowned thought leader and innovative visionary with decades of field experience and guru-level expertise in cutting-edge enterprise computing technologies. He leads the incubation and nurturing of interdisciplinary practices and enablement for emerging technologies such as IoT, big data, and cloud. He drives award-winning innovation and transformation of the most complex enterprise systems. He directs and advises on the pragmatic lifecycle design of large-scale distributed solutions on diverse platforms in Fortune 500 companies and public sector organizations. He is a regular speaker and chair at preeminent conferences and a judge in IT contests. He is a book author and editor/editorial advisory board member of technology research journals, and he has also founded several user groups and forums.