Talk title: Distributed Computing at Scale: The long-tail problem in a computer system with hundreds of thousands of servers
This talk will address the long-tail problem in a computer system with hundreds of thousands of distributed servers, such as Google’s Cloud datacenter and Alibaba’s Fuxi system. Such a system can run the tasks of an application in parallel and hence execute the application rapidly. However, when the application contains many tasks (e.g. from hundreds to thousands), there will always be a small number of tasks that make slow or no progress (i.e. forming a long tail), thereby delaying the completion of the whole application. Industry has developed some simple solutions, including kill-and-retry, massive task cloning, and speculative execution, but these are costly and ineffective.
We first investigated the root causes of the long-tail problem by analysing real-world datacenter tracelogs, including operational data sets from Google, Alibaba and Adapt (UK), and developed a system model that captures a datacenter’s behavioural characteristics. The model was then trained extensively on the tracelogs and is able to predict the system’s run-time behaviour accurately. Tasks can now be scheduled intelligently based on this behavioural prediction, leading to efficient execution of an application and, at the same time, efficient utilisation of server resources.
Jie Xu is Chair of Computing at the University of Leeds, Executive Board Member of the UK Computing Research Committee (UKCRC), and Director of the EPSRC-funded White Rose Grid e-Science Centre, involving the three White Rose Universities of Leeds, Sheffield and York. He was Head of the Institute for Computational and Systems Science at Leeds, and is now Head of Distributed Systems and Services. He has worked in the field of Distributed Computing Systems for over thirty-five years and has industrial experience in building large-scale networked computer systems. Professor Xu now leads a collaborative research team investigating fundamental theories and models for distributed computing systems, and developing advanced Internet and Cloud technologies with a focus on complex system engineering (e.g. with Rolls-Royce and JLR), energy-efficient computing (e.g. with Google and Alibaba), dependable and secure collaboration (e.g. large-scale data processing and analysis for social science and e-healthcare applications with TPP and X-Lab Ltd), and evolving system architectures (e.g. with BAE Systems).
Professor Xu has led or co-led research projects worth a total of over £25M, mainly from the UK Research Councils, TSB/DTI/InnovateUK, JISC and industrial sources, and was the PI of an EPSRC Platform grant. He has published more than 300 academic papers in areas largely related to exploring and building dependable distributed systems, and has received many research awards, including the BCS/IEE Brendan Murphy Prize 2001 for the best work in distributed systems and networks, the latest Kane Kim Memorial Prize from IEEE ISORC in 2012, the IEEE ISADS Industrial App Award, and the IEEE SOSE best paper award. He is an executive board member of several IEEE conferences and TCs, and advises universities such as CUHK and PolyUHK on their research assessment, UK governmental agencies such as EPSRC and DTI (InnovateUK), and industrial leaders including Lenovo, Huawei, and Alibaba. He is the co-founder and director of Edgetic Ltd (UK), a university spin-out company, which was awarded a cash investment of over £1.1M in 2017.
Professor Xu received a PhD in Computing Science from the University of Newcastle upon Tyne, and moved to the University of Durham in 1998 as the founding head of the Durham Distributed Systems Engineering group. He was Professor of Distributed Systems at Durham before joining the School of Computing at Leeds in 2004. He is also a visiting/guest professor at the University of Newcastle upon Tyne, Beihang University, NUDT, and Chongqing University in China.