Profiling Large-scale Live Video Streaming and Distributed Applications
Abstract
Today, distributed applications run at data centre and Internet scales, from intensive data
analysis, such as MapReduce, to serving the dynamic demands of a worldwide audience, such
as YouTube. The network is essential to these applications at both scales. To provide
adequate support, we must understand the full requirements of the applications, which
are revealed by the workloads. In this thesis, we study distributed system applications
at different scales to enrich this understanding.
Large-scale Internet applications, such as social networking services (SNS), video on
demand (VoD), and content delivery networks (CDN), have been studied for years. An
emerging type of video broadcasting on the Internet, crowdsourced live video
streaming, has garnered attention, with platforms such as Twitch attracting over 1
million concurrent users globally. To better understand Twitch, we collected real-time
popularity data combined with metadata about the content and found that the broadcasters,
rather than the content, drive its popularity. Unlike YouTube and Netflix, where content
can be cached, video streams on Twitch are generated live and need to be
delivered to users immediately to enable real-time interaction. Thus, we performed a
large-scale measurement of Twitch's content locations, revealing the global footprint of its
infrastructure and uncovering the dynamic stream hosting and client redirection
strategies that help Twitch serve millions of users at scale.
We next consider applications that run inside the data centre. Distributed computing
applications rely heavily on the network, both for data transmission and for the scheduling
of resources and tasks. One successful application, Hadoop, has been widely
deployed for Big Data processing. However, little work has been devoted to understanding
its network traffic. We found that Hadoop's behaviour is constrained by the hardware
resources available and the jobs being processed. Thus, after characterising Hadoop traffic
on our testbed with a set of benchmark jobs, we built a simulator to reproduce Hadoop's job
traffic. With the simulator, users can investigate the connections between Hadoop traffic and
network performance without additional hardware cost. Different network components,
such as network topologies, queue policies, and transport layer protocols, can be added
to investigate their effect on performance.
In this thesis, we extended the knowledge of networking by investigating two widely used
applications, one in the data centre and one at Internet scale. We (i) studied the most
popular live video streaming platform, Twitch, as a new type of Internet-scale distributed
application, revealing that broadcaster factors drive the popularity of such platforms;
(ii) discovered the footprint of the Twitch streaming infrastructure and its dynamic
stream hosting and client redirection strategies, providing an in-depth example of video
streaming delivery at Internet scale; (iii) investigated the traffic
generated by a distributed application by characterising the traffic of Hadoop under
various parameters; and (iv) with this knowledge, built a simulation tool so that users can
efficiently investigate the performance of different network components under distributed
applications.
Authors
Deng, Jie