Draft 1

2018-03-11 16:44:31 +00:00
parent 85f944d4d8
commit b46a568197
14 changed files with 1769 additions and 84 deletions


@@ -1,7 +1,6 @@
\documentclass{l4proj}
\usepackage{url}
\usepackage{natbib}
\usepackage{hyperref}
\usepackage{fancyvrb}
\usepackage[final]{pdfpages}
@@ -59,9 +58,7 @@
\maketitle
\begin{abstract}
In the past few years, data visualisation tools on the web have become increasingly popular. D3, a JavaScript library, has a module that focuses on simulating physical forces on particles for creating force-directed layouts. However, the currently available algorithm does not scale well for multidimensional scaling. To solve this problem, the Hybrid Layout algorithm and its pivot-based near neighbour search enhancement were implemented and integrated with the D3 module. D3's existing algorithm and Bartasius' implementation of Chalmers' 1996 algorithm were also optimised for this use case and compared against the Hybrid algorithm. Furthermore, experiments were performed to evaluate the impact of each user-defined parameter. The results show that for larger data sets, the Hybrid Layout consistently produces fairly good layouts in less time. It can also handle larger data sets than D3's algorithm.
\end{abstract}
%\educationalconsent
@@ -81,10 +78,39 @@ style file l4proj.cls
\pagenumbering{arabic} % ONLY DO THIS AT THE FIRST CHAPTER
\section{Motivation}
In the age of Web 2.0, new data are being generated at an overwhelming speed. Raw data made up of numbers, letters, and boolean values is hard for humans to comprehend and infer relations from. To make data easier and faster for us, humans, to understand, various techniques were created to map raw data to a visual representation.
A data set may have many dimensions, while humans live in a 3D space, leading to the challenge of dimensionality scaling. There are many approaches to this problem, each with its own pros and cons. One such approach is multidimensional scaling (MDS), which highlights the similarity and clustering of data to the audience. The idea is to map each data point to a particle in 2D space and place the particles so that the distance between each pair in 2D space represents the distance in high-dimensional space.
With the recent trend of moving away from traditional native applications to easily accessible cross-platform web applications, many data visualisation toolkits for JavaScript, such as Google Charts, Chart.js and D3.js, are emerging. With these frameworks, it is easier for website designers and content creators to create interactive, attention-grabbing infographics, allowing more people to understand their work with less cognitive load.
\begin{figure}[h]
\centering
\includegraphics[height=4.5cm]{d3-samples/d3-horizons-chart.png}
~
\includegraphics[height=4.5cm]{d3-samples/d3-radial-box.png}
\includegraphics[height=4.5cm]{d3-samples/d3-wordcloud.png}
~
\includegraphics[height=4.5cm]{d3-samples/d3-sunburst-partition.png}
\caption{Several different data visualisations based on the D3 framework: Horizon Chart (top left), Radial Boxplot (top right), Word Cloud (bottom left) and Sunburst Partition (bottom right).}
\label{fig:intro_d3Samples}
\end{figure}
One of the most popular free open-source data visualisation libraries is Data Driven Documents\cite{D3}\cite{D3Web}. The premise is to bind arbitrary raw data to web page content and then apply data-driven transformations, breathing life into it, all while using only standard web technologies and avoiding any restriction from proprietary software. This makes the library highly accessible, allowing applications to reach a wider audience.
The D3-Force module, part of the D3 library, provides a framework for simulating physical forces on particles. Along with that, a spring model algorithm was also implemented to allow the creation of force-directed layouts. While the implementation is fast for several thousand particles, it does not scale well to larger data sets, in terms of both memory and time complexity. Solving these issues would expand the use cases covered by the module to very large data sets. The motivation of the project is to address these scalability issues with better algorithms from the School of Computing Science.
\section{Project Description}
The University of Glasgow's School of Computing Science has some of the fastest force-directed layout drawing algorithms in the world. Among these are Chalmers' 1996 Neighbour and Sampling technique ($O(N^2)$)\cite{Algo1996}, the 2002 Hybrid Layout algorithm\cite{Algo2002} and its 2003 enhanced variant\cite{Algo2003}. These algorithms provide huge improvements in terms of both time and memory complexity. However, they are only implemented in an older version of Java, which limits their practical use. In 2017, Bartasius implemented the 1996 algorithm along with several others, and built a visual interface to compare the algorithms against each other\cite{LastYear}.
In short, the goal of the project is to
\begin{itemize}
\item implement the Hybrid Layout algorithm from the School of Computing Science in JavaScript
\item integrate the implementation into the D3 framework and Bartasius' tool set
\item optimise the existing implementations of the basic spring model and Chalmers' algorithm for a fair comparison
\item evaluate and compare the algorithms against each other
\end{itemize}
\section{Outline}
The remainder of the report will discuss the following:
@@ -93,7 +119,7 @@ The remainder of the report will discuss the following:
\item \textbf{Design} This chapter discusses the choice of technologies.
\item \textbf{Implementation} This chapter briefly presents the decisions and justifications made during the implementation, along with several code snippets.
\item \textbf{Evaluation} This chapter will detail the process used to compare the performance of each algorithm, starting from the experiment design to the final result.
\item \textbf{Conclusion} This chapter gives a brief summary of the project, reflects on the process in general, and discusses possible future improvements.
\end{itemize}
@@ -106,24 +132,45 @@ The remainder of the report will discuss the following:
%%%%%%%%%%%%%%%%
\chapter{Background}
\label{ch:bg}
With the emergence of complex data with more than 3 dimensions, there is a growing need to map high-dimensional data down to 2D space. Figure \ref{fig:bg_many_multidimension} shows several approaches to the problem. One of the earliest methods is to align graphs on the basis of one axis they all share. While it is still being used, its use cases are limited because all graphs have to share an axis. On the other hand, a scatterplot matrix draws a scatterplot for every pair of dimensions, allowing users to see relations between many different dimensions. However, the screen space usage rises quadratically, making it unsuitable for high-dimensional data.
\begin{figure}[h]
\centering
\begin{subfigure}{0.45\textwidth}
\includegraphics[height=6cm]{d3-samples/d3-single-axis-composition.png}
\caption{Single-axis composition}
\end{subfigure}
\begin{subfigure}{0.45\textwidth}
\includegraphics[height=6cm]{d3-samples/d3-scatterplot-matrix.png}
\caption{Scatterplot Matrix}
\end{subfigure}
\caption{Different approaches to visualise high-dimensional data}
\label{fig:bg_many_multidimension}
\end{figure}
Unlike the techniques introduced above, multidimensional scaling (MDS) aims to reduce data dimensionality by preserving the level of similarity, rather than the values themselves.
Classical MDS\cite{cMDS} achieves this goal by creating new dimensions for scatter-plotting, each made up of a linear combination of the original dimensions, while minimising a loss function called strain, a function similar to stress.
Because strain assumes Euclidean distances, it is incompatible with other dissimilarity ratings; metric multidimensional scaling improves upon classical MDS by generalising the solution to support a variety of loss functions\cite{mcMDS}. However, the disadvantage of $O(N^3)$ time complexity remains, and the limitation of linear combinations can be apparent in some data sets.
This project focuses on several non-linear MDS algorithms using force-directed layout. The idea is to attach each pair of data points with a spring whose equilibrium length is proportional to the high-dimensional distance between the two points, although the spring model we know today does not use Hooke's law to calculate the spring force\cite{Eades}. Several improvements to the idea have been introduced over the past decades. For example, the concept of 'temperature' proposed by Fruchterman and Reingold\cite{SpringTemp} solves the problem where the system is unable to reach an equilibrium state and improves execution time. The project focuses on an iterative spring-model-based algorithm introduced by Chalmers\cite{Algo1996} and the hybrid approach, both of which will be detailed in subsequent sections of this chapter.
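As a rough sketch of the idea (using a generic formulation rather than any particular paper's notation), the force applied along the spring connecting data points $i$ and $j$ is proportional to the error between their current 2D separation and the desired distance:
\[
F_{ij} \propto \lVert p_i - p_j \rVert - d_{ij},
\]
where $p_i$ and $p_j$ are the 2D positions of the particles and $d_{ij}$ is the high-dimensional distance between the two data points.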
A number of non-linear MDS algorithms have also been introduced in the last few years. t-distributed Stochastic Neighbour Embedding (t-SNE)\cite{tSNE}, for example, is very popular in the field of machine learning. It is based on SNE\cite{SNE}, where probability distributions are constructed over pairs of data points such that more similar objects have a higher probability of being picked. The distributions derived from the high-dimensional and low-dimensional distances are then compared, and the 2D placement of each data point is iteratively adjusted to minimise the difference between the two distributions. The downside is that it has both time and memory complexity of $O(N^2)$ per iteration. In 2017, Bartasius\cite{LastYear} implemented t-SNE in D3 and found that not only is it the slowest algorithm in his tests, the produced layout is also many times worse in terms of the Stress metric, which will be introduced in section \ref{sec:bg_metrics}.
The rest of this chapter describes each algorithm and the performance metrics used in this project in detail.
\section{Link Force}
\label{sec:linkbg}
The D3 library, which will be described in section \ref{sec:des_d3}, has several different force models implemented in its Force module for creating force-directed graphs. One of them is the link force. In this model, a force is applied between the two nodes at the ends of each link. The force pushes the nodes together or apart with varying strength, proportional to the error between the desired and current distance on the graph. Essentially, it has the same basis as the spring model. An example of a graph produced by the D3 link force is shown in figure \ref{fig:bg_linkForce}. In MDS, where the high-dimensional distance between every pair of nodes can be calculated, a link is created to represent each distance, resulting in a complete graph.
\begin{figure}[h]
\centering
\includegraphics[height=9.2cm]{d3-samples/d3-force.png}
\caption{An example of a graph produced by D3 link force.}
\label{fig:bg_linkForce}
\end{figure}
The link force algorithm is inefficient. In each time step (iteration), a calculation has to be done for each pair of nodes connected by a link. This means that for MDS with $N$ nodes, the algorithm has to perform $N(N-1)$ force calculations per iteration, essentially $O(N^2)$. It is also believed that the number of iterations required to create a good layout is proportional to the size of the data set, hence a total time complexity of $O(N^3)$.
The model also caches the desired distance of each link in memory to improve speed across many iterations. While this greatly reduces the number of calls to the distance-calculating function, the memory complexity also increases to $O(N^2)$. Because the JavaScript memory heap is limited, it runs out of memory when trying to process a fully-connected graph of more than ten thousand data points.
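To make the cost concrete, below is an illustrative sketch of one link-force iteration over a complete graph; it is not D3's exact source, and the variable and parameter names are only indicative.
\begin{lstlisting}
// Illustrative: one link-force iteration over a complete graph.
// `links` caches the desired distance of every node pair: O(N^2) memory.
for (const link of links) {            // O(N^2) links for N nodes
  let x = link.target.x - link.source.x;
  let y = link.target.y - link.source.y;
  const l = Math.sqrt(x * x + y * y);  // current 2D distance
  // Strength proportional to the error between current and desired distance.
  // (l === 0 is handled by jiggle(), discussed in the Implementation chapter.)
  const k = ((l - link.distance) / l) * strength;
  link.target.vx -= x * k; link.target.vy -= y * k;
  link.source.vx += x * k; link.source.vy += y * k;
}
\end{lstlisting}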
\section{Chalmers' 1996 algorithm}
@@ -141,8 +188,6 @@ Previous evaluations indicated that the quality of the produced layout improves
\label{sec:bg_hybrid}
In 2002, Alistair Morrison, Greg Ross, and Matthew Chalmers introduced a multi-phase approach, based on Chalmers' 1996 algorithm, to reduce the run time down to $O(N\sqrt{N})$. This is achieved by calculating the spring forces over a subset of the data and interpolating the rest onto the 2D space\cite{Algo2002}.
In this hybrid layout method, the $\sqrt{N}$ sample objects ($S$) are first placed on the 2D space using the 1996 algorithm. The complexity of this step is $O(\sqrt{N}\sqrt{N})$, or $O(N)$. After that, each remaining object $i$ is interpolated as described below.
\begin{enumerate}
\item \label{step:hybridFindPar} Find the 'parent' object $x\in{S}$ with the minimal high-dimensional distance to $i$. This is essentially a nearest neighbour search problem.
@@ -213,7 +258,7 @@ To compare different algorithms they have to be tested against the same set of p
\end{itemize}
\section{Summary}
In this chapter, several techniques for visualising multidimensional data have been explored. As the focus of the project is on three spring-model-based algorithms, the theory of each method has been discussed. Finally, in order to measure the performance of each algorithm, different metrics were introduced; they will be used in the evaluation process.
%==============================================================================
%%%%%%%%%%%%%%%%
@@ -223,21 +268,73 @@ In this chapter, different techniques of multidimensional scaling have been expl
%%%%%%%%%%%%%%%%
\chapter{Design}
\label{ch:design}
This chapter discusses the decisions made when selecting technologies and libraries during the development process. It also briefly describes each technology, the available alternatives, and Bartasius' application, upon which this project is built.
\section{Technologies}
With the goal of reaching as wide an audience as possible, the project advisor set a requirement that the application must run on a modern web browser. This section briefly introduces the web technologies used to develop the project.
%============================
\subsection{HTML, CSS, and SVG}
HTML and CSS are the two core technologies used to build web pages. HTML (Hypertext Markup Language) describes the structure and content of a web page. CSS (Cascading Style Sheets) defines the visual layout of the page. The latest major versions of the standards are HTML5 and CSS3, both of which are currently supported by all major web browsers. Modern web applications cannot avoid these standards, and this project is no exception. Aside from the user interface, this project relies heavily on the SVG support in the HTML standard to render the produced layouts.
SVG (Scalable Vector Graphics) is an open XML-based vector image format. HTML5 allows SVG to be embedded directly in an \texttt{<svg>..</svg>} tag. In this project, an SVG is used as a base canvas to display the produced graphics. Each data point is then drawn as a circle, as shown in figure \ref{fig:des_svgobject}.
\begin{figure}[h]
\centering
\includegraphics[height=5cm]{images/svgobject.png}
\caption{An example of SVG document representing data points.}
\label{fig:des_svgobject}
\end{figure}
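For illustration, a minimal SVG document of this shape (with made-up coordinates) might look as follows:
\begin{lstlisting}
<svg width="600" height="400">
  <!-- each data point is drawn as a circle at its current 2D position -->
  <circle cx="120" cy="85" r="3"></circle>
  <circle cx="131" cy="92" r="3"></circle>
</svg>
\end{lstlisting}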
%============================
\subsection{JavaScript}
\label{ssec:des_js}
JavaScript is the most common high-level scripting language for web pages. It is dynamic, untyped and multi-paradigm, supporting event-driven, functional, prototype-based, and object-oriented programming styles.
It mostly runs in the client's browser interpreter. Many APIs are designed for manipulating the HTML content, allowing programmers to create dynamic web pages whose content changes in response to a variety of events.
Alternative languages such as CoffeeScript and TypeScript are emerging, each adding more features, syntactic sugar, or syntax changes to improve code readability. However, in order to run on browsers, these languages have to be compiled back to JavaScript. The availability of learning resources is also leagues behind JavaScript's. For these reasons, standard JavaScript was chosen for this project.
Being an interpreted high-level language, JavaScript is relatively slow and only single-threaded due to limitations of the standard APIs. asm.js is an effort to optimize JavaScript by using only a restricted subset of its features. It is intended to be a compilation target for statically-typed languages such as C/C++ rather than a language to code in\cite{asmjs}. Existing JavaScript engines may gain performance from asm.js' restrictions, such as a preallocated heap reducing the load on the garbage collector. Firefox and Edge also recognize asm.js and compile the code to assembly ahead-of-time (AOT), eliminating the need to run the code through an interpreter entirely and resulting in a significant speed increase\cite{asmjsSpeed}. However, the D3 library still uses standard JavaScript, and a large chunk of it would have to be ported in order to compare different algorithms fairly. Since a lot of effort would be required to improve performance significantly on two browsers and only marginally on others, asm.js was not selected for this project.
WebAssembly (wasm) is another recent contender. Unlike JavaScript, it is a binary format designed to run alongside JavaScript on the same sandboxed stack machine\cite{WebAssembly}. Like asm.js, it is intended to be a compilation target. With support for additional CPU instructions not available in JavaScript, it also performs predictably better than asm.js. Having only exited the preview phase in March 2017, its support was not widespread and learning resources were hard to find. It also carries the risk of not being widely adopted by browsers. As a result, WebAssembly was not considered a viable option.
%============================
\section{Data Driven Document}
\label{sec:des_d3}
Data Driven Documents (D3 or D3.js)\cite{D3}\cite{D3Web} is one of the most popular JavaScript libraries for interactive data visualisation in web browsers. The focus is to bind data to DOM (Document Object Model) elements in HTML or SVG and apply data-driven transformations to make the visualisation visually appealing. Its modular, free, and open-source nature also makes it flexible: many visualisation algorithms can be easily integrated into it. In this project, aside from the Force Link algorithm, the complicated process of translating velocities and locations onto an SVG document is handled by the D3 library, allowing the project to focus on the algorithms.
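As a hedged sketch of this data-binding premise (the node list and attribute values are illustrative, using D3's standard selection API):
\begin{lstlisting}
// Illustrative: bind raw data to DOM elements and apply a
// data-driven transformation using D3's selection API.
const nodes = [{x: 40, y: 25}, {x: 90, y: 60}, {x: 150, y: 30}];
d3.select("svg").selectAll("circle")
  .data(nodes)                 // bind one datum to each element
  .enter().append("circle")    // create elements for unbound data
  .attr("cx", d => d.x)        // drive attributes from the bound data
  .attr("cy", d => d.y)
  .attr("r", 3);
\end{lstlisting}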
There are several other data visualisation libraries, such as Google Charts and Chart.js. However, most of them do not support force-directed layouts and are not as flexible. Besides being a requirement set by the project advisor, D3 is the only sensible choice for this project.
%============================
\section{Bartasius' D3 Neighbour Sampling plug-in}
In 2017, Bartasius implemented Chalmers' 1996 algorithm and several other algorithms for his level 4 project at the School of Computing Science. All source files are released on GitHub under the MIT license. To reduce the amount of duplicated work, the project advisor recommended using the repository as groundwork upon which to implement the other algorithms.
%============================
\subsection{Input Data}
The data is one of the most important elements of the project. Without it, nothing can be visualised. Since the data may consist of many different features (attributes), each with a unique name, it makes sense to store each data point (node) as a JavaScript object, a collection of key:value pairs. To conform with the D3 API, all nodes are stored in a list (array).
Two example data structures are shown in figure \ref{fig:des_jsobject}.
\begin{figure}
\centering
\includegraphics[height=7cm]{images/jsobj.png}
\caption{Examples of the data structure used to store the input data. The left shows nodes from the Poker Hands data set; the right shows nodes from the Antartica data set.}
\label{fig:des_jsobject}
\end{figure}
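For illustration, a node list of this shape might look like the following; the field names are hypothetical, merely echoing the data sets shown in the figure:
\begin{lstlisting}
// Hypothetical node list: one object per data point, one key:value
// pair per feature, stored in an array to conform with the D3 API.
const nodes = [
  { S1: 1, C1: 10, S2: 1, C2: 11, CLASS: 9 },
  { S1: 2, C1: 11, S2: 3, C2: 13, CLASS: 8 }
];
\end{lstlisting}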
%============================
\subsection{Graphical User Interface}
Due to the sheer number of experiments to run, manually changing function and file names between runs is a tedious task. Bartasius developed a GUI for the plug-in to ease the testing process. For this project, modifications were made to accommodate the newly implemented algorithms.
Figure \ref{fig:des_gui} shows the modified GUI used in this project. At the top is the canvas on which the produced layout is drawn. The controls below are divided into three columns. The left column controls the data set input, rendering, and iteration limit. The middle column is a set of radio buttons and sliders for selecting the algorithm and parameters to use. The right column contains a list of distance functions to choose from.
\begin{figure}
\centering
\includegraphics[height=10cm]{images/GUI.png}
\caption{The graphical interface.}
\label{fig:des_gui}
\end{figure}
%============================
%==============================================================================
@@ -250,9 +347,9 @@ something
\label{ch:imp}
\section{Outline}
The D3-force module provides a simplified Simulation object to control the various calculations. Each Simulation contains data point nodes and Force objects. Interfaces are defined allowing each Force to access the node list. To keep track of positions, each node is assigned values representing its current location and velocity vector. These values can then be used by the application to draw a graph. In each constant unit time step (iteration), the Simulation triggers a function in each Force object, allowing it to add values to each particle's velocity vector, which is then added to the particle's position.
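As a sketch of that interface (the skeleton follows d3-force's documented custom-force conventions; the force body is illustrative):
\begin{lstlisting}
// Skeleton of a custom Force conforming to the d3-force interface.
function customForce() {
  let nodes;
  function force(alpha) {       // triggered by the Simulation each iteration
    for (const node of nodes) {
      node.vx += 0;             // add this force's contribution
      node.vy += 0;             // to each particle's velocity vector
    }
  }
  force.initialize = ns => { nodes = ns; };  // receives the node list
  return force;
}
\end{lstlisting}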
Because D3-force is a library to be built into other web applications, the implemented algorithms cannot be used on their own. Fortunately, as part of Bartasius' level 4 project in 2017, a web application for testing and evaluation had already been created, with a graphical user interface designed to allow the user to easily select an algorithm, data set, and parameter values. Various distance functions, including one specifically created to handle the Poker Hands data set\cite{UCL_Data} which will be used for evaluation (section \ref{sec:EvalDataSet}), are also in place and fully functional.
The csv-formatted data file can be loaded locally. It is then parsed using the Papa Parse JavaScript library\cite{PapaParse} and put on the simulation.
Depending on the distance function, the per-dimension mean, variance, and other attributes may also be calculated. These values are used in the general distance functions to scale the values of each feature properly. The D3 simulation layout is shown on an SVG canvas with zoom functionality to allow graph investigation. The distance function scaling was tweaked to only affect rendering and not the force calculation.
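A minimal sketch of this loading path, assuming Papa Parse's standard configuration options:
\begin{lstlisting}
// Sketch: parse a locally loaded CSV file and put the rows on the
// simulation. `file` would come from an <input type="file"> element.
Papa.parse(file, {
  header: true,          // the first CSV row holds the feature names
  dynamicTyping: true,   // convert numeric fields from strings
  complete: results => {
    simulation.nodes(results.data);   // hand the parsed nodes to D3
    simulation.alpha(1).restart();
  }
});
\end{lstlisting}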
@@ -324,6 +421,8 @@ After optimisation, the execution time decreases marginally while memory consump
\label{fig:imp_linkComparison}
\end{figure}
Next, the \texttt{jiggle()} function was assessed. As shown in lines 5-7 of code \ref{lst:impl_LinkD3}, in cases where two nodes are projected to be at the exact same location, \texttt{x}, \texttt{y} and, in turn, \texttt{l} could be 0. This would cause a divide-by-zero error in line 8. Rather than throwing an error, JavaScript returns the result as Infinity. Any subsequent arithmetic operation with other numbers, except for modulus, results in either Infinity or -Infinity, effectively deleting the coordinate and velocity values from the entire system. To prevent such errors, when \texttt{x} or \texttt{y} is calculated to be zero, D3 replaces the value with a very small random number generated by \texttt{jiggle()} instead. While extremely unlikely, there is still a chance that \texttt{jiggle()} will return 0. This case can rarely be observed when all nodes are initially placed at the exact same position. To counter this, I modified \texttt{jiggle()} to re-draw a random number until a non-zero value is found.
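A minimal sketch of the modified \texttt{jiggle()} (the $10^{-6}$ scale follows D3's original implementation):
\begin{lstlisting}
// Modified jiggle(): re-draw until a non-zero value is produced.
function jiggle() {
  let v = 0;
  do {
    v = (Math.random() - 0.5) * 1e-6;  // same scale as D3's original
  } while (v === 0);
  return v;
}
\end{lstlisting}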
Finally, a feature was added to track the average force applied to the system in each iteration. A threshold value can be set so that once the average force falls below the threshold, a user-defined function is called. In this case, a handler was added to Bartasius' application to stop the simulation. This feature will be heavily used in the evaluation process (section \ref{ssec:eval_termCriteria}).
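Since the exact API is not shown here, the following usage sketch uses hypothetical method names:
\begin{lstlisting}
// Hypothetical usage of the average-force threshold feature: stop the
// simulation once the average applied force falls below the threshold.
force.forceThreshold(0.66, () => {  // hypothetical hook name
  simulation.stop();                // the handler added to Bartasius' app
});
\end{lstlisting}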
%============================
\subsection{Chalmers' 1996}
@@ -375,7 +474,7 @@ The D3 API extensively with the Method Chaining design pattern. The main idea is
}
\end{lstlisting}
As shown in code \ref{lst:impl_HybridUsage}, the algorithm-specific parameters for each Chalmers' force object are set in advance by the user. Since the Hybrid object interacts with the Simulation and force-calculation objects via common interfaces, other force calculators could potentially be used without having to modify the Hybrid object. In fact, D3's original implementation of Force Link also works with the Hybrid object. To terminate the forces in the first and last phases, the Hybrid object has an internal iteration counter that stops the force calculations after a predefined period of time. In addition, the applied-force threshold events are also supported as an alternative termination criterion.
For interpolation, two separate functions were created, one for each method. After the parent is found, both functions call the same third function to handle the rest of the process (steps 2 to 8 in section \ref{sec:bg_hybrid}).
@@ -495,7 +594,7 @@ An alternative method is to stop when a condition is met. One such condition pur
\label{fig:eval_stressVeloOverTime}
\end{figure}
Since stress takes too long to calculate every iteration, the selected termination criterion is the average force applied per node. This criterion is used for all three algorithms for consistency. The cut-off constant is then manually selected for each algorithm and each subset used. Link Force's threshold is a value low enough that there are no visible changes and stress has reached near its minimum. The Chalmers' threshold is the lowest possible value that will be reached most of the time. It is interesting to note that with bigger subsets of the Poker Hands data set, the threshold rises and converges to 0.66 from 3,000 data points onward.
By selecting this termination condition, the goal of the last phase of the Hybrid Layout algorithm is flipped. Rather than performing the Chalmers' algorithm over the whole data set to correct interpolation errors, the interpolation phase's role is to help the final phase reach stability quicker. Thus, the parameters of the interpolation phase cannot be evaluated on their own. Taking more time to produce a better interpolation result may or may not affect the number of iterations in the final phase, creating the need to balance the time spent against the time saved by interpolation.
@@ -687,7 +786,7 @@ As for the stress, a relative value is used for comparison. Figure \ref{sfig:eva
Comparing the produced layouts at 10,000 data points (figure \ref{fig:eval_Poker10k}), Hybrid better reproduces the space between large clusters seen in the Link Force layout. For example, "Unrecognized" (blue) and "One pair" (orange) have a clearer gap; "Two pairs" (green) and "Three of a kind" (red) overlap less; "Three of a kind" and "Straight" (brown) mix together in Chalmers' layout but are more separated in the Hybrid layout. However, for other classes with fewer data points (coloured brown, purple, pink, ...), the Hybrid layout fails to form clusters, causing them to spread out even more. The same phenomenon can be observed at 100,000 data points (figure \ref{fig:eval_Poker100k}).
\begin{figure}[h] % Poker 100k
\centering
\begin{subfigure}[t]{0.6\textwidth}
\includegraphics[width=\textwidth]{layout/Poker100kNeighbour.png}
@@ -702,33 +801,9 @@ Comparing the produced layout, at 10,000 data points (figure \ref{fig:eval_Poker
\label{fig:eval_Poker100k}
\end{figure}
The area where the 1996 and Hybrid algorithms fall short is consistency in layout quality with smaller numbers of data points. Sometimes, both algorithms stop at a local minimum of stress instead of the global one, resulting in an inaccurate result. Figures \ref{fig:eval_IrisBad} and \ref{fig:eval_Poker100Bad} show examples of such occurrences. If the 1996 algorithm is allowed to continue the calculation, the layout will eventually reach the true stable position, depending on when the right combination of the $Samples$ set is randomized to knock the system off its locally stable position.
\begin{figure}[h] % Iris BAD
\centering
\begin{subfigure}[t]{0.45\textwidth}
\includegraphics[height=4cm]{layout/IrisNeighbour.png}
@@ -757,6 +832,34 @@ The area where the 1996 and Hybrid algorithm fall short is the consistency in th
\caption{Variations of the result from the Hybrid on 100 data points from the Poker Hands data set with the same parameters.}
\label{fig:eval_Poker100Bad}
\end{figure}
\begin{figure} % Antartica
\centering
\begin{subfigure}[t]{0.6\textwidth}
\includegraphics[width=\textwidth]{layout/AntarticaLinkDay.png}
\caption{Link Force}
\end{subfigure}
\end{figure}
\begin{figure}
\centering
\ContinuedFloat
\begin{subfigure}[t]{0.6\textwidth}
\includegraphics[width=\textwidth]{layout/AntarticaNeighbourDay.png}
\caption{Chalmers' 1996}
\end{subfigure}
\end{figure}
\begin{figure}
\centering
\ContinuedFloat
\begin{subfigure}[t]{0.6\textwidth}
\includegraphics[width=\textwidth]{layout/AntarticaHybridDay.png}
\caption{Hybrid Layout}
\end{subfigure}
\caption{Visualisations of the Antartica data set, color-keyed to Day.}
\label{fig:eval_Antartica}
\end{figure}
Moving to the Antartica data set with its more complicated pattern, all three algorithms produce very similar results (figure \ref{fig:eval_Antartica}). The big clustering difference is located around the top centre of the image. In Link Force, day 17 (brown) and day 18 (lime) are lined up clearly, compared to the others, which fail to replicate this fine detail. The Hybrid layout also fails to distinguish days 17, 18, 19 (pink) and 20 (grey) from each other in that corner. Aside from that, Hybrid forms a layout slightly more similar to that of Force Link. Considering that the times used by Link Force, 1996, and Hybrid are approximately 14, 8, and 3.2 seconds respectively, it is hard to argue against using the Hybrid layout.
%============================
\section{Summary}
@@ -779,15 +882,19 @@ Overall, these algorithms are all valuable tools. It depends on the developer to
\chapter{Conclusion}
\label{ch:conc}
\section{Summary of the project achievements}
The following summarises the work carried out over the course of two semesters.
\begin{itemize}
\item \textbf{Studied algorithms:} Each algorithm and the relevant research were studied and understood.
\item \textbf{Researched and assessed libraries:} Open-source JavaScript libraries were looked into. The D3-force module was inspected and assessed for potential faults.
\item \textbf{Modified D3 Link Force:} The D3 Link Force implementation was forked and optimized for use with a complete graph, such as in multidimensional scaling with the spring model. An applied-force tracker was also added, allowing the user to stop the simulation once the force has stabilised.
\item \textbf{Modified d3-neighbour-sampling plug-in:} The Chalmers' implementation was tweaked to scale the applied force against a constant, rather than relying on a decreasing value. The applied-force tracker was added, and the evaluation application interface was updated to include the newer algorithms.
\item \textbf{Implemented interpolation algorithms for hybrid layout:} The interpolation algorithm was implemented with support for both pivot-based and brute-force parent finding.
\item \textbf{Implemented Hybrid simulation controller:} A JavaScript object was created as part of the plug-in to control a D3 Simulation object through the three phases of the Hybrid Layout algorithm.
\item \textbf{Evaluated interpolation parameters:} Since the interpolation process has many parameters, several values were tested and their impacts evaluated, both independently and as a whole system. A good combination of parameters was found through experiments.
\item \textbf{Compared the three algorithms:} Each algorithm's strengths and weaknesses were identified and compared against the others. Link Force was found to work well only on small data sets and does not scale, while Hybrid Layout only performs nicely on larger ones.
\end{itemize}
\section{Learning Experience}
@@ -798,20 +905,21 @@ As a result of evaluating this project, I believe that I have a better understan
\section{Future Work}
There are several areas of the project that were not thoroughly explored or could be improved. This section shows several directions that could enhance the application.
\begin{itemize}
\item \textbf{Incorporating Chalmers' 1996 and Hybrid interpolation algorithms into the D3 framework:} Currently, all the implementations are published on a publicly-accessible self-hosted Git server as a D3 plug-in. While the hybrid model seems to make more sense as a user application implementation, the improved Chalmers' algorithm and the interpolation functions could be integrated into the core functionality of the D3 library.
\item \textbf{Data Exploration Test:} The project focuses on the overall layouts produced by each algorithm and a single Stress metric. One of the goals of MDS is to explore data, which has not been assessed. A good tool and layout should help users identify patterns and meanings behind small clusters with less effort. The project could be extended to include data investigation tools.
\item \textbf{Data Sets:} The evaluation focuses on only one data set. It is possible that the algorithms behave differently on data sets with different dimensionality, data types, and distance functions. Hence, the findings in chapter \ref{ch:eval} may not apply to all.
\item \textbf{Optimal parameters generalisation:} So far, good combinations of parameters were determined only for a specific data set. These values may not be universally optimal and can vary from data set to data set. Even the threshold value used to stop Chalmers' algorithm varies with the size of the subset of the same Poker Hands data set. Future research could be conducted to find the relation between these parameters and other information about the data set.
\item \textbf{GPU rendering:} The use of GPUs for general-purpose computing (GPGPU) is gaining popularity because a GPU can perform simple calculations in parallel much faster than a CPU. In 2017, the Khronos Group introduced WebCL\cite{WebCL}, OpenCL for web browsers. However, it never gained any popularity and was not adopted by any browser.
Other efforts, such as gpu.js\cite{gpujs}, turn to using the OpenGL Shading Language (GLSL) on WebGL instead. While the latest WebGL 2.0 does not support compute shaders due to the limited feature set of OpenGL ES 3.0\cite{WebGL2}, all of the mathematical operations used in these algorithms are supported. Following this approach, Chalmers' and the interpolation algorithms could be ported to GLSL in the future.
\item \textbf{asm.js and WebAssembly:} As discussed in section \ref{ssec:des_js}, coding in lower-level languages such as C and C++ and compiling to asm.js or WebAssembly could speed up execution. Over the course of the project, support for WebAssembly has been growing, with more learning resources available online. It is now supported on many major web browsers, such as Firefox, Chrome, Safari, and Edge. The project could be ported to these languages to potentially reduce the execution time even further.
\item \textbf{More-efficient hashing algorithms for parent finding:} Over the past decade, the fields of machine learning and data mining have gained a lot of interest. Many improvements were made to solving related problems, including high-dimensional near neighbour search. Newer algorithms, such as data-dependent Locality-Sensitive Hashing\cite{LSH}, could provide better execution times or more accurate results. Future research could be carried out to incorporate these newer algorithms into the interpolation process of the Hybrid layout and evaluate any difference they make.
\item \textbf{Multi-threading with HTML5 Web Workers:} By nature, JavaScript is designed to be single-threaded. HTML5 allows new worker processes to be created and run concurrently. These workers have isolated memory spaces and are not attached to the HTML document; the only way for them to communicate is message passing. JSON objects passed between workers are serialized by the sender and de-serialized on the other end, creating even more overhead. Due to the size of the objects the program has to work with, it was estimated that the overhead would outweigh the benefit, and support was not implemented.
\end{itemize}
\section{Acknowledgements}
I would like to thank Matthew Chalmers for his guidance and feedback throughout the entire development process.
%%%%%%%%%%%%%%%%
% %
% APPENDICES %
@@ -830,6 +938,9 @@ The data sets used can also be found at
\end{verbatim}
Please note that a modern browser is required to run the application. Firefox 57 and Chrome 61 were tested, but some older versions might also work.
In order to change the value...
\chapter{Setting up development environment}
The API reference and instructions for building the plug-in are available in the README.md file. Please note that the build scripts are written for a Linux development environment and may have to be adapted for other operating systems. A built JavaScript file for the plug-in is already included with the submission, hence re-building is unnecessary.
\end{appendices}
@@ -838,7 +949,7 @@ The API references and instruction for building the plug-in is available in READ
% BIBLIOGRAPHY %
%%%%%%%%%%%%%%%%%%%%
\bibliographystyle{plainurl}
\bibliography{l4proj}
\end{document}