\documentclass{l4proj}
\usepackage{url}
\usepackage{fancyvrb}
\usepackage[final]{pdfpages}
\usepackage{algpseudocode}
\usepackage{wrapfig}
\begin{document}
\title{Faster force-directed layout algorithms for the D3 visualisation toolkit}
\author{Pitchaya Boonsarngsuk}
\date{March 21, 2018}
\maketitle
\begin{abstract}
% TODOO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO
We show how to produce a level 4 project report using latex and pdflatex using the
style file l4proj.cls
\end{abstract}
%\educationalconsent
%
%NOTE: if you include the educationalconsent (above) and your project is graded an A then
% it may be entered in the CS Hall of Fame
%
\tableofcontents
%==============================================================================
%%%%%%%%%%%%%%%%
% %
% Introduction %
% %
%%%%%%%%%%%%%%%%
\chapter{Introduction}
\label{ch:intro}
\pagenumbering{arabic} % ONLY DO THIS AT THE FIRST CHAPTER
\section{Motivation}
Data sets are growing rapidly. Not long ago, collections of a million records were uncommon; today they are routinely produced, and even processing them on a single laptop is a struggle.
Many approaches to visualising such data exist; some map the many features of each object down to two dimensions based on the inter-object distances, so the resulting coordinates themselves carry no intrinsic meaning.
\section{Project Description}
%==============================================================================
%%%%%%%%%%%%%%%%
% %
% Background %
% %
%%%%%%%%%%%%%%%%
\chapter{Background}
\label{ch:bg}
This chapter reviews the history of data visualisation, multidimensional scaling (MDS), the spring model, and other approaches including parameter-mapping techniques such as the radar chart.
\section{Link force}
\label{sec:linkbg}
The D3 library, which will be described in section \ref{ssec:d3design}, has several different force models implemented in its Force module for creating force-directed graphs. One of them is the link force. In this model, a force is applied between the two nodes at the ends of each link. The force pushes the nodes together or apart with varying strength, proportional to the error between the desired and current distance on the graph. Essentially, it rests on the same basis as the spring model. An example of a graph produced by the D3 link force is shown in figure \ref{fig:bg_linkForce}.
\begin{figure}[h]
\centering
\includegraphics[height=9.2cm]{images/d3-force.png}
\caption{An example of a graph produced by D3 link force.}
\label{fig:bg_linkForce}
\end{figure}
The link force algorithm is inefficient. In each time step (iteration), a calculation has to be done for each pair of nodes connected by a link. This means that for our use case of a fully-connected graph (where every node is connected to every other node) of $N$ nodes, the algorithm has to perform $N(N-1)$ force calculations per iteration, essentially $O(N^2)$. It is also believed that the number of iterations required to create a good layout is proportional to the size of the data set, hence a total time complexity of $O(N^3)$.
The model also caches the desired distance of each link in memory to improve the speed across many iterations. While this greatly reduces the number of calls to the distance-calculating function, the memory complexity increases to $O(N^2)$. Because the JavaScript memory heap is limited, the model runs out of memory when trying to process a fully-connected graph of more than a few thousand data points.
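As an illustration of why this does not scale, the following sketch (not taken from the project's code) builds the quadratically sized link list required for a fully-connected graph and passes it to the D3 v4 force module; \texttt{nodes}, \texttt{highDimDistance}, and \texttt{draw} are hypothetical names assumed to be defined elsewhere.
\begin{verbatim}
// Illustrative sketch: D3 link force over a fully-connected graph.
// nodes, highDimDistance and draw are placeholders defined elsewhere.
var links = [];
for (var i = 0; i < nodes.length; i++) {
  for (var j = i + 1; j < nodes.length; j++) {
    links.push({source: i, target: j});   // O(N^2) links kept in memory
  }
}
var simulation = d3.forceSimulation(nodes)
    .force("link", d3.forceLink(links)
        .distance(function(l) {           // desired 2D distance per link
          return highDimDistance(l.source, l.target);
        }))
    .on("tick", draw);                    // redraw after every iteration
\end{verbatim}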
\section{Chalmers' 1996 algorithm}
In 1996, Matthew Chalmers proposed a technique that reduces the time complexity to $O(N^2)$, a massive improvement over link force's $O(N^3)$ as described in section \ref{sec:linkbg}, at the cost of accuracy. This is done by reducing the number of spring force calculations per iteration using random sampling~\cite{Algo1996}.
To begin, each object $i$ is assigned two distinct sets. The $Neighbours$ set stores a sorted list of the other objects that are closest to $i$, i.e. those with a low high-dimensional distance to it. These objects are expected to be placed nearby in 2D space. At the start, this set is empty. The $Neighbours$ set is referred to as $V$ in the original paper. The second set is $Samples$ (referred to as $S$). This set contains a number of other random objects that are not members of the $Neighbours$ set, and is regenerated at the start of every iteration.
In each iteration, each object $i$ only performs spring force calculations against the objects in its $Neighbours$ and $Samples$ sets. Afterwards, each random sample is compared against the objects in the $Neighbours$ set; if it is closer to $i$, it is swapped into the $Neighbours$ set. As a result, the $Neighbours$ set becomes a progressively better representation of the objects most similar to $i$.
The total number of spring calculations per iteration is reduced from $N(N-1)$ to $N(Neighbours_{size} + Samples_{size})$, where $Neighbours_{size}$ and $Samples_{size}$ denote the maximum sizes of the $Neighbours$ and $Samples$ sets, respectively. Because these two numbers are pre-set constants, the time complexity per iteration is $O(N)$.
Previous evaluations indicated that the quality of the produced layout improves as $Neighbours_{size}$ and $Samples_{size}$ grow larger. For larger data sets, setting these values too small could cause the algorithm to miss some details. However, favourable results can be obtained with values as low as 5 and 10 for $Neighbours_{size}$ and $Samples_{size}$~\cite{Algo2002}.
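The following JavaScript sketch outlines one iteration of the algorithm for a single object. It is illustrative only, not the project's implementation; \texttt{applySpringForce} and \texttt{highDimDistance} are hypothetical helpers.
\begin{verbatim}
// One iteration of the 1996 algorithm for object i (illustrative sketch).
function iterate(i, nodes, neighbours, maxNeighbours, numSamples) {
  // Regenerate the Samples set: random objects not already neighbours.
  var samples = [];
  while (samples.length < numSamples) {
    var c = Math.floor(Math.random() * nodes.length);
    if (c !== i && neighbours.indexOf(c) < 0 && samples.indexOf(c) < 0) {
      samples.push(c);
    }
  }
  // Spring force calculations against Neighbours and Samples only.
  neighbours.concat(samples).forEach(function(j) {
    applySpringForce(nodes[i], nodes[j], highDimDistance(nodes[i], nodes[j]));
  });
  // Swap closer samples into the bounded, sorted Neighbours set.
  samples.forEach(function(j) {
    neighbours.push(j);
    neighbours.sort(function(a, b) {
      return highDimDistance(nodes[i], nodes[a])
           - highDimDistance(nodes[i], nodes[b]);
    });
    if (neighbours.length > maxNeighbours) { neighbours.pop(); }
  });
}
\end{verbatim}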
\section{Hybrid Layout for Multidimensional Scaling}
In 2002, Alistair Morrison, Greg Ross, and Matthew Chalmers introduced a multi-phase hybrid approach, based on Chalmers' 1996 algorithm, that reduces the run time to $O(N\sqrt{N})$. This is achieved by calculating the spring forces over only a subset of the data and interpolating the rest onto the 2D space~\cite{Algo2002}.
%TODO Maybe history of hybrid layout, 3rd section on original paper
In this hybrid layout method, $\sqrt{N}$ sample objects ($S$) are first laid out in 2D space using the 1996 algorithm. The complexity of this step is $O(\sqrt{N}\sqrt{N})$, i.e. $O(N)$. After that, each of the remaining objects $i$ is interpolated onto the layout as described below.
\begin{enumerate}
\item \label{step:hybridFindPar} Find the `parent' object $x\in{S}$ with the minimal high-dimensional distance to $i$. This is essentially a nearest-neighbour search problem.
\item Define a circle around $x$ with radius $r$, proportional to the high-dimensional distance between $x$ and $i$.
\item Find the quadrant of the circle in which it is most satisfactory to place $i$.
\item Perform a binary search within that quadrant to determine the best angle for $i$ and place it there.
\item Select a random sample set $s\subset{S}$.
\item \label{step:hybridFindVec} Calculate the sum of the spring force vectors between $i$ and each member of $s$.
\item \label{step:hybridApplyVec} Add the resulting vector to $i$'s current position.
\item Repeat steps \ref{step:hybridFindVec} and \ref{step:hybridApplyVec} a constant number of times to refine the placement.
\end{enumerate}
In this process, step \ref{step:hybridFindPar} has the highest time complexity, $O(S_{size})$, i.e. $O(\sqrt{N})$. Because there are $N-\sqrt{N}$ objects to interpolate, the overall complexity of the interpolation phase is $O(N\sqrt{N})$.
Finally, Chalmers' spring model is applied to the full data set for a constant number of iterations. This operation has a time complexity of $O(N)$.
Previous evaluations show that this method is faster than the 1996 algorithm alone and can create a layout with lower stress, thanks to the more accurate positioning in the interpolation process.
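To make the interpolation steps concrete, a condensed JavaScript sketch for a single object is shown below; \texttt{findBestAngle} stands in for the quadrant and binary searches of steps 3 and 4, and \texttt{radiusScale}, \texttt{springForce}, and \texttt{highDimDistance} are hypothetical helpers rather than the project's actual code.
\begin{verbatim}
// Interpolating object i onto the layout of the samples S (illustrative).
function interpolate(i, S, nodes, refineSteps, forceSamples) {
  // Step 1: brute-force search for the nearest parent in S.
  var parent = S[0];
  S.forEach(function(x) {
    if (highDimDistance(nodes[i], nodes[x])
        < highDimDistance(nodes[i], nodes[parent])) { parent = x; }
  });
  // Steps 2-4: place i on a circle of radius r around the parent.
  var r = radiusScale * highDimDistance(nodes[i], nodes[parent]);
  var angle = findBestAngle(nodes[i], nodes[parent], r);
  nodes[i].x = nodes[parent].x + r * Math.cos(angle);
  nodes[i].y = nodes[parent].y + r * Math.sin(angle);
  // Steps 5-8: refine against random samples drawn from S.
  for (var step = 0; step < refineSteps; step++) {
    var fx = 0, fy = 0;
    for (var n = 0; n < forceSamples; n++) {
      var s = S[Math.floor(Math.random() * S.length)];
      var f = springForce(nodes[i], nodes[s],
                          highDimDistance(nodes[i], nodes[s]));
      fx += f.x; fy += f.y;
    }
    nodes[i].x += fx;
    nodes[i].y += fy;
  }
}
\end{verbatim}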
\section{Hybrid MDS with Pivot-Based Searching algorithm}
\begin{wrapfigure}{r}{0.3\textwidth}
\centering
\includegraphics[width=0.3\textwidth]{images/pivotBucketsIllust.png}
\caption{Diagram of a pivot (dark shaded point) with six buckets, illustrated as the rings between the dotted circles. Each of the other points is classified into a bucket by its distance to the pivot.}
\label{fig:bg_pivotBuckets}
\end{wrapfigure}
The bottleneck of the hybrid model is the nearest-neighbour search performed during the interpolation. The previous brute-force method results in a time complexity of $O(N\sqrt{N})$. This improvement introduces pivot-based searching to find an approximate near neighbour instead, reducing the time complexity to $O(N^\frac{5}{4})$~\cite{Algo2003}.
The main improvement is gained by preprocessing the set $S$ (the $\sqrt{N}$ samples) so that each of the $N-\sqrt{N}$ other points can find its parent faster. To begin, $k$ points are selected from $S$ as `pivots'. Each of the $k$ pivots has a number of buckets, and every other point in $S$ is assigned a bucket number based on its distance to the pivot $p$, as illustrated in figure \ref{fig:bg_pivotBuckets}.
To find the parent of an object, a distance calculation is first performed against each pivot to determine which bucket of each pivot the object falls into. The contents of these buckets are then searched for the nearest neighbour.
\begin{algorithmic}
\Statex Preprocessing:
\ForAll{samples $x$ of the $\sqrt{N}$ samples in $S$}
\ForAll{pivots $p$ of the $k$ pivots}
\State Perform a distance calculation between $x$ and $p$ and assign $x$ to a bucket of $p$
\EndFor
\EndFor
\end{algorithmic}
\begin{algorithmic}
\Statex Find the parent for object $i$:
\ForAll{pivots $p$ of the $k$ pivots}
\State Perform a distance calculation between $i$ and $p$
\State Determine the bucket of $p$ that $i$ falls into
\ForAll{points $x$ in that bucket}
\State Perform a distance calculation between $i$ and $x$
\EndFor
\EndFor
\end{algorithmic}
The complexity of the preprocessing stage is $O(\sqrt{N}k)$. For a query, the average number of points in each bucket is $\frac{S_{size}}{\text{number of buckets}} = \frac{\sqrt{N}}{N^{\frac{1}{4}}} = N^{\frac{1}{4}}$. Since a query is performed for each of the $N-\sqrt{N}$ points not in $S$, the overall complexity is $O(\sqrt{N}k + (N-\sqrt{N})N^{\frac{1}{4}}) = O(N^{\frac{5}{4}})$.
With this method, the parent found is not guaranteed to be the closest point. Prior evaluations have concluded that the accuracy is nevertheless high enough to produce good results.
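The pseudocode above corresponds roughly to the following JavaScript sketch. The equal-width bucket scheme and all names (\texttt{highDimDistance}, \texttt{maxDist}) are assumptions made for illustration, not the implementation described in the original paper.
\begin{verbatim}
// Pivot-based near-neighbour search (illustrative sketch).
// Preprocessing: assign every sample in S to one bucket of every pivot.
function buildBuckets(S, pivots, numBuckets, maxDist, nodes) {
  var buckets = pivots.map(function() {
    var b = []; for (var i = 0; i < numBuckets; i++) { b.push([]); }
    return b;
  });
  pivots.forEach(function(p, pi) {
    S.forEach(function(x) {
      var d = highDimDistance(nodes[p], nodes[x]);
      var bi = Math.min(numBuckets - 1, Math.floor(d / maxDist * numBuckets));
      buckets[pi][bi].push(x);
    });
  });
  return buckets;
}
// Query: search the matching bucket of every pivot for a candidate parent.
function findParent(i, pivots, buckets, numBuckets, maxDist, nodes) {
  var best = null, bestDist = Infinity;
  pivots.forEach(function(p, pi) {
    var d = highDimDistance(nodes[i], nodes[p]);
    var bi = Math.min(numBuckets - 1, Math.floor(d / maxDist * numBuckets));
    buckets[pi][bi].forEach(function(x) {
      var dx = highDimDistance(nodes[i], nodes[x]);
      if (dx < bestDist) { bestDist = dx; best = x; }
    });
  });
  return best;
}
\end{verbatim}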
\section{Performance Metrics}
To compare different algorithms, they have to be tested against the same set of performance metrics. During development, a number of metrics were used to objectively judge the resulting layout and the computational requirements. The evaluation process in chapter \ref{ch:eval} focuses on the following metrics.
\begin{itemize}
\item \textbf{Execution time} is a broadly used metric for any algorithm requiring significant computational power. Some applications aim to be interactive, so the algorithm has to finish its calculations within tight time constraints for the program to stay responsive. This project, however, focuses on large data sets with minimal user interaction. Hence, execution time in this project is a measure of the time an algorithm takes to produce its ``final'' result. The criteria for this are discussed in detail in chapter \ref{ch:eval}.
\item \textbf{Stress} is one of the most popular metrics for spring-based layout algorithms, modelled on the mechanical stress of a spring system. It is based on the sum of squared errors of inter-object distances\cite{Algo1996}. The function is defined as follows (a code sketch of this calculation is given after this list). $$Stress = \frac{\sum_{i<j} (d_{ij}-g_{ij})^2}{\sum_{i<j} g^2_{ij}}$$ $d_{ij}$ denotes the desired high-dimensional distance between objects $i$ and $j$, while $g_{ij}$ denotes the low-dimensional distance.
While stress is a good metric for evaluating a layout, its calculation is an expensive operation ($O(N^2)$) and is not part of the normal operation of any of the algorithms. As a result, we cannot accurately measure the execution time of an algorithm if the stress is calculated between iterations.
\item \textbf{Memory usage} With growing interest in machine learning, data sets are getting bigger. It is common to encounter data sets with tens or hundreds of thousands of instances, each with possibly hundreds of attributes. Memory usage therefore shows how an algorithm scales to larger data sets and how many data points a computer system can handle.
\end{itemize}
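As an illustration of the stress calculation above, the following sketch computes the metric directly from its definition; \texttt{highDimDistance} is a hypothetical placeholder for the data set's distance function.
\begin{verbatim}
// Stress of a layout (illustrative sketch): normalised sum of squared
// errors between high-dimensional distances d and 2D layout distances g.
function stress(nodes) {
  var numerator = 0, denominator = 0;
  for (var i = 0; i < nodes.length; i++) {
    for (var j = i + 1; j < nodes.length; j++) {
      var d = highDimDistance(nodes[i], nodes[j]);          // desired
      var g = Math.hypot(nodes[i].x - nodes[j].x,           // actual
                         nodes[i].y - nodes[j].y);
      numerator   += (d - g) * (d - g);
      denominator += g * g;
    }
  }
  return numerator / denominator;
}
\end{verbatim}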
%==============================================================================
%%%%%%%%%%%%%%%%
% %
% Design %
% %
%%%%%%%%%%%%%%%%
\chapter{Design}
\label{ch:design}
something
\section{Technologies}
\subsection{HTML, CSS, and SVG}
%============================
\subsection{Javascript}
%============================
\subsection{Data Driven Document}
\label{ssec:d3design}
%============================
\subsection{Bartasius' D3 Neighbour Sampling plug-in}
%============================
\section{Input Data and Parameters}
\section{Graphical User Interface for Evaluation}
%==============================================================================
%%%%%%%%%%%%%%%%
% %
% Implement %
% %
%%%%%%%%%%%%%%%%
\chapter{Implementation}
\label{ch:imp}
\section{Outline}
\section{Algorithms}
\subsection{Link force}
%============================
\subsection{Chalmers' 1996}
%============================
\subsection{Hybrid Layout}
%============================
\subsection{Hybrid Layout with Pivot}
%============================
\section{Integration with D3}
\section{Performance-improving Decisions}
\subsection{Different types of loops}
%============================
\subsection{Caching distances for Chalmers' 1996 algorithm}
%============================
%==============================================================================
%%%%%%%%%%%%%%%%
% %
% EVAL %
% %
%%%%%%%%%%%%%%%%
\chapter{Evaluation}
\label{ch:eval}
%TODO SOMETHING HERE
% Link is golden standard, the rest try to get to that but cut corners
\section{Data Sets}
\label{sec:EvalDataSet}
The data sets utilised during development are the Iris, Poker Hands\cite{UCL_Data}, and Antarctic\cite{Antartica_Data} data sets.
Iris is one of the most popular data sets for getting started in machine learning. It contains 150 measurements from flowers of the Iris setosa, Iris versicolour, and Iris virginica species, each with four parameters: petal and sepal width and length in centimetres. It is chosen as a starting point for development because it is a classification data set where the parameters can be used by the distance function and the label is only used to colour each instance. Each species also clusters quite clearly, making it easier to see whether an algorithm is working as intended.
Poker Hands is another classification data set, containing possible hands of 5 playing cards, each card described by its rank (Ace, 2, 3, ...) and suit (Hearts, Spades, etc.). Each hand is labelled with the poker hand it forms (unrecognised, Flush, Full House, etc.). This data set is selected for the experiments because it contains over a million records. In each test, only a subset of the data is used due to size limitations.
\begin{figure}
\centering
\includegraphics[height=6cm]{layout/Link10000Stable_crop.png}
\caption{A subset of 10,000 data points of the Poker Hands data set, laid out by the D3 Link force, which should produce the most accurate layout.}
\label{fig:eval_idealSample}
\end{figure}
The Antarctic data set contains 2,202 measurements taken by remote sensing probes over 2 weeks at a frozen lake in the Antarctic. Features include water temperature, UV radiation levels, ice thickness, etc. The data was formatted into CSV by Greg Ross and is used to represent a data set with complex structure. Due to its relatively small size, this data set is only used to compare the algorithms' ability to show fine details.
\section{Experimental Setup}
% TODOO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO
Hardware and the web browser can greatly impact JavaScript performance. In addition to the code and the data set, these variables therefore have to be controlled as well.
The computers used are all the same model of a Dell All-in-One desktop computer with Intel\textregistered{} Core\texttrademark{} i5-3470S and 8GB of DDR3 memory, running CentOS 7 with Linux 3.10-x86-64.
As for the web browser, the official 64-bit build of Google Chrome 61.0.3163.79 is used both to run the experiments and to analyse CPU and memory usage with its performance profiling tool.
Other unrelated parameters also have to be controlled as much as possible. The starting positions of all nodes are locked at $(0,0)$ and the simulation's velocity decay is kept at its default of $0.4$, mimicking air friction. Alpha, a decaying value used to artificially slow down or freeze the system over time, is kept at 1 to keep the spring forces in full effect. The web page is also refreshed after every run to make sure that everything, including uncontrollable aspects such as the JavaScript heap and the behaviour of the browser's garbage collector, has been properly reset.
\subsection{Termination criteria}
Both the D3 Link force and the 1996 algorithm create a layout that stabilises over time. In D3, calculations are performed for a predefined number of iterations. This has the drawback of having to select an appropriate number: choosing a number that is too high means execution time is wasted calculating minute details with no visible change to the layout, while choosing one that is too low can result in a bad layout.
Determining this constant is problematic, considering that each algorithm may stabilise after a different number of iterations, especially when the interpolation result can vary greatly from run to run.
An alternative is to stop when a condition is met. One such condition proposed is based on the change in velocity of the system between iterations\cite{Algo2002}. In other words, once the amount of force applied in an iteration drops below a scalar threshold, the calculation may stop. Taking note of the stress and the average force applied over multiple iterations, as illustrated in figure \ref{fig:eval_stressVeloOverTime}, it is clear that the D3 Link force converges to zero while the 1996 algorithm reaches and fluctuates around a constant. Because the $Samples$ set keeps changing, the system never reaches a state where the spring forces cancel each other out almost completely. This is also reflected in the animation, where every node keeps wiggling but the overall layout remains constant. It can also be seen that the stress of each layout converges to a minimal value as the average force converges to a constant, indicating that the best layout each algorithm can produce is obtained once the system stabilises.
Since stress takes too long to calculate every iteration, the termination criterion is settled on the average force applied. This criterion is used for all three algorithms. The cut-off constant is then manually selected for each algorithm and for every subset of the Poker Hands data. The D3 Link force's threshold is a value low enough that the stress has stabilised and no visible changes occur, while the 1996 algorithm's is the lowest value that is reached in most runs.
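A minimal sketch of this criterion is shown below, using each node's per-iteration velocity components (\texttt{vx} and \texttt{vy} in the D3 force module) as a proxy for the force applied; the threshold itself is picked manually as described above.
\begin{verbatim}
// Stop once the average force (velocity change) applied in an iteration
// falls below a manually chosen threshold.
function shouldTerminate(nodes, threshold) {
  var total = 0;
  nodes.forEach(function(n) {
    total += Math.hypot(n.vx, n.vy);   // magnitude applied to n this iteration
  });
  return (total / nodes.length) < threshold;
}
\end{verbatim}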
By selecting this termination condition, the goal of the last phase of the Hybrid Layout algorithm changes. Rather than running the 1996 algorithm over the whole data set for a fixed number of iterations to correct interpolation errors, the interpolation phase's role becomes helping the final phase reach stability quicker. Thus, the parameters of the interpolation phase cannot be evaluated on their own: taking more time to produce a better interpolation result may or may not affect the number of iterations in the final phase, creating a need to balance the time spent on interpolation against the time it saves.
\begin{figure}
\centering
\includegraphics[height=5cm]{graphs/stressVeloOverTime.png}
\caption{A log-scaled graph showing the stress and the average force applied per iteration decreasing over time and converging.} %10,000 data points
\label{fig:eval_stressVeloOverTime}
\end{figure}
%============================
\subsection{Selecting Parameters}
Some of the algorithms have variables that are predefined constants. Choosing the wrong values could lead an algorithm to produce bad results or to take unnecessarily long computation time and memory. To compare the algorithms fairly, an optimal set of parameters has to be chosen for each.
The D3 Link force has no adjustable parameters for this use case.
The 1996 algorithm has two parameters: $Neighbours_{size}$ and $Samples_{size}$.
According to previous evaluations\cite{LastYear}\cite{Algo2002}, favourable layouts can be achieved with values as low as $10$ for both variables. Preliminary testing seems to agree, so this value is selected for the experiments.
\begin{figure}
\centering
\includegraphics[height=10cm]{layout/interpVar.png}
\caption{Differences in interpolation results for a subset with 1,000 data points. The left images show only the data points in set $S$ and the right images show the result immediately after interpolation. The set $S$ of the bottom images only contains samples from the classes ``Unrecognized'' and ``One pair'' (coloured blue and orange respectively), resulting in low accuracy when interpolating points of other classes.}
\label{fig:eval_interpVariations}
\end{figure}
The Hybrid Layout has multiple parameters in the interpolation phase. For the parent-finding stage, there is a choice of whether to use the brute-force or the pivot-based searching method. In the pivot-based case, the number of pivots ($k$) also has to be chosen. Experiments have been run to find the accuracy of pivot-based searching, starting from $1$ pivot, to determine reasonable numbers to use in subsequent experiments. However, as shown in figure \ref{fig:eval_interpVariations}, the randomly selected set $S$ (the $\sqrt{N}$ samples used in the first stage with the 1996 algorithm) can greatly affect the interpolation result, especially with smaller data sets containing many small clusters. Therefore, each test has to be run multiple times to generalise the result. As illustrated in figure \ref{fig:eval_pivotHits}, the more pivots, the higher the accuracy and consistency. Diminishing returns can already be observed at around 6 to 10 pivots, depending on the number of data points. Hence, higher numbers of pivots are not considered further; the numbers of pivots used in subsequent experiments are 1, 3, 6, and 10.
\begin{figure}
\centering
\includegraphics[height=5cm]{graphs/hitrate_graph1.png}
\includegraphics[height=5cm]{graphs/hitrate_graph2.png}
\caption{Graphs showing the accuracy of pivot-based searching for $k = $ 1, 3, 6, and 10. On the left is the percentage of queries in which pivot-based searching returns the same result as brute-force searching, over 5 different runs (higher is better). On the right is the ratio between the high-dimensional distance of the candidate parent chosen by pivot-based searching and that of the best parent found by the brute-force method (closer to 1 is better). For example, if the best parent is 1 unit away from the querying node, a ratio of 1.3 means that the candidate parent is 1.3 units away. The subset used has 100,000 data points.}
\label{fig:eval_pivotHits}
\end{figure}
Finally, the last step of the interpolation is to refine the placement a constant number of times. Preliminary testing shows that while this step can clean up interpolation artifacts, as shown in figure \ref{fig:eval_refineCompare}, a desirable layout cannot be obtained no matter how many refinement steps are taken; hence the 1996 algorithm has to be run over the entire data set after the interpolation phase. For the rest of the experiments, only two values, 0 and 20, are used, representing interpolation without and with artifact cleaning, respectively.
\begin{figure}
\centering
\includegraphics[height=5cm]{layout/refineCompare.png}
\caption{A comparison between interpolation with no (left) and 20 (right) refinement steps. The left image shows more interpolation artifacts, especially in the bottom-right corner, where the parent nodes of multiple points can be inferred from the way the points line up.}
\label{fig:eval_refineCompare}
\end{figure}
%============================
\subsection{Performance metrics}
% RAM, Time, Stress, Layout
%============================
\section{Results}
%==============================================================================
%%%%%%%%%%%%%%%%
% %
% Conclusion %
% %
%%%%%%%%%%%%%%%%
\chapter{Conclusion}
\label{ch:conc}
\section{Summary}
\section{Learning Experience}
\section{Future Work}
\begin{itemize}
\item \textbf{Incorporating Chalmers' 1996 algorithm into D3 framework}
\item \textbf{Data Exploration Test} The project focuses on the overall layouts produced by each algorithm and a single stress metric. One of the goals of MDS is data exploration, which has not been assessed. A good tool and layout should help users identify patterns and the meaning behind small clusters with less effort. The project could be extended to include data investigation tools.
\item \textbf{Data Sets} The evaluation focuses on only one data set. The algorithms could behave differently on other data sets with different dimensionality, data types, and distance functions. Hence, the findings in chapter \ref{ch:eval} may not apply universally.
\item \textbf{Optimal parameters generalisation}
\item \textbf{GPU rendering}
\item \textbf{asm.js and wasm} Most implementations of JavaScript are relatively slow. asm.js gains extra performance by using only a restricted subset of JavaScript and is intended as a compilation target for other languages such as C/C++ rather than a language to write code in. Existing JavaScript engines can run asm.js, while engines that recognise asm.js can compile it to machine code ahead-of-time (AOT), eliminating the need to run the code through an interpreter. At the moment, the D3-force library still uses standard JavaScript, so a significant chunk of the library would have to be ported in order to compare the different algorithms fairly.
WebAssembly (wasm), on the other hand, is a binary format designed to run alongside JavaScript in the same sandbox and is even faster than JavaScript. Many major web browsers, such as Firefox, Chromium, Safari, and Edge, support WebAssembly. Having only been released in March 2017, however, support was not yet widespread and learning resources were hard to find. As a result, WebAssembly was not considered at the start of this project. %REF ME DADDY
\item \textbf{Locality-Sensitive Hashing}
\item \textbf{Multi-threading with HTML5 Web Workers} By nature, JavaScript is designed to be single-threaded. HTML5 Web Workers allow new workers to be created and run concurrently. These workers have isolated memory spaces and are not attached to the HTML document. The only way for them to communicate with each other is message passing: the JSON objects passed are serialised by the sender and de-serialised on the other end, creating even more overhead. Due to the size of the objects the program has to work with, it is estimated that this overhead would outweigh the benefit, so support was not implemented (a minimal sketch of the message-passing pattern is given after this list). %REF ME DADDY
\end{itemize}
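A minimal sketch of the Web Worker message-passing pattern described above; \texttt{layoutWorker.js} and \texttt{computeLayout} are hypothetical names used only for illustration.
\begin{verbatim}
// main.js -- illustrative sketch of Web Worker message passing.
var worker = new Worker("layoutWorker.js");
worker.onmessage = function(event) {
  draw(event.data.positions);          // updated positions from the worker
};
worker.postMessage({nodes: nodes});    // data is copied (serialised) per message

// layoutWorker.js
onmessage = function(event) {
  var positions = computeLayout(event.data.nodes);  // placeholder layout code
  postMessage({positions: positions});
};
\end{verbatim}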
%%%%%%%%%%%%%%%%
% %
% APPENDICES %
% %
%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%
% BIBLIOGRAPHY %
%%%%%%%%%%%%%%%%%%%%
\bibliographystyle{plain}
\bibliography{l4proj}
\end{document}