\documentclass{l4proj}
\usepackage{url}
\usepackage{fancyvrb}
\usepackage[final]{pdfpages}
\usepackage{algpseudocode}
\usepackage{wrapfig}

\begin{document}

\title{Faster force-directed layout algorithms for the D3 visualisation toolkit}
\author{Pitchaya Boonsarngsuk}
\date{March 21, 2018}
\maketitle

\begin{abstract}
Force-directed layout is a widely used technique for placing high-dimensional data onto a two-dimensional plane, but the link force built into the D3 visualisation toolkit scales poorly to large data sets. This project implements faster force-directed layout algorithms for D3, based on Chalmers' 1996 stochastic sampling algorithm and its hybrid and pivot-based successors, and evaluates them against the existing link force in terms of execution time and layout stress.
\end{abstract}

%\educationalconsent
%
%NOTE: if you include the educationalconsent (above) and your project is graded an A then
%      it may be entered in the CS Hall of Fame
%
\tableofcontents

%==============================================================================
%%%%%%%%%%%%%%%%
%              %
% Introduction %
%              %
%%%%%%%%%%%%%%%%
\chapter{Introduction}
\label{ch:intro}
\pagenumbering{arabic} % ONLY DO THIS AT THE FIRST CHAPTER

\section{Motivation}
Data sets keep growing. A data set of a million records, once considered large, is now commonplace, yet laying one out on an ordinary laptop remains painfully slow. Many visualisation approaches map the many features of each object down to a position in two dimensions. The approaches considered in this project are distance-based: the layout aims to preserve the inter-object distances, so the absolute coordinates themselves are not important.

\section{Project Description}

%==============================================================================
%%%%%%%%%%%%%%%%
%              %
%  Background  %
%              %
%%%%%%%%%%%%%%%%
\chapter{Background}
\label{ch:bg}
This chapter reviews the background of data visualisation relevant to this project: multidimensional scaling (MDS), the spring model, and other methods including parameter-mapping techniques such as the radar chart.

\section{Link force}
\label{sec:linkbg}
The D3 library, which will be described in section \ref{ssec:d3design}, has several different force models implemented in its force module for creating force-directed graphs. One of them is the link force. In this model, a force is applied between the two nodes at the ends of each link. The force pushes the nodes together or apart with varying strength, proportional to the error between the desired and current distance on the graph. Essentially, it has the same basis as the spring model. An example of a graph produced by the D3 link force is shown in figure \ref{fig:bg_linkForce}.

\begin{figure}[h]
    \centering
    \includegraphics[height=9.2cm]{images/d3-force.png}
    \caption{An example of a graph produced by the D3 link force.}
    \label{fig:bg_linkForce}
\end{figure}

The link force algorithm is inefficient. In each time step (iteration), a calculation has to be done for each pair of nodes connected by a link. This means that for our use case with a fully-connected graph (where every node is connected to every other node) of $N$ nodes, the algorithm has to perform $N(N-1)$ force calculations per iteration, essentially $O(N^2)$. It is also believed that the number of iterations required to create a good layout is proportional to the size of the data set, hence a total time complexity of $O(N^3)$.

The model also caches the desired distance of each link in memory to improve speed across many iterations. While this greatly reduces the number of calls to the distance-calculating function, the memory complexity increases to $O(N^2)$. Because the JavaScript memory heap is limited, the model runs out of memory when trying to process a fully-connected graph of more than about ten thousand data points.
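To make this cost concrete, the sketch below shows the shape of one iteration of such a pairwise spring model in JavaScript (the language D3 is written in). It is an illustrative sketch only, not D3's own link force implementation; the node representation, the \texttt{desiredDistance} function and the \texttt{alpha} cooling factor are assumptions made for the example. The nested loops make the $O(N^2)$ per-iteration cost explicit.

\begin{verbatim}
// Illustrative sketch (not D3's implementation) of one iteration of a
// naive pairwise spring model. Every pair of nodes is visited, so each
// iteration costs O(N^2) force calculations.
function springIteration(nodes, desiredDistance, alpha) {
  for (let i = 0; i < nodes.length; i++) {
    for (let j = i + 1; j < nodes.length; j++) {
      const a = nodes[i], b = nodes[j];
      const dx = b.x - a.x, dy = b.y - a.y;
      const current = Math.sqrt(dx * dx + dy * dy) || 1e-6;
      // Strength is proportional to the error between the desired
      // (high-dimensional) distance and the current 2D distance.
      const error = (current - desiredDistance(i, j)) / current;
      const fx = dx * error * alpha, fy = dy * error * alpha;
      a.x += fx; a.y += fy; // pull or push the two endpoints
      b.x -= fx; b.y -= fy; // in opposite directions
    }
  }
}
\end{verbatim}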
\section{Chalmers' 1996 algorithm}
In 1996, Matthew Chalmers proposed a technique to reduce the time complexity down to $O(N^2)$, which is a massive improvement over the link force's $O(N^3)$ described in section \ref{sec:linkbg}, at the cost of some accuracy. This is done by reducing the number of spring force calculations per iteration using random sampling\cite{Algo1996}.

To begin, each object $i$ is assigned two distinct sets. The $Neighbours$ set stores a sorted list of the other objects that are closest to $i$, i.e.\ that have a low high-dimensional distance to it. These objects are expected to be placed nearby in 2D space. At the start, this set is empty. The $Neighbours$ set is referred to as $V$ in the original paper. The second set is $Samples$ (referred to as $S$). This set contains a number of other randomly chosen objects that are not members of the $Neighbours$ set, and it is regenerated at the start of every iteration.

In each iteration, each object $i$ only performs spring force calculations against the objects in its $Neighbours$ and $Samples$ sets. Afterwards, each random object is compared against the objects in the $Neighbours$ set; if it is closer to $i$ than an existing member, it is swapped into the $Neighbours$ set. As a result, the $Neighbours$ set becomes a progressively better representation of the objects most similar to $i$.

The total number of spring calculations per iteration is reduced from $N(N-1)$ to $N(Neighbours_{size} + Samples_{size})$, where $Neighbours_{size}$ and $Samples_{size}$ denote the maximum sizes of the $Neighbours$ and $Samples$ sets, respectively. Because these two numbers are pre-set constants, the time complexity per iteration is $O(N)$.

Previous evaluations indicated that the quality of the produced layout improves as $Neighbours_{size}$ and $Samples_{size}$ grow larger. For larger data sets, setting these values too small could cause the algorithm to miss some details. However, favourable results can be obtained with values as low as 5 and 10 for $Neighbours_{size}$ and $Samples_{size}$, respectively\cite{Algo2002}.
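A minimal sketch of one iteration of this procedure is given below, again in JavaScript. It is not the project's implementation: the helpers \texttt{highDimDistance} and \texttt{applySpringForce}, and the choice to keep each neighbour list sorted by distance, are assumptions made purely for illustration.

\begin{verbatim}
// Illustrative sketch of one iteration of Chalmers' 1996 algorithm.
// Assumptions: highDimDistance(i, j) returns the high-dimensional
// distance between objects i and j, applySpringForce(i, j) applies the
// spring model to the pair, neighbours[i] is an array of indices kept
// sorted by increasing distance to i, and the data set is larger than
// NEIGHBOURS_SIZE + SAMPLES_SIZE + 1.
function chalmers96Iteration(nodes, neighbours,
                             NEIGHBOURS_SIZE, SAMPLES_SIZE) {
  for (let i = 0; i < nodes.length; i++) {
    // Draw a fresh random sample set, excluding i and its neighbours.
    const samples = [];
    while (samples.length < SAMPLES_SIZE) {
      const j = Math.floor(Math.random() * nodes.length);
      if (j !== i && !neighbours[i].includes(j) && !samples.includes(j)) {
        samples.push(j);
      }
    }
    // Spring forces only against neighbours and samples:
    // a constant amount of work per object.
    for (const j of neighbours[i]) applySpringForce(i, j);
    for (const j of samples) applySpringForce(i, j);
    // Swap in any sample that is closer to i than the worst neighbour.
    for (const j of samples) {
      if (neighbours[i].length < NEIGHBOURS_SIZE) {
        neighbours[i].push(j);
      } else {
        const worst = neighbours[i][neighbours[i].length - 1];
        if (highDimDistance(i, j) < highDimDistance(i, worst)) {
          neighbours[i][neighbours[i].length - 1] = j;
        }
      }
      neighbours[i].sort(
        (a, b) => highDimDistance(i, a) - highDimDistance(i, b));
    }
  }
}
\end{verbatim}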
\section{Hybrid Layout for Multidimensional Scaling}
In 2002, Alistair Morrison, Greg Ross, and Matthew Chalmers introduced a multi-phase hybrid layout method, based on Chalmers' 1996 algorithm, to reduce the run time to $O(N\sqrt{N})$. This is achieved by calculating the spring forces over a subset of the data, and interpolating the rest onto the 2D space\cite{Algo2002}.

In this hybrid layout method, $\sqrt{N}$ sample objects ($S$) are first placed on the 2D space using the 1996 algorithm. The complexity of this step is $O(\sqrt{N} \cdot \sqrt{N})$, i.e.\ $O(N)$. After that, each of the remaining objects $i$ is interpolated as described below.
\begin{enumerate}
    \item \label{step:hybridFindPar} Find the `parent' object $x\in{S}$ with the minimal high-dimensional distance to $i$. This is essentially a nearest-neighbour search problem.
    \item Define a circle around $x$ with radius $r$, proportional to the high-dimensional distance between $x$ and $i$.
    \item Find the quadrant of the circle in which it is most satisfactory to place $i$.
    \item Perform a binary search on the quadrant to determine the best angle for $i$ and place it there.
    \item Select a random subset of samples $s \subset S$.
    \item \label{step:hybridFindVec} Calculate the sum of the force vectors between $i$ and each member of $s$.
    \item \label{step:hybridApplyVec} Add this vector to $i$'s current position.
    \item Repeat steps \ref{step:hybridFindVec} and \ref{step:hybridApplyVec} a constant number of times to refine the placement.
\end{enumerate}
In this process, step \ref{step:hybridFindPar} has the highest time complexity, $O(S_{size})$, i.e.\ $O(\sqrt{N})$. Because there are $N-\sqrt{N}$ objects to interpolate, the overall complexity of the interpolation is $O(N\sqrt{N})$. Finally, Chalmers' spring model is applied to the full data set for a constant number of iterations; this operation has a time complexity of $O(N)$.

Previous evaluations show that this method is faster than the 1996 algorithm alone, and can create a layout with lower stress, thanks to the more accurate positioning in the interpolation process.

\section{Hybrid MDS with Pivot-Based Searching algorithm}
\begin{wrapfigure}{r}{0.3\textwidth}
    \centering
    \includegraphics[width=0.3\textwidth]{images/pivotBucketsIllust.png}
    \caption{Diagram of a pivot (dark shaded point) with six buckets, illustrated as rings between the dotted circles. Each of the other points is classified into a bucket by its distance to the pivot.}
    \label{fig:bg_pivotBuckets}
\end{wrapfigure}
The bottleneck of the hybrid model is the nearest-neighbour search during the interpolation. The brute-force method described above results in a time complexity of $O(N\sqrt{N})$. This improvement introduces pivot-based searching to find an approximate near neighbour instead, which reduces the time complexity to $O(N^{\frac{5}{4}})$\cite{Algo2003}.

The main improvement is gained by preprocessing the set $S$ (the $\sqrt{N}$ samples) so that each of the $N-\sqrt{N}$ other points can find its parent faster. To begin, $k$ points are selected from $S$ as `pivots'. Each pivot $p$ has a number of buckets. Every other point in $S$ is assigned a bucket number for each pivot, based on its distance to $p$, as illustrated in figure \ref{fig:bg_pivotBuckets}. To find the parent of an object, a distance calculation is first performed against each pivot to determine which of that pivot's buckets the object falls into. The contents of these buckets are then searched for the nearest neighbour.

\begin{algorithmic}
\Statex Preprocessing:
\ForAll{$\sqrt{N}$ samples in $S$}
    \ForAll{pivots $p$ of the $k$ pivots}
        \State Perform a distance calculation
    \EndFor
\EndFor
\end{algorithmic}

\begin{algorithmic}
\Statex Find the parent of object $i$:
\ForAll{pivots $p$ of the $k$ pivots}
    \State Perform a distance calculation
    \State Determine the bucket of $i$ with respect to $p$
    \ForAll{points in that bucket}
        \State Perform a distance calculation
    \EndFor
\EndFor
\end{algorithmic}

The complexity of the preprocessing stage is $O(\sqrt{N}k)$. For a query, the average number of points in each bucket is $\frac{S_{size}}{\text{number of buckets}} = \frac{\sqrt{N}}{N^{\frac{1}{4}}} = N^{\frac{1}{4}}$. Since a query is performed for each of the $N-\sqrt{N}$ points not in $S$, the overall complexity is $O(\sqrt{N}k + (N-\sqrt{N})N^{\frac{1}{4}}) = O(N^{\frac{5}{4}})$. With this method, the parent found is not guaranteed to be the closest point, but prior evaluations have concluded that the accuracy is high enough to produce good results.
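The sketch below outlines one way the bucket structure and the approximate parent query could look. It is a sketch under stated assumptions rather than the implementation from the paper: the functions \texttt{buildPivotIndex} and \texttt{findParent}, the helper \texttt{highDimDistance}, and the use of a fixed \texttt{maxDistance} to size the buckets are introduced purely for illustration.

\begin{verbatim}
// Illustrative sketch of pivot-based near-neighbour search.
// Assumptions: highDimDistance(a, b) returns the high-dimensional
// distance, `samples` is the set S, and `maxDistance` bounds the
// distances used to size the buckets.
function buildPivotIndex(samples, numPivots, numBuckets, maxDistance) {
  const pivots = samples.slice(0, numPivots); // pick k pivots from S
  const bucketWidth = maxDistance / numBuckets;
  // buckets[p][b] holds the members of S whose distance to pivot p
  // falls into bucket b.
  const buckets = pivots.map(() =>
    Array.from({ length: numBuckets }, () => []));
  for (const s of samples) {
    pivots.forEach((p, pi) => {
      const b = Math.min(numBuckets - 1,
        Math.floor(highDimDistance(s, p) / bucketWidth));
      buckets[pi][b].push(s);
    });
  }
  return { pivots, buckets, bucketWidth, numBuckets };
}

// Approximate parent search: look only inside the bucket that the
// query object falls into for each pivot, instead of scanning all of
// S. (A full implementation would widen the search if every bucket
// examined turned out to be empty.)
function findParent(index, obj) {
  let best = null, bestDist = Infinity;
  index.pivots.forEach((p, pi) => {
    const b = Math.min(index.numBuckets - 1,
      Math.floor(highDimDistance(obj, p) / index.bucketWidth));
    for (const candidate of index.buckets[pi][b]) {
      const d = highDimDistance(obj, candidate);
      if (d < bestDist) { bestDist = d; best = candidate; }
    }
  });
  return best;
}
\end{verbatim}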
\section{Performance Metrics}
\label{sec:bg_metrics}
To compare different algorithms, they have to be tested against the same set of performance metrics. During development, a number of metrics were used to objectively judge the resulting graph and the computation required. The evaluation process in chapter \ref{ch:eval} will focus on the following metrics.
\begin{itemize}
    \item \textbf{Execution time} is a broadly used metric for any algorithm requiring significant computational power. Some applications aim to be interactive, and their algorithms have to finish the calculations within tight time constraints for the program to stay responsive. This project, however, focuses on large data sets with minimal user interaction. Hence, execution time in this project is a measure of the time an algorithm takes to produce its ``final'' result. The criteria for this will be discussed in detail in chapter \ref{ch:eval}.
    \item \textbf{Stress} is one of the most popular metrics for spring-based layout algorithms, modelled on the mechanical stress of a spring system. It is based on the sum of squared errors of the inter-object distances\cite{Algo1996}. The function is defined as follows:
    $$Stress = \frac{\sum_{i<j}\left(d_{ij} - l_{ij}\right)^2}{\sum_{i<j} d_{ij}^2}$$
    where $d_{ij}$ is the high-dimensional distance between objects $i$ and $j$, and $l_{ij}$ is the distance between their positions in the 2D layout. A lower stress value indicates a layout that better preserves the original inter-object distances.
\end{itemize}

%%%%%%%%%%%%%%%%%%%%
%   BIBLIOGRAPHY   %
%%%%%%%%%%%%%%%%%%%%
\bibliographystyle{plain}
\bibliography{l4proj}

\end{document}