Graph Spectral Clustering for User Preference Categorization in Social Networks
In this article, we explore a fragment of the Yelp social network - a website where restaurants and users post food-related activity - with the goal of finding groups of users with similar restaurant preferences. Social networks are structured as graphs, and we are often interested in extracting the information offered by the connections between nodes. In the network under consideration, we have access to a graph in the form of an adjacency matrix, which gives a complete description of which nodes are connected by which edges and with which edge weights. From this graph, we want to find groups of nodes that share similar connections, and once we have found the different clusters of nodes (i.e. users), we will evaluate how reliably our algorithm detected these clusters by analyzing an additional user-preference feature matrix.
When solving this problem, we do not have access to the feature matrix F, but only to the adjacency matrix A, where an edge of weight w between two nodes (users) indicates that both users left positive reviews for the same w restaurants. This information leads us to believe that users belonging to the same community will share a large number of (intra-community) edges, while exhibiting minimal (inter-community) connections with non-community members. It is important to note that a “community” within such a graph is defined by whatever factor most strongly differentiates groups of users. If our nodes corresponded to users all around the world, the most likely differentiating factor would be geographical location, and the communities would tend to correspond to counties and localities across many different nations. Since we are analyzing a fragment of the social network restricted to a small geographic area in the United States, geography cannot separate the nodes, so we expect to find clusters differentiated by the next most likely factor: food preference.
Section I: Background & Theory
As stated in the introduction, we expect clusters to exhibit a large number of intra-community edges and a small number of inter-community edges. Finding such clusters is a very common problem which naively starts out as a combinatorial optimization problem: partition the vertex set into clusters that minimize the total weight of the edges cut, subject to the constraints that the union of the clusters must make up the vertex set and that no two clusters may overlap.
Such a problem is NP-hard when searching for more than two clusters, mostly due to the discrete constraints. This motivated the two-way minimum-cut problem, where the graph is cut into two parts, one of which is a cluster while the other may contain one or more clusters. When a “source” and “sink” are specified - which induces a flow through the graph and restricts the number of possible cuts - the minimum cut can be found in polynomial time by the algorithm of Ford and Fulkerson; the “global” minimum cut (no source or sink given) can also be computed in polynomial time, e.g. with the Stoer-Wagner algorithm. The real issue with the minimum-cut objective is that it tends to isolate single nodes, preferring such singletons over genuine clusters.
In order to rectify the singleton issue, the objective function of the optimization problem (namely, the cut value) was normalized by some function of the number of nodes belonging to each cluster. This way, cutting off a singleton incurs a very small denominator, and hence a large normalized cut value, whereas cutting between genuine clusters (which contain many more nodes) yields a larger denominator and thus a smaller value. The normalization criteria commonly used are either the number of nodes within the cluster (ratio-cut) or the sum of the degrees of the nodes within the cluster (normalized-cut).
Although our optimization formulation now reliably yields clusters, the ratio-cut and normalized-cut problems are still NP-hard. To fix this, it was found that both problems can be equivalently formulated using the graph Laplacian and can then undergo a constraint relaxation which, in practice, has little effect on accuracy. The mathematical formulations of both problems are shown in the following for a vertex set V and a search for K clusters.
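In standard notation (conventions vary slightly by author; some include a factor of 1/2), with cut(C, V∖C) denoting the total weight of edges leaving cluster C and vol(C) the sum of the degrees of its nodes, the two objectives read:

$$\min_{C_1,\dots,C_K}\ \sum_{k=1}^{K} \frac{\operatorname{cut}(C_k,\, V \setminus C_k)}{|C_k|} \qquad \text{(ratio-cut)}$$

$$\min_{C_1,\dots,C_K}\ \sum_{k=1}^{K} \frac{\operatorname{cut}(C_k,\, V \setminus C_k)}{\operatorname{vol}(C_k)} \qquad \text{(normalized-cut)}$$

subject to $C_1 \cup \dots \cup C_K = V$ and $C_i \cap C_j = \emptyset$ for $i \neq j$.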
The graph Laplacian can be interpreted as an analogue of the standard Laplace operator: it measures to what extent a function differs at a point from its values at nearby points. This matrix, defined as L = D - A (where D is a diagonal matrix whose entries are the node degrees), offers very useful information about the graph structure and is what allows us to reformulate the problem into this more favorable form. The most useful information, which we will exploit in our approach to this problem, lies in the eigenvalues of the Laplacian. Firstly, every Laplacian has an eigenvalue of zero with a corresponding constant eigenvector of ones. In fact, the algebraic multiplicity of the zero eigenvalue equals the number of connected components in the graph; if a Laplacian has three zero eigenvalues, the graph consists of three disconnected subgraphs. Even if the components are not totally disconnected, but only loosely connected (as one would expect between clusters), the corresponding eigenvalues remain close to zero, and the entries of their eigenvectors offer insight into how connected these components are. The following example shows this property more clearly, where the columns of the matrix correspond to the eigenvectors of the graph Laplacian.
We can clearly see from the diagram that the entries of the eigenvectors corresponding to the second and third eigenvalues offer information about the connectivity between and within the graph’s clusters. In fact, by plotting the rows of these eigenvectors (i.e. the node embeddings) onto the xy-plane (the embedding space), we can clearly see the clusters they form. It turns out that within this embedding space, the closer two points are, the more strongly connected the corresponding nodes are, and the further apart they are, the less connected they are. This tells us that if we want to find the clusters within a given graph, we need only find the eigenvectors of the graph Laplacian and apply some clustering algorithm within the generated embedding space. If we take another look at our relaxed optimization problem, we can see that our goal of finding clusters via minimizing the ratio (or normalized) cut corresponds exactly to finding the eigenvectors associated with the smallest eigenvalues. In the next section, we will put this theory into practice by computing these eigenvectors for the graph Laplacian of our given social network.
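To make this concrete, the following minimal sketch (on a hypothetical toy graph, not the Yelp data) shows the near-zero second eigenvalue and how the signs of the second eigenvector separate the two clusters:

import numpy as np

# Toy graph: two triangles joined by a single bridge edge.
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

L = np.diag(A.sum(axis=1)) - A      # unnormalized Laplacian L = D - A
eigvals, eigvecs = np.linalg.eigh(L)

print(np.round(eigvals, 3))         # first eigenvalue is exactly 0; the second
                                    # is close to 0 since the triangles are
                                    # only weakly connected
print(np.round(eigvecs[:, 1], 3))   # the signs of the second eigenvector
                                    # separate the two triangles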
Section II: Implementation
To recap: in order to find the clusters of a graph given its adjacency matrix A, we must compute the graph Laplacian L and then find its eigenvectors. This corresponds to solving the relaxed optimization problem (and thereby approximating the ratio/normalized cut problem), and yields embeddings for the nodes of the graph in a space where a standard clustering algorithm can find the clusters we are looking for. To implement this, we will construct the following functions.
def construct_laplacian(A, bool_norm_lap) -> L
def spectral_embedding(A, num_clusters, bool_norm_lap) -> embeddings
def spectral_clustering(A, num_clusters, bool_norm_lap) -> predicted_clusters
def compute_ratio_cut(A, predicted_clusters) -> ratio_cut_float
def compute_normalized_cut(A, predicted_clusters) -> norm_cut_float
def print_top_categories_for_each_cluster(num_top, predicted_clusters, F, categories) -> None
These functions will then be used in Section III to find the most common food preferences of users in our snapshot of Yelp’s social network. I will briefly explain the functionality provided by each function, but I recommend visiting the GitHub link to view the code in more detail and execute it yourself.
Def. 1: This function uses the given adjacency matrix to construct and return either the graph Laplacian or the normalized graph Laplacian, depending on whether the second argument is passed as true or false. The normalized graph Laplacian is used for approximating the solution to the normalized-cut, while the unnormalized graph Laplacian is used for approximating the solution to the ratio-cut (cf. Section IV for more details).
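A minimal sketch of what this function might look like, assuming A is a dense NumPy array (the repository code may differ, e.g. by using sparse matrices or a random-walk normalization):

import numpy as np

def construct_laplacian(A, bool_norm_lap):
    # Degree vector: row sums of the adjacency matrix.
    d = A.sum(axis=1)
    L = np.diag(d) - A                  # unnormalized Laplacian L = D - A
    if bool_norm_lap:
        # Symmetric normalization: L_sym = D^(-1/2) L D^(-1/2).
        d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
        L = L * np.outer(d_inv_sqrt, d_inv_sqrt)
    return L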
Def. 2: This function calls the previous function to construct either the normalized or unnormalized graph Laplacian, and then uses the eigsh() function imported from SciPy to find and return the ‘num_clusters’ eigenvectors corresponding to the smallest eigenvalues. These eigenvectors come column-by-column in a matrix where each row represents the embedding of a given node.
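A sketch under the same assumptions (eigsh() also accepts sparse matrices, which is one reason to use it over a dense eigensolver):

from scipy.sparse.linalg import eigsh

def spectral_embedding(A, num_clusters, bool_norm_lap):
    L = construct_laplacian(A, bool_norm_lap)
    # which='SM' requests the eigenpairs with the smallest eigenvalues;
    # the columns of eigvecs are eigenvectors, so each row of the returned
    # matrix is the embedding of one node.
    eigvals, eigvecs = eigsh(L, k=num_clusters, which='SM')
    return eigvecs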
Def. 3: Having found the embedding matrix, this function performs k-means clustering on the node embeddings and returns a vector where every entry indicates the predicted cluster to which the corresponding node belongs.
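A sketch, assuming scikit-learn’s KMeans is used for the clustering step (an assumption on my part):

from sklearn.cluster import KMeans

def spectral_clustering(A, num_clusters, bool_norm_lap):
    embeddings = spectral_embedding(A, num_clusters, bool_norm_lap)
    # Cluster the rows (node embeddings) of the spectral embedding.
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    return kmeans.fit_predict(embeddings)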
Def. 4 & 5: Using the indicator vector provided by the previous function, these two functions use the predicted cluster indicators to cut the graph and return the value of the ratio-cut and normalized-cut respectively.
Def. 6: This function is used to evaluate the performance of our spectral clustering algorithm: it combines the feature matrix and the corresponding category list to count the category labels of the nodes in each cluster, and prints the result.
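A sketch, where I assume F is a binary users-by-categories matrix (F[i, j] = 1 if user i positively reviewed a restaurant of category j; the exact semantics of F are an assumption) and ‘num_top’ is my name for the number of categories to print:

import numpy as np

def print_top_categories_for_each_cluster(num_top, predicted_clusters, F, categories):
    for c in np.unique(predicted_clusters):
        # Sum the category indicators over all users in cluster c.
        counts = F[predicted_clusters == c].sum(axis=0)
        print(f"Most popular categories in cluster {c}")
        for j in np.argsort(counts)[::-1][:num_top]:
            print(f" - {categories[j]}, {int(counts[j])}")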
Section III: Evaluation
Now we are finally ready to test our functions and evaluate how well our algorithm performed by also analyzing the provided feature matrix. We begin by calling our main clustering function, and then evaluate the clusters it finds with the following function.
predicted_clusters = spectral_clustering(A, num_clusters, True)
print_top_categories_for_each_cluster(5, predicted_clusters, F, categories)
This leads to the following output,
>> Most popular categories in cluster 0
>> - Breakfast & Brunch, 636
>> - Sandwiches, 528
>> - Italian, 514
>> - Pizza, 482
>> - Coffee & Tea, 473
>>
>> Most popular categories in cluster 1
>> - Japanese, 529
>> - Chinese, 441
>> - Asian Fusion, 414
>> - Sushi Bars, 408
>> - Korean, 406
>>
>> Most popular categories in cluster 4
>> - Seafood, 315
>> - Mexican, 314
>> - Sandwiches, 294
>> - Japanese, 291
>> - Breakfast & Brunch, 286
We find that clusters 0, 1, and 4 led to the most obvious distinctions in food preferences, with cluster 0 clearly corresponding to European food, cluster 1 to Asian food, and cluster 4 to coastal food. The next most distinct cluster was American food, though it showed much overlap with cluster 0 for the simple reason that the two are quite similar, the only difference being that ‘Coffee & Tea’ is replaced with ‘American (Traditional).’ It also should not come as a surprise that ‘Breakfast & Brunch’ shows up in essentially every cluster, as this is less of a food preference and more of a universal, temporal marker for when people go out to eat.
Section IV: Further Takeaways & Improvements
The difference between the normalized and unnormalized Laplacian offers insight into the normalized-cut and ratio-cut optimization procedures, along with the kinds of problems each is better suited to solve. Spectral graph theory tells us that the normalized graph Laplacian is better suited for irregular graphs than the unnormalized Laplacian, where “irregularity” in a graph means variability in the degrees of its nodes. Graphs can be irregular to different extents, but it should come as no surprise that when dealing with real-world graphs, only very specific conditions known a priori should lead an engineer to suppose that the graph under consideration is approximately regular (i.e. that the degrees of all nodes are approximately the same). For this reason, it is best practice to work with the normalized graph Laplacian when undertaking an application of spectral clustering.
We can test the integrity of this best practice by applying our spectral clustering algorithm to our graph with both Laplacians and evaluating the returned cluster sizes, normalized-cut, and ratio-cut. First, we use the unnormalized graph Laplacian.
predicted_clusters = spectral_clustering(A, num_clusters, False)
compute_ratio_cut(A, predicted_clusters)
compute_normalized_cut(A, predicted_clusters)
And we find the following results,
>> When using L_unnorm:
>> Ratio cut = 369.109
>> Normalized cut = 5.000
>> Sizes of the partitions are: [3379, 1, 1, 1, 1, 1]
Now, using the normalized graph Laplacian (switching ‘False’ to ‘True’), we find,
>> When using L_norm:
>> Ratio cut = 5942.851
>> Normalized cut = 4.343
>> Sizes of partitions are: [742, 577, 754, 572, 350, 389]
Clearly the normalized graph Laplacian yields better results; the unnormalized Laplacian suffers from the same singleton problem exhibited by the plain minimum-cut objective.
An important property of spectral clustering is that the node features were not taken into account when finding the clusters. This is not necessarily a bad thing: sometimes it is preferable to cluster a graph based solely on its connectivity structure, as doing so can reveal the underlying communities implied by that structure alone. Understanding the communities intrinsic to a graph, uninfluenced by external node features or properties, can also serve as a benchmark against which to compare algorithms that cannot cluster properly without such features. However, features generally play an equally important role in understanding or extracting information from a graph, which is why methods such as Graph2Gauss, which incorporate node attributes, are able to capture more complex structures.
On a final note, this article has focused on a specific perspective on graph clustering, namely an optimization point-of-view: finding a community assignment that optimizes some criterion. There also exists an alternative, probabilistic view of graph clustering, which considers a graph as a realization drawn from a generative process controlled by a set of parameters, with communities modeled within that generative process (as in Planted Partition Models or Stochastic Block Models). In this perspective, finding a good community assignment is equivalent to performing inference in the model, which has the advantages of capturing uncertainty, handling missing or noisy data, and generating new data through, e.g., link prediction.
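As a brief illustration of this generative view, the following minimal (hypothetical) sketch samples a graph from a simple stochastic block model, where nodes in the same community connect with probability p_in and nodes in different communities with probability p_out:

import numpy as np

rng = np.random.default_rng(0)
communities = np.repeat([0, 1, 2], 50)   # three planted communities of 50 nodes
p_in, p_out = 0.3, 0.02                  # intra- vs. inter-community edge probability

# Edge-probability matrix: p_in inside a community, p_out across communities.
same = communities[:, None] == communities[None, :]
P = np.where(same, p_in, p_out)

# Sample an undirected, symmetric adjacency matrix without self-loops.
A = rng.random(P.shape) < P
A = np.triu(A, 1)
A = (A | A.T).astype(float)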
References
The colored node embedding diagram was taken from: S. Günnemann, Machine Learning for Graphs and Sequential Data. Technische Universität München, 2023.
Link to GitHub
To download the Python files and test the model for yourself, click here.