--- title: "Soft community detection in networks with nmfkc.net" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Soft community detection in networks with nmfkc.net} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) ``` ## Introduction `nmfkc.net()` is a **symmetric** NMF for network (adjacency) matrices. With `type = "bi"` it factorises a symmetric \eqn{N \times N} matrix as \eqn{Y \approx X X^\top} with \eqn{X \ge 0}. Each row of \eqn{X} is a node's **graded membership** across communities, so this is *soft* community detection: unlike a hard method (e.g.\ Louvain), boundary nodes are revealed by a mixed membership rather than being forced into one group. This vignette uses a **self-contained synthetic network with a known community structure**, so we can check that `nmfkc.net()` recovers it and that `nmfkc.net.rank()` finds the right number of communities. ```{r lib} library(nmfkc) ``` ## A synthetic network (stochastic block model, 3 communities) We draw an undirected graph on `60` nodes from a **stochastic block model**: three communities of 20 nodes, with a high within-community edge probability and a low between-community one. We then deliberately add **three "bridge" nodes** (one per community), each also wired to a *second* community, so that the network contains a few genuinely mixed nodes. The true community of each node is known. ```{r data, fig.width = 6, fig.height = 5.2} set.seed(1) K <- 3; n_each <- 20; N <- K * n_each # 60 nodes, 3 communities block <- rep(1:K, each = n_each) # true community labels p_in <- 0.6; p_out <- 0.05 Prob <- matrix(p_out, N, N) for (k in 1:K) Prob[block == k, block == k] <- p_in Y <- matrix(rbinom(N * N, 1, Prob), N, N) # 0/1 adjacency Y[lower.tri(Y)] <- t(Y)[lower.tri(Y)] # make symmetric diag(Y) <- 0 # add three bridge nodes, each also linked to a second community bridge <- c(20, 40, 60); into <- c(2, 3, 1) for (i in seq_along(bridge)) { b <- bridge[i]; tgt <- which(block == into[i]) e <- rbinom(length(tgt), 1, 0.45) Y[b, tgt] <- pmax(Y[b, tgt], e); Y[tgt, b] <- Y[b, tgt] } isSymmetric(Y); sum(Y) / 2 # symmetric, number of edges # adjacency matrix in the original (random) node order -- structure is hidden image(1:N, 1:N, Y[, N:1], col = c("white", "steelblue"), xlab = "node", ylab = "node", main = "Adjacency (original order)") ``` ## Soft community detection with `nmfkc.net` ```{r fit} res <- nmfkc.net(Y, rank = 3, type = "bi", nstart = 10, maxit = 500) res$r.squared # fit head(round(res$X.prob, 2)) # soft membership: rows = nodes, columns = communities head(res$X.cluster) # hard label = argmax membership ``` ## Did it recover the known communities? ```{r compare} table(true = block, estimated = res$X.cluster) # the three bridge nodes have genuinely split memberships round(res$X.prob[bridge, ] * 100, 1) ``` The cross-tabulation is essentially block-diagonal: apart from the bridge nodes, every node is assigned to its true community. The bridge nodes (rows above) are roughly **50/50 between two communities**, so their hard label simply tips to whichever side is marginally stronger --- exactly the situation where a *soft* membership is more informative than a forced assignment. ## Visualising the result Re-ordering the nodes by the **recovered** community turns the adjacency matrix into a clean block-diagonal, and the membership matrix shows three crisp blocks: ```{r viz, fig.width = 8, fig.height = 4} ord <- order(res$X.cluster) op <- par(mfrow = c(1, 2), mar = c(4, 4, 3, 1)) image(1:N, 1:N, Y[ord, rev(ord)], col = c("white", "steelblue"), xlab = "node (reordered)", ylab = "", main = "Adjacency by community") image(1:N, 1:K, res$X.prob[ord, , drop = FALSE], col = hcl.colors(20, "Blues", rev = TRUE), xlab = "node (reordered)", ylab = "community", axes = FALSE, main = "Soft membership") axis(1); axis(2, at = 1:K) par(op) ``` ## Network diagram (Graphviz DOT): soft vs. hard `nmfkc.net.DOT()` turns the fit into a Graphviz **DOT** graph: each network node is linked to the community (basis) hubs by edges whose thickness encodes the membership strength. We raise `threshold` to `0.25` so that only substantial memberships are drawn --- each pure node then links to a *single* hub, which lets the `neato` layout pull the three communities apart, while the bridge nodes keep *both* links and settle in between. Rendered with the suggested package **DiagrammeR**, it embeds directly in the HTML output. The only difference between the two views below is the **node colour**: * `Y.cluster = "soft"` (default) blends the community colours in proportion to the membership, so the three bridge nodes appear in a **mixed colour**. * `Y.cluster = "hard"` paints each node in the **single colour** of its dominant community. ```{r dot-check, include = FALSE} has_dg <- requireNamespace("DiagrammeR", quietly = TRUE) ``` **Soft** -- the three bridge nodes added earlier (`bridge <- c(20, 40, 60)`, each wired to a second community) show a blended colour: ```{r dot-soft, eval = has_dg, fig.width = 7, fig.height = 7} dot_soft <- nmfkc.net.DOT(res, Y.label = as.character(1:N), X.label = paste("Community", 1:K), Y.title = "Network nodes", X.title = "Communities", layout = "neato", threshold = 0.25, Y.cluster = "soft") DiagrammeR::grViz(as.character(dot_soft)) ``` **Hard** -- those same bridge nodes (20, 40, 60) are forced into one community colour: ```{r dot-hard, eval = has_dg, fig.width = 7, fig.height = 7} dot_hard <- nmfkc.net.DOT(res, Y.label = as.character(1:N), X.label = paste("Community", 1:K), Y.title = "Network nodes", X.title = "Communities", layout = "neato", threshold = 0.25, Y.cluster = "hard") DiagrammeR::grViz(as.character(dot_hard)) ``` (If DiagrammeR is not installed, both diagrams are skipped.) ## How many communities? `nmfkc.net.rank` The same rank-selection diagnostics as `nmfkc.rank` are available for the symmetric model. The recommended rank is the cross-validation (ECV) minimum; the effective rank and R-squared elbow corroborate it. ```{r rank, fig.width = 7, fig.height = 6} rk <- nmfkc.net.rank(Y, rank = 1:6, type = "bi", nstart = 5) rk$rank.best round(rk$criteria, 3) ``` ## Remarks `nmfkc.net()` recovers the three planted communities for the core nodes, and `nmfkc.net.rank()` selects three. Because the membership is *soft* (`res$X.prob`), the three bridge nodes that sit between communities show a split membership instead of being forced into one group --- the practical advantage of NMF over hard community detection, made visible by the soft-vs-hard DOT graphs above. `type = "bi"` (\eqn{Y \approx X X^\top}) is the usual choice for an undirected, non-negative network; `type = "tri"` (\eqn{Y \approx X C X^\top}) adds a community-interaction matrix \eqn{C}, and `type = "signed"` handles networks with negative weights. For choosing the number of communities, the same advice as in the *rank-selection* vignette applies: use the ECV minimum as the primary criterion and corroborate it with the effective rank. See `?nmfkc.net`, `?nmfkc.net.ecv`, `?nmfkc.net.rank`, and the companion vignette *"Choosing the NMF rank on data with a known true rank"*.