Suppose we intend to cluster a large or very large data set stored on disk; by large or very large we mean that the data set exceeds the available memory. We assume the data is randomly ordered on disk and that we can load only a certain percentage of it, based on the available memory allocation. If we load 1% of the data into memory at a time, then we must do so 100 times to scan the entire data set. We call each such data access a partial data access (PDA); the number of PDAs depends on how much data we load each time.
In our approach, after the first PDA the data is clustered into c partitions using fuzzy c-means, and the data in memory is then condensed into c weighted points to be clustered with the new points loaded in the next PDA. We call them “weighted” points because each is associated with a weight, calculated by summing the memberships of the examples in its cluster. In each subsequent PDA, new singleton points are loaded into memory and clustered along with the c weighted points obtained from the previous clustering; we call this partial data clustering (PDC). After each PDC, the points in memory are condensed again into c new, higher-weight points and clustered with the examples loaded in the next PDA. This continues until all the data has been scanned once. The objective function of fuzzy c-means was modified to accommodate the effect of the weights.
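To make the weighting concrete, one standard way to fold weights into the FCM objective (the exact form used in the paper may differ slightly) is to let each point x_k carry a weight w_k, equal to 1 for singletons and to the accumulated membership sum for condensed points:

$$
J_m(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} w_k \, u_{ik}^{m} \, \lVert x_k - v_i \rVert^2,
\qquad
v_i = \frac{\sum_{k=1}^{n} w_k \, u_{ik}^{m} \, x_k}{\sum_{k=1}^{n} w_k \, u_{ik}^{m}},
$$

where the membership update is the same as in standard FCM, since w_k cancels when the memberships of each point are normalized. Below is a minimal NumPy sketch of this weighted FCM and of the single-pass loop built on it; the function names, defaults, and the `pdas` iterator are illustrative assumptions, not the released code.

```python
import numpy as np

def weighted_fcm(X, c, w, m=2.0, iters=100, tol=1e-5, init=None, seed=0):
    """Weighted fuzzy c-means sketch: each row X[k] counts with weight w[k]."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), c, replace=False)] if init is None else init.copy()
    for _ in range(iters):
        d2 = ((X[None] - V[:, None]) ** 2).sum(-1) + 1e-12  # squared distances, (c, n)
        U = d2 ** (-1.0 / (m - 1.0))
        U /= U.sum(axis=0, keepdims=True)                   # standard membership update
        um = (U ** m) * w                                   # weight each point's column
        V_new = um @ X / um.sum(axis=1, keepdims=True)      # weighted centroid update
        if np.abs(V_new - V).max() < tol:
            return V_new, U
        V = V_new
    return V, U

def single_pass_fcm(pdas, c, m=2.0):
    """One scan over the data: each PDA's singletons are clustered together
    with the c weighted points condensed from the previous PDC."""
    V, w_cond = None, None
    for X_new in pdas:                       # pdas yields one chunk per PDA
        if V is None:                        # first PDA: singletons only
            X, w = X_new, np.ones(len(X_new))
        else:                                # append the c condensed weighted points
            X = np.vstack([X_new, V])
            w = np.concatenate([np.ones(len(X_new)), w_cond])
        V, U = weighted_fcm(X, c, w, m, init=V)
        w_cond = U @ w                       # condense: weights = summed memberships
    return V, w_cond
```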
Source code for single pass FCM can be downloaded here.
Consider a data set that has a maximum of c classes. Due to the constraints of limited memory and computation time, an online algorithm can load only a small amount of data at a time, depending on the speed of data arrival and hardware capability. We assume data both arrives and is processed in chunks (equal to the buffer size); that is, n1 data points arrive at time t1, n2 at t2, and so on. In the worst case a given chunk might come from one class only, and in the best case from all c classes. So, if we always set the number of clusters to c (the highest resolution), what effect will this have on different class mixtures?
Case A: If fewer than c classes come in a chunk, then we are overclustering. Overclustering causes no information loss, as it will likely split homogeneous clusters; information loss occurs only when we undercluster, which may group dissimilar examples together (a toy sketch of this case follows Case B).
Case B: If all c classes come in a chunk, then we are neither overclustering nor underclustering. So setting the number of clusters equal to c, the maximum number of classes in the data set, should not cause any information loss.
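As a toy illustration of Case A (synthetic data, not from the paper), overclustering a chunk drawn from a single class with c = 2 produces two condensed points that both lie inside that class, so a later merge step can still group them back together and nothing is lost:

```python
import numpy as np

# Toy Case A, using the weighted_fcm sketch from above.
rng = np.random.default_rng(0)
one_class = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # one homogeneous class
V, U = weighted_fcm(one_class, c=2, w=np.ones(500))
print(V)                  # two centroids, both inside the single class
print(U @ np.ones(500))   # the 500 units of membership weight split between them
```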
So we set the number of clusters to c in each chunk. After clustering the data arriving at each time instant (a chunk) with FCM, memory is freed by condensing the clustering solution into c weighted points, represented by the c cluster centroids obtained from clustering. The summarization of clustering results involves the U matrix, which captures the fuzziness of each example's membership in each cluster; this is one key difference between our summarization and that of crisp algorithms.
For example, if there are ni examples in memory at time instant ti, then after clustering, memory is freed by summarizing the clustering result as c weighted centroids, which are saved to disk. At the end, the weighted centroids of all chunks form an ensemble of weighted clustering solutions, which is then merged into the final c clusters. The merging operation clusters all the weighted centroids in the ensemble using their weights; weighted FCM (WFCM) is used for this purpose, the only difference being that there are no singleton points now. To speed up clustering, we initialize the clustering of each chunk with the final centroids obtained from clustering the previous chunk. This knowledge propagation allows faster convergence, provided the distribution does not change rapidly, which will often be the case.
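Putting the online variant together, the sketch below reuses the hypothetical weighted_fcm from above: each chunk is clustered with a warm start from the previous chunk's centroids, condensed to c weighted centroids, and the ensemble is finally merged with WFCM. Again, this is an illustrative sketch under those assumptions, not the released code.

```python
import numpy as np

def online_fcm(chunks, c, m=2.0):
    """Cluster each arriving chunk, keep only c weighted centroids per chunk,
    then merge the whole ensemble with weighted FCM."""
    ensemble_V, ensemble_w = [], []
    V_prev = None
    for X in chunks:                        # chunk of n_i points arriving at t_i
        w = np.ones(len(X))                 # every point in a chunk is a singleton
        V, U = weighted_fcm(X, c, w, m, init=V_prev)  # warm start: knowledge propagation
        ensemble_V.append(V)
        ensemble_w.append(U @ w)            # weights = summed memberships per cluster
        V_prev = V
    # Merge step: cluster the ensemble of weighted centroids; no singletons now.
    V_final, _ = weighted_fcm(np.vstack(ensemble_V), c, np.concatenate(ensemble_w), m)
    return V_final
```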
Source code for online FCM can be downloaded here.