Suppose we intend to cluster a large or very large data set stored on disk; by large or very large we mean that the data set exceeds the available memory. We assume the data is randomly ordered on disk and that we can load only a certain percentage of it, based on the available memory allocation. If we load 1% of the data into memory at a time, then we must do so 100 times to scan the entire data set. We call each such data access a partial data access (PDA); the number of PDAs depends on how much data we load each time.
In our approach, after the first PDA the data is clustered into c partitions using fuzzy c-means, and the data in memory is then condensed into c weighted points to be clustered with the new points loaded in the next PDA. We call them “weighted” points because each is associated with a weight, calculated by summing the memberships of the examples in its cluster. In each subsequent PDA, new singleton points are loaded into memory and clustered along with the c weighted points obtained from the previous clustering; we call this partial data clustering (PDC). After each PDC, the points in memory are condensed again into c new, higher-weight points and clustered with the examples loaded in the next PDA. This continues until all the data has been scanned once. The objective function of fuzzy c-means was modified to accommodate the effect of the weights.
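To make the weighting concrete, one standard way to fold weights into the FCM objective (the exact form used in the paper may differ slightly) is to let each point x_k carry a weight w_k, equal to 1 for singletons and to the accumulated membership sum for condensed points:

$$
J_m(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} w_k \, u_{ik}^{m} \, \lVert x_k - v_i \rVert^2,
\qquad
v_i = \frac{\sum_{k=1}^{n} w_k \, u_{ik}^{m} \, x_k}{\sum_{k=1}^{n} w_k \, u_{ik}^{m}},
$$

where the membership update is the same as in standard FCM, since w_k cancels when the memberships of each point are normalized. Below is a minimal NumPy sketch of this weighted FCM and of the single-pass loop built on it; the function names, defaults, and the `pdas` iterator are illustrative assumptions, not the released code.

```python
import numpy as np

def weighted_fcm(X, c, w, m=2.0, iters=100, tol=1e-5, init=None, seed=0):
    """Weighted fuzzy c-means sketch: each row X[k] counts with weight w[k]."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), c, replace=False)] if init is None else init.copy()
    for _ in range(iters):
        d2 = ((X[None] - V[:, None]) ** 2).sum(-1) + 1e-12  # squared distances, (c, n)
        U = d2 ** (-1.0 / (m - 1.0))
        U /= U.sum(axis=0, keepdims=True)                   # standard membership update
        um = (U ** m) * w                                   # weight each point's column
        V_new = um @ X / um.sum(axis=1, keepdims=True)      # weighted centroid update
        if np.abs(V_new - V).max() < tol:
            return V_new, U
        V = V_new
    return V, U

def single_pass_fcm(pdas, c, m=2.0):
    """One scan over the data: each PDA's singletons are clustered together
    with the c weighted points condensed from the previous PDC."""
    V, w_cond = None, None
    for X_new in pdas:                       # pdas yields one chunk per PDA
        if V is None:                        # first PDA: singletons only
            X, w = X_new, np.ones(len(X_new))
        else:                                # append the c condensed weighted points
            X = np.vstack([X_new, V])
            w = np.concatenate([np.ones(len(X_new)), w_cond])
        V, U = weighted_fcm(X, c, w, m, init=V)
        w_cond = U @ w                       # condense: weights = summed memberships
    return V, w_cond
```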
Source code for single pass FCM can be downloaded here.
Consider a data set that has a maximum of c classes. Due to the constraints of limited memory and computation time, an online algorithm can load only a small amount of data at a time, depending on the speed of data arrival and hardware capability. We assume data both arrives and is processed in chunks (equal to the buffer size); that is, n1 data points arrive at time t1, n2 at t2, and so on. In the worst case a given chunk might come from one class only, and in the best case from all c classes. So, if we always set the number of clusters to c (the highest resolution), what effect will this have on different class mixtures?
Case A: If fewer than c classes come in a chunk, then we are overclustering. Overclustering causes no information loss, as it will likely split homogeneous clusters; information loss occurs only when we undercluster, which may group dissimilar examples together (a toy sketch of this case follows Case B).
Case B: If all c classes come in a chunk, then we are neither overclustering nor underclustering. So setting the number of clusters equal to c, the maximum number of classes in the data set, should not cause any information loss.
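As a toy illustration of Case A (synthetic data, not from the paper), overclustering a chunk drawn from a single class with c = 2 produces two condensed points that both lie inside that class, so a later merge step can still group them back together and nothing is lost:

```python
import numpy as np

# Toy Case A, using the weighted_fcm sketch from above.
rng = np.random.default_rng(0)
one_class = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # one homogeneous class
V, U = weighted_fcm(one_class, c=2, w=np.ones(500))
print(V)                  # two centroids, both inside the single class
print(U @ np.ones(500))   # the 500 units of membership weight split between them
```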
So we set the number of clusters to c in each chunk. After clustering the data arriving at each time instant (a chunk) with FCM, memory is freed by condensing the clustering solution into c weighted points, represented by the c cluster centroids obtained from clustering. The summarization of clustering results involves the U matrix, which captures the fuzziness of each example's membership in each cluster; this is one key difference between our summarization and that of crisp algorithms.
For example, if there are ni examples in memory at time instant ti, then after clustering, memory is freed by summarizing the clustering result as c weighted centroids, which are saved to disk. At the end, the weighted centroids of all chunks form an ensemble of weighted clustering solutions, which is then merged into the final c clusters. The merging operation clusters all the weighted centroids in the ensemble using their weights; weighted FCM (WFCM) is used for this purpose, the only difference being that there are no singleton points now. To speed up clustering, we initialize the clustering of each chunk with the final centroids obtained from clustering the previous chunk. This knowledge propagation allows faster convergence, provided the distribution does not change rapidly, which will often be the case.
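Putting the online variant together, the sketch below reuses the hypothetical weighted_fcm from above: each chunk is clustered with a warm start from the previous chunk's centroids, condensed to c weighted centroids, and the ensemble is finally merged with WFCM. Again, this is an illustrative sketch under those assumptions, not the released code.

```python
import numpy as np

def online_fcm(chunks, c, m=2.0):
    """Cluster each arriving chunk, keep only c weighted centroids per chunk,
    then merge the whole ensemble with weighted FCM."""
    ensemble_V, ensemble_w = [], []
    V_prev = None
    for X in chunks:                        # chunk of n_i points arriving at t_i
        w = np.ones(len(X))                 # every point in a chunk is a singleton
        V, U = weighted_fcm(X, c, w, m, init=V_prev)  # warm start: knowledge propagation
        ensemble_V.append(V)
        ensemble_w.append(U @ w)            # weights = summed memberships per cluster
        V_prev = V
    # Merge step: cluster the ensemble of weighted centroids; no singletons now.
    V_final, _ = weighted_fcm(np.vstack(ensemble_V), c, np.concatenate(ensemble_w), m)
    return V_final
```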
Source code for online FCM can be downloaded here.