臺大管理論叢 NTU Management Review VOL.27 NO.2S

為頻繁單變量不確定樣式產生摘要

uncertain data have also been addressed (Liu and Wang, 2013; Liu, 2014); Liu (2013) also

explored mining univariate uncertain data streams. For mining tuple uncertain data, Sun et

al. (2010) proposed two algorithms, the p-Apriori and TODIS, to mine frequent patterns by

utilizing the bottom-up framework and the top-down framework, respectively. Several

studies contributed to the literature on retrieving frequent patterns from itemset uncertain

data. In general, they can be classified as the Apriori-based (Abd-Elmegid, El-Sharkawi,

El-Fangary, and Helmy, 2010; Chui and Kao, 2008; Chui, Kao, and Hung, 2007), the

FP-growth-based (Leung, Carmichael, and Hao, 2007; Leung, Mateo, and Brajczuk, 2008),

and the H-mine-based methods (Aggarwal, Li, Wang, and Wang, 2009). Of these, the

H-mine-based method performed better than the other methods (Aggarwal et al., 2009).

Recently, Muzammal and Raman (2010a, 2010b, 2011) proposed pioneering studies for

mining sequential patterns from uncertain data by utilizing the candidate generate-and-test

methodology. Later, Muzammal (2011) extended their studies by using the pattern growth

methodology. In their studies, the researchers identified two kinds of uncertainty in the

sequence database: source-level uncertainty and event-level uncertainty. The former means

that the source that generates each transaction is uncertain. The latter means that each

transaction in a sequence is uncertain, i.e., it contains tuple uncertain data.

As we state in the end of Section 1, all of the above discussed methods concerning

concise representations handle precise data. None of them can be directly applied to

univariate uncertain data. Furthermore, to the best of our knowledge, construction of a

concise representation of frequent patterns in univariate uncertain data has not been

addressed in the literature. In this study, we propose a novel algorithm for summarizing the

full set of FU2Ps. Our proposed algorithm, the SFC (Summarize FU2Ps by Clustering)

algorithm, is a lossy approximation method. Aside from mining univariate uncertain data, the

SFC algorithm also differs from most of the existing approximation methods that also adopt

the clustering technique in that our method uses real frequent patterns (medoids) as

representative patterns, instead of using virtual patterns (means). The medoid of a cluster is a

frequent pattern in the cluster that has the smallest average distance between this frequent

pattern and other frequent patterns in the cluster. The mean of a cluster is usually the union

of the frequent patterns in the cluster. Using means is prone to bias caused by outliers or

noises, and the means are not real patterns, which may mislead users. In contrast, outliers or

noises do not affect mediods, and medoids are real patterns. In addition, a very important

advantage of using medoids is that the clustering process is greatly accelerated by recording

the distance between each pair of frequent patterns in advance. During the clustering process,