NTU Management Review (臺大管理論叢), Vol. 27, No. 2S, p. 49
were implemented using Microsoft Visual C++ 2010. Section 4.1 introduces the datasets.
Sections 4.2 and 4.3 present the two sets of experiments, respectively.
4.1 The Synthetic and Real Datasets
The procedure for generating the synthetic datasets is as follows. A transaction contains
at most 15 attributes, i.e., the length of a transaction is at most 15. An interval of quantitative
values between 0 and 511 is generated for each attribute. The number of transactions, i.e., the
size of a dataset, is set at 100,000 for each synthetic dataset.
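The generation procedure above can be sketched in Python as follows. This is only an illustrative sketch, not the generator used in the experiments: the text fixes the maximum transaction length (15), the value range (0 to 511), and the dataset size (100,000), but it does not state how lengths, attributes, or interval endpoints are drawn. The sketch assumes a universe of 15 attributes, a uniformly chosen transaction length, and uniformly drawn integer endpoints; the function names are hypothetical.

```python
import random

MAX_ATTRIBUTES = 15             # from the text: at most 15 attributes per transaction
VALUE_MIN, VALUE_MAX = 0, 511   # from the text: quantitative values lie in [0, 511]


def generate_transaction(rng):
    """Return a dict mapping attribute index -> (low, high) interval."""
    # Assumption: the transaction length is uniform on 1..15.
    length = rng.randint(1, MAX_ATTRIBUTES)
    # Assumption: attributes come from a universe of 15 and are distinct.
    attributes = sorted(rng.sample(range(MAX_ATTRIBUTES), length))
    transaction = {}
    for attr in attributes:
        # Assumption: both endpoints are uniform integers; order them so low <= high.
        a = rng.randint(VALUE_MIN, VALUE_MAX)
        b = rng.randint(VALUE_MIN, VALUE_MAX)
        transaction[attr] = (min(a, b), max(a, b))
    return transaction


def generate_dataset(num_transactions, seed=0):
    """Generate a list of synthetic transactions with a reproducible seed."""
    rng = random.Random(seed)
    return [generate_transaction(rng) for _ in range(num_transactions)]
```

Calling `generate_dataset(100_000)` would produce a dataset of the size described in the text; the seed is fixed only so that repeated runs give the same data.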
The first real dataset, AirQuality, contains indices of the daily air quality in Taiwan in
2008, such as the concentrations of suspended particulates, sulfur
dioxide, and nitrogen dioxide. The data was collected by the Taiwan Environmental
Protection Administration and can be downloaded from the Environmental Protection
Administration (2015). We selected five indices from the AirQuality dataset, namely,
suspended particulates, sulfur dioxide, nitrogen dioxide, carbon monoxide, and ozone.
According to Environmental Protection Administration (2015), these are the key indices used
to measure the level of air quality. For every observation station, the original AirQuality
dataset lists hourly readings for each index. However, to mine FU2Ps, we modified the dataset
to obtain five intervals formed by the daily minimum and maximum values of the five
indices at each observation station. There are 26,527 transactions in total.
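The reduction from hourly readings to daily intervals can be sketched as below. This is a minimal illustration under stated assumptions: the input layout (tuples of station, date, index name, and value) and the function name `daily_intervals` are hypothetical, since the text does not describe the file format of the original dataset.

```python
from collections import defaultdict


def daily_intervals(readings):
    """Collapse hourly readings into one transaction per (station, date).

    `readings` is an iterable of (station, date, index_name, value) tuples
    (an assumed layout). The result maps each (station, date) pair to a dict
    of index_name -> (daily_min, daily_max).
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for station, date, index_name, value in readings:
        grouped[(station, date)][index_name].append(value)
    # Each index contributes one interval bounded by its daily extremes.
    return {
        key: {name: (min(vals), max(vals)) for name, vals in per_index.items()}
        for key, per_index in grouped.items()
    }
```

With five indices selected, each resulting transaction holds five intervals, one per index, for a given station and day.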
The second real dataset, DY2009, contains a variety of data on daily weather conditions
in Taiwan in 2009, including atmospheric pressure, temperature, and relative humidity
readings. The data was collected by the Department of Atmospheric Science, National
Taiwan University, and can be downloaded from the Taiwan Typhoon and Flood Research
Institute (2015). For each transaction, we selected the following five attributes from the
DY2009 dataset: atmospheric pressure at the observation stations, atmospheric pressure at
sea level, temperature, vapor pressure, and relative humidity. The original dataset is reduced
to a set of transactions, each of which contains five intervals that indicate the daily minimum
and maximum values of the five attributes with respect to an observation station. There are
8,187 transactions in the DY2009 dataset. In addition, the probability density function
associated with each interval in the synthetic and real datasets is set to a uniform
distribution.
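To make the uniform assumption concrete, the sketch below computes the probability that a value described by an interval [low, high] falls inside a query range [a, b]. The function name and the handling of a degenerate interval (where the daily minimum equals the maximum) are assumptions made for illustration; the text does not say how such cases are treated.

```python
def uniform_probability(low, high, a, b):
    """Return P(a <= X <= b) for X uniformly distributed on [low, high]."""
    if low > high or a > b:
        raise ValueError("interval bounds must satisfy low <= high and a <= b")
    if low == high:
        # Assumption: a degenerate interval places all mass on a single value.
        return 1.0 if a <= low <= b else 0.0
    # Under a uniform density 1 / (high - low), the probability is the
    # length of the overlap divided by the length of the interval.
    overlap = min(high, b) - max(low, a)
    return max(overlap, 0.0) / (high - low)
```

For example, an interval [0, 10] assigns probability 0.5 to the range [2, 7], because the overlap has length 5 and the interval has length 10.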
4.2 The Experiment Results under Various Parameter Settings
The SFC algorithm uses three parameters, i.e., w, ξ, and δ, in the clustering process. To