A Scalable Classifier

Posted: May 27, 2013 in Uncategorized


Classification is an important problem in the emerging field of data mining. Although classification has been studied extensively in the past, most of the classification algorithms are designed only for memory-resident data, thus limiting their suitability for data mining in large data sets. This paper discusses issues in building a scalable classifier and presents a data structure called B-tree which efficiently fetches the records from secondary storage devices using indexing.

System requirements :
Since the project is compiled in java SE 1.6 .  you are required to have Java Runtime Environment (JRE) 6 .Please set your Path environment variable before running the software.


Classification is an important data mining problem and can be described as follows. The input data, also called the training set, consists of multiple records, each having multiple attributes or features. Additionally, each example is tagged with a special class label. The objective of classification is to analyze the input data and to develop an accurate description or model for each class using the features present in the data. Another set of input data, also called the test set, consists of multiple records, each having attributes along with class label as that of training set. This test set is used to find how best the constructed model is (accuracy). The class descriptions are used to classify future unknown data for which the class labels are unknown. They can also be used to develop a better understanding of each class in the data. Applications of classification include credit approval, target marketing, medical diagnosis, treatment effectiveness, store location, etc.

In data mining applications, very large training sets with several million examples are common. Our primary motivation in this work is to design a classifier that scales well and can handle training data of this magnitude. The ability to classify larger training data can also improve the classification accuracy.

Below Table shows an example of a training set describing promotion list of the employees in an organization. Each tuple has four attributes- Designation, Education, Experience and Eligible for promotion. Designation is a categorical attribute having three possible values – Scientist, Sr. Scientist and Programmer. Similarly Education also is a categorical attribute having three values – Degree, PG and Ph.D. Experience is a numerical attribute ranging from 0 to 11 and ‘Eligible for promotion’ is the class label indicating whether an employee is eligible for next promotion or not.

Designation Education Experience Label
Scientist PG 7 No
Programmer Degree 8 Yes
Scientist Ph.D 6 Yes
Sr. Scientist Ph.D 3 Yes
Programmer Degree 6 No
Scientist PG 11 Yes
Programmer Degree 6 No
Sr. Scientist Degree 5 No

Note: Here the Label is nothing but  “Eligible for promotion”

Classification can be viewed as a two-step process. First, a model is built from a training set to describe a predetermined set of classes or concepts. The model is represented in the form of decision trees, IF-THEN rules, mathematical formulas etc… then, the model can be used to predict the class labels of future unseen data, whose class labels are not known. Given the training set in table 2.2, it is possible to construct a classifier in the  form of IF-THEN rules in first step, suppose our classifier is as follows:

IF(Designation=Scientist) ∧ (Education=PG) ∧ (Experience>10) THEN (Eligible for promotion=Yes)

Then in the second step, we use this model for classifying the test data whose class label is unknown. Suppose the test data is;

{Designation=Scientist, Education=PG, Experience=7}

Based on our classifier, we classify that the class label for this test data is ‘No’.


If you want this tool, send your request to rishi.techworld@gmail.com. I will send you with documentation and  few restrictions ASAP.

Thank you,

Prabhakara Rao B.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s