DATA MINING

INTRODUCTION

Data mining is the process of sorting through large data sets to identify patterns and establish relationships that help solve problems through data analysis. Data mining tools allow enterprises to predict future trends. In data mining, association rules are created by analyzing data for frequent if/then patterns, then using the support and confidence criteria to locate the most important relationships within the data. Support is how frequently the items appear together in the database, while confidence is the proportion of cases in which the if/then statement holds when its “if” part is present.
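
To make support and confidence concrete, here is a minimal Python sketch. The shopping baskets and the helper names `support` and `confidence` are made up for illustration; they are assumptions, not part of any particular data mining tool.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]


def count_containing(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in transactions if itemset <= basket)


def support(itemset, transactions):
    """Fraction of all transactions that contain every item in `itemset`."""
    return count_containing(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing `antecedent`, the fraction that also
    contain `consequent` (the rule "if antecedent then consequent")."""
    base = count_containing(antecedent, transactions)
    if base == 0:
        return 0.0
    both = count_containing(set(antecedent) | set(consequent), transactions)
    return both / base


rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"support(bread, milk) = {rule_support:.2f}")
print(f"confidence(bread -> milk) = {rule_confidence:.2f}")
```

In this toy data, bread and milk appear together in three of five baskets, and three of the four baskets that contain bread also contain milk. Real association-rule miners typically apply minimum thresholds on both measures to keep only the strongest rules.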

Data mining techniques are used in many research areas, including mathematics, cybernetics, genetics and marketing. These techniques are a means to drive efficiencies and predict customer behavior; used correctly, they let a business set itself apart from its competition through predictive analysis. Web mining, a type of data mining used in customer relationship management, integrates information gathered by traditional data mining methods and techniques with information gathered over the web. Web mining aims to understand customer behavior and to evaluate how effective a particular website is.

In general, the benefits of data mining come from the ability to uncover hidden patterns and relationships in data that can be used to make predictions that impact businesses. Specific data mining benefits vary depending on the goal and the industry. Sales and marketing departments can mine customer data to improve lead conversion rates or to create one-to-one marketing campaigns. Data mining information on historical sales patterns and customer behaviors can be used to build prediction models for future sales, new products and services.

DEFINITION OF DATA MINING

Data mining is a process used by companies to turn raw data into useful information. By using software to look for patterns in large batches of data, businesses can learn more about their customers to develop more effective marketing strategies, increase sales and decrease costs. Data mining depends on effective data collection, warehousing, and computer processing.

The data mining process breaks down into five steps. First, organizations collect data and load it into their data warehouses. Next, they store and manage the data, either on in-house servers or the cloud. Business analysts, management teams and information technology professionals access the data and determine how they want to organize it. Then, application software sorts the data based on the user's results, and finally, the end-user presents the data in an easy-to-share format, such as a graph or table.
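
The five steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sales records are invented, a plain list stands in for the data warehouse, and the function names (`organize_by_region`, `sort_totals`, `format_table`) are hypothetical.

```python
from collections import defaultdict

# Step 1: collect raw data (hypothetical sales records, illustrative only).
raw_records = [
    {"region": "north", "amount": 120.0},
    {"region": "south", "amount": 80.0},
    {"region": "north", "amount": 60.0},
    {"region": "east", "amount": 150.0},
    {"region": "south", "amount": 40.0},
]

# Step 2: store and manage the data (a plain list stands in for a warehouse).
warehouse = list(raw_records)


# Step 3: decide how to organize the data (here: total sales per region).
def organize_by_region(records):
    totals = defaultdict(float)
    for record in records:
        totals[record["region"]] += record["amount"]
    return dict(totals)


# Step 4: sort the organized data (highest total first).
def sort_totals(totals):
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


# Step 5: present the result in an easy-to-share format (a small text table).
def format_table(rows):
    lines = [f"{'region':<8}{'total':>8}"]
    lines += [f"{region:<8}{total:>8.2f}" for region, total in rows]
    return "\n".join(lines)


ranked = sort_totals(organize_by_region(warehouse))
print(format_table(ranked))
```

In practice each step would be handled by dedicated systems (databases, warehouse software, reporting tools); the sketch only shows how the output of one step feeds the next.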

HISTORY OF DATA MINING

Data mining is everywhere, but its story starts many years before Moneyball and Edward Snowden. The following are major milestones and “firsts” in the history of data mining plus how it’s evolved and blended with data science and big data.

Data mining is the computational process of exploring and uncovering patterns in large data sets, a.k.a. Big Data. It’s a subfield of computer science that blends many techniques from statistics, data science, database theory and machine learning.

- Year 1763: Thomas Bayes’ paper on a theorem relating current probability to prior probability, now called Bayes’ theorem, is published posthumously. It is fundamental to data mining and probability, since it allows understanding of complex realities based on estimated probabilities.

- Year 1805: Adrien-Marie Legendre and Carl Friedrich Gauss apply regression to determine the orbits of bodies about the Sun (comets and planets). The goal of regression analysis is to estimate the relationships among variables, and the specific method they used in this case is the method of least squares. Regression is one of the key tools in data mining.

- Year 1936: This is the dawn of the computer age, which makes possible the collection and processing of large amounts of data. In a 1936 paper, On Computable Numbers, Alan Turing introduced the idea of a Universal Machine capable of performing computations like our modern-day computers. The modern-day computer is built on the concepts pioneered by Turing.

- Year 1943: Warren McCulloch and Walter Pitts were the first to create a conceptual model of a neural network. In a paper entitled A logical calculus of the ideas immanent in nervous activity, they describe the idea of a neuron in a network. Each of these neurons can do 3 things: receive inputs, process inputs and generate output.

- Year 1965: Lawrence J. Fogel formed a new company called Decision Science, Inc. for applications of evolutionary programming. It was the first company specifically applying evolutionary computation to solve real-world problems.

- Year 1970s: With sophisticated database management systems, it’s possible to store and query terabytes and petabytes of data. In addition, data warehouses allow users to move from a transaction-oriented way of thinking to a more analytical way of viewing the data. However, extracting sophisticated insights from these data warehouses of multidimensional models is very limited.

- Year 1975: John Henry Holland wrote Adaptation in Natural and Artificial Systems, the ground-breaking book on genetic algorithms. It is the book that initiated this field of study, presenting the theoretical foundations and exploring applications.

- Year 1980s: HNC trademarks the phrase “database mining.” The trademark was meant to protect a product called DataBase Mining Workstation, a general-purpose tool for building neural network models that is no longer available. It’s also during this period that sophisticated algorithms can “learn” relationships from data, allowing subject matter experts to reason about what the relationships mean.

- Year 1989: The term “Knowledge Discovery in Databases” (KDD) is coined by Gregory Piatetsky-Shapiro. It is also at this time that he co-founds the first workshop, also named KDD.

- Year 1990s: The term “data mining” appears in the database community. Retail companies and the financial community use data mining to analyze data and recognize trends in order to increase their customer base and to predict fluctuations in interest rates, stock prices and customer demand.

- Year 1992: Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested an improvement on the original support vector machine which allows for the creation of nonlinear classifiers. Support vector machines are a supervised learning approach that analyzes data and recognizes patterns used for classification and regression analysis.

- Year 1993: Gregory Piatetsky-Shapiro starts the newsletter Knowledge Discovery Nuggets (KDnuggets). It was originally meant to connect researchers who attended the KDD workshop. However, KDnuggets.com seems to have a much wider audience now.

- Year 2001: Although the term data science has existed since the 1960s, it wasn’t until 2001 that William S. Cleveland introduced it as an independent discipline.

- Year 2003: Moneyball, by Michael Lewis, is published and changes the way many major league front offices do business. The Oakland Athletics used a statistical, data-driven approach to select for qualities in players that were undervalued and cheaper to obtain. In this manner, they successfully assembled a team that reached the 2002 and 2003 playoffs with 1/3 the payroll.

- Year 2015: In February 2015, DJ Patil became the first Chief Data Scientist at the White House. Today, data mining is widespread in business, science, engineering and medicine just to name a few. Mining of credit card transactions, stock market movements, national security, genome sequencing and clinical trials are just the tip of the iceberg for data mining applications. Terms like Big Data are now commonplace with the collection of data becoming cheaper and the proliferation of devices capable of collecting data.

- Present (2020): Finally, one of the most active techniques being explored today is Deep Learning. Capable of capturing dependencies and complex patterns far beyond other techniques, it is reigniting some of the biggest challenges in the world of data mining, data science and artificial intelligence.

TECHNIQUES OF DATA MINING

Data mining is highly effective, so long as it draws upon one or more of these techniques:

1. Tracking patterns. One of the most basic techniques in data mining is learning to recognize patterns in your data sets. This is usually a recognition of some aberration in your data happening at regular intervals, or an ebb and flow of a certain variable over time.

2. Classification. Classification is a more complex data mining technique that requires you to collect various attributes together into discernible categories, which you can then use to draw further conclusions or to serve some function.

3. Association. Association is related to tracking patterns, but is more specific to dependently linked variables. In this case, you’ll look for specific events or attributes that are highly correlated with another event or attribute.

4. Outlier detection. In many cases, simply recognizing the overarching pattern can’t give you a clear understanding of your data set. You also need to be able to identify anomalies, or outliers in your data.

5. Clustering. Clustering is very similar to classification, but involves grouping chunks of data together based on their similarities rather than assigning them to categories defined in advance.

6. Regression. Regression, used primarily as a form of planning and modeling, identifies the likely value of a certain variable, given the values of other variables.

7. Prediction. Prediction is one of the most valuable data mining techniques, since it’s used to project the types of data you’ll see in the future. In many cases, just recognizing and understanding historical trends is enough to chart a somewhat accurate prediction of what will happen in the future.
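
Three of the techniques above (regression, outlier detection and prediction) can be combined in one small Python sketch. Everything specific here is an assumption made for illustration: the monthly sales figures are invented, the regression is a simple least-squares line, and the outlier rule (a residual more than two standard deviations from the fitted line) is just one common convention.

```python
from statistics import mean, pstdev

# Hypothetical monthly sales (illustrative only); month 6 is deliberately odd.
months = [1, 2, 3, 4, 5, 6, 7, 8]
sales = [10.0, 12.0, 14.0, 16.0, 18.0, 40.0, 22.0, 24.0]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def find_outliers(xs, ys, threshold=2.0):
    """Return the x values whose residual from the fitted line is larger
    than `threshold` standard deviations of all residuals."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    if spread == 0:
        return []
    return [x for x, r in zip(xs, residuals) if abs(r) > threshold * spread]


# Outlier detection: flag months that do not follow the overall trend.
outliers = find_outliers(months, sales)

# Regression: refit the trend line on the remaining points.
clean = [(x, y) for x, y in zip(months, sales) if x not in outliers]
slope, intercept = fit_line([x for x, _ in clean], [y for _, y in clean])

# Prediction: project the fitted trend one month ahead.
forecast = slope * 9 + intercept
print("outlier months:", outliers)
print(f"forecast for month 9: {forecast:.1f}")
```

Dropping the flagged month before refitting is a design choice, not a rule: in some settings the anomaly is exactly the signal of interest and should be investigated rather than discarded.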


ENEMS MICROSYSTEMS RESEARCH SPOT

Saturday, 28 May 2022
