MIDTERM-ADVANCED DBM C3 L2

Jamaica Rose Gilo

97 questions • 11/3/2024

Question list

fundamental technique in data mining used to classify data into different categories based on a set of predefined rules

Rule-Based Classification

Rules used in Rule-Based Classification are often derived from what, making the process both intuitive and interpretable?

Data itself

supervised learning task where a model learns to map input data to a specific category or label

Classification

The goal is to predict the class of unseen instances based on what was learned from the training data.

Classification

takes the form: IF condition THEN class

Rules

can be applied sequentially to classify data

Rule Set

Rules are often ordered by a specific priority or confidence level.

Rule Set
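
A minimal sketch of an ordered rule set, assuming each rule is a (condition, class) pair and that the first matching rule decides the class. The weather attributes and the rules themselves are hypothetical and only illustrate the IF condition THEN class form.

```python
# Hypothetical rules: each entry is (condition, class).
# The list order encodes priority: earlier rules are tried first.
rules = [
    (lambda x: x["outlook"] == "sunny" and x["humidity"] == "high", "no"),
    (lambda x: x["outlook"] == "rainy" and x["windy"], "no"),
    (lambda x: True, "yes"),  # default rule, fires when nothing above matches
]


def classify(instance, rule_set):
    """Return the class of the first rule whose condition holds."""
    for condition, label in rule_set:
        if condition(instance):
            return label
    return None  # only reached if the rule set has no default rule


print(classify({"outlook": "sunny", "humidity": "high", "windy": False}, rules))
print(classify({"outlook": "overcast", "humidity": "high", "windy": True}, rules))
```

The default rule at the end guarantees that every instance receives a class.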

How does Rule-Based Classification work?

Rule Generation, Rule Evaluation, Rule Matching, Conflict Resolution

Rules are generated from the training data based on the relationships between different features and the target class.

Rule Generation

Techniques such as decision trees (e.g., ID3, C4.5), association rule mining (e.g., Apriori), or direct heuristics (e.g., covering algorithms) can be used to extract rules

Rule Generation

Rules that correctly classify more instances with high accuracy are generally preferred.

Rule Evaluation

Each rule is evaluated based on its accuracy or coverage.

Rule Evaluation
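
Coverage and accuracy can be computed directly from labeled data. The sketch below assumes coverage means the fraction of all instances a rule matches, and accuracy means the fraction of matched instances whose label equals the rule's class. The data rows and the function name are made up for illustration.

```python
def evaluate_rule(condition, predicted_class, data):
    """Return (coverage, accuracy) of one IF-THEN rule on labeled data."""
    covered = [row for row in data if condition(row)]
    coverage = len(covered) / len(data)
    correct = sum(1 for row in covered if row["label"] == predicted_class)
    accuracy = correct / len(covered) if covered else 0.0
    return coverage, accuracy


# Made-up training rows.
data = [
    {"humidity": "high", "label": "no"},
    {"humidity": "high", "label": "no"},
    {"humidity": "high", "label": "yes"},
    {"humidity": "normal", "label": "yes"},
]

# Rule under test: IF humidity = high THEN no
coverage, accuracy = evaluate_rule(lambda r: r["humidity"] == "high", "no", data)
print(coverage, accuracy)
```

Here the rule matches 3 of 4 rows and is correct on 2 of those 3.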

When classifying a new instance, the rule-based system evaluates which rules match the instance’s attributes.

Rule Matching

In some cases, multiple rules might apply to a single instance, leading to conflicts.

Conflict Resolution

Conflict Resolution Strategies

Rule Priority, Voting

Several rules contribute votes to different classes, and the majority vote determines the class.

Voting

More specific rules or rules with higher confidence may take precedence

Rule Priority
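
The two strategies can give different answers for the same instance. The sketch below assumes that the rules matching an instance are already known and that each carries a confidence score. The class names and scores are hypothetical.

```python
from collections import Counter

# Hypothetical rules that all matched one instance: (class, confidence).
matched = [("fraud", 0.90), ("legit", 0.60), ("legit", 0.55)]


def resolve_by_priority(matched_rules):
    """Rule priority: the single most confident rule decides."""
    return max(matched_rules, key=lambda rule: rule[1])[0]


def resolve_by_voting(matched_rules):
    """Voting: each matched rule casts one vote, the majority class wins."""
    votes = Counter(label for label, _ in matched_rules)
    return votes.most_common(1)[0][0]


print(resolve_by_priority(matched))  # fraud
print(resolve_by_voting(matched))    # legit
```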

These methods generate classification rules directly from the data without creating an intermediate model.

Direct Methods

an efficient algorithm for generating classification rules

RIPPER

RIPPER stands for?

Repeated Incremental Pruning to Produce Error Reduction

This is a simple algorithm that generates rules based on a single attribute at a time.

OneR

OneR stands for?

One Rule
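
A compact sketch of the OneR idea: for each attribute, predict the majority class for each of its values, count the errors, and keep the attribute with the fewest errors. The dataset and the function name are illustrative assumptions, and ties between attributes are broken by whichever comes first.

```python
from collections import Counter, defaultdict


def one_r(rows, attributes, target):
    """Return (best_attribute, value -> class rule, error_count)."""
    best = None
    for attr in attributes:
        # Count class frequencies for every value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # One rule per value: predict the majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the instances that are not in the majority class.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


# Made-up training data.
rows = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "overcast", "windy": True, "play": "yes"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": True, "play": "no"},
]

attr, rule, errors = one_r(rows, ["outlook", "windy"], "play")
print(attr, errors)
```

On this data, "outlook" makes one error and "windy" makes two, so the single attribute chosen is "outlook".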

These methods derive rules from another classification model.

Indirect Methods

It’s possible, though more complex, to extract rules from this by analyzing the learned weights.

Neural Networks

A model that can be converted into a set of classification rules

Decision Trees
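
Converting a tree to rules amounts to reading off every root-to-leaf path. The sketch below assumes a hand-written tree in a hypothetical nested format, where an internal node is (attribute, {value: subtree}) and a leaf is a class label. It does not learn the tree from data.

```python
# Hypothetical tree: internal node = (attribute, {value: subtree}), leaf = class label.
tree = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {True: "no", False: "yes"}),
})


def tree_to_rules(node, conditions=()):
    """Return one (conditions, class) rule per root-to-leaf path."""
    if not isinstance(node, tuple):  # leaf reached: emit the collected path
        return [(list(conditions), node)]
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules


extracted = tree_to_rules(tree)
for conds, label in extracted:
    condition_text = " AND ".join(f"{a} = {v}" for a, v in conds)
    print(f"IF {condition_text} THEN {label}")
```

Each leaf yields exactly one rule, so this tree with five leaves produces five rules.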

This makes rule-based classifiers highly interpretable compared to other methods like neural networks or support vector machines.

Interpretability

The rules generated by this method are easy to understand and explain.

Interpretability

Since rules are explicit, users can easily track how decisions are made, which is crucial for applications in sensitive fields like medical diagnosis or finance.

Transparency

The system can be extended by adding more rules or adjusting the existing rules, allowing for greater adaptability to changes in the environment or new data.

Flexibility

Rule-based systems often evaluate only a small subset of rules for each instance, reducing overall complexity.

Efficiency

This happens when many specific rules are generated that capture noise rather than general patterns.

Overfitting

Some rules may overlap and can complicate the classification process.

Rule Redundancy

Rule-based classifiers may struggle with intricate patterns that require more sophisticated models such as ensemble methods or deep learning.

Limited Performance on Complex Data

Applications of Rule-Based Classifiers

Medical Diagnosis, Fraud Detection, Customer Segmentation

Businesses can classify customers based on behavioral data to tailor marketing strategies.

Customer Segmentation

In banking and finance, rules can be used to detect fraudulent activities based on suspicious patterns.

Fraud Detection

Rule-based classifiers can generate transparent decision-making models to assist healthcare professionals in diagnosing diseases.

Medical Diagnosis

This offers an interpretable and intuitive approach to classifying data by deriving simple IF-THEN rules from training data.

Rule-Based Classification

While this method excels in domains that require transparency and ease of understanding, it may struggle with complex or noisy datasets.

Rule-Based Classification

With complex data, it is often used in conjunction with other classification techniques for enhanced performance in real-world applications.

Rule-Based Classification

A process used by companies to turn raw data into useful information.

Data Mining

an essential tool that allows us to turn raw data into actionable insights

Data Mining

6 Data Mining Tasks

Classification, Clustering, Regression, Association Rule Learning, Anomaly Detection, Summarization

It's about predicting the category or class of a data point based on past data

Classification

An example of this is: Predicting whether an email is spam or not spam (spam filtering).

Classification

The algorithm is trained on labeled data (data with known categories) and learns to assign new data points to one of these categories.

Classification

Applications of classification

Email Filtering, Medical Diagnosis, Sentiment Analysis, Image Recognition

Spam detection

Email Filtering

Classifying diseases based on symptoms and test results.

Medical Diagnosis

Determining whether a piece of text expresses positive, negative, or neutral sentiment.

Sentiment Analysis

Classifying images into categories (e.g., identifying objects in photos).

Image Recognition

It involves grouping similar data points together based on their characteristics without predefined labels.

Clustering

An example of this is: Grouping customers based on their shopping habits.

Clustering

Data points that are similar to each other are clustered into groups, helping to discover natural structures within the data.

Clustering

Clustering Algorithms

K-means, Hierarchical Clustering, DBSCAN, Gaussian Mixture Models

Partitions data into a fixed number of clusters based on distance to centroids.

K-Means
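
A minimal one-dimensional sketch of the K-means loop, assuming fixed starting centroids and a fixed number of iterations rather than a convergence check. The points and starting centroids are made up.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: each point joins its nearest centroid.
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]  # keep an empty cluster's centroid
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

The number of clusters is fixed by the number of starting centroids, which matches the "fixed number of clusters" in the card above.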

Builds a tree of clusters based on similarity, allowing for a hierarchy of clusters.

Hierarchical Clustering

Groups together points that are closely packed, marking points in low-density regions as outliers.

DBSCAN

DBSCAN stands for

Density-Based Spatial Clustering of Applications with Noise

Assumes data points are generated from a mixture of several Gaussian distributions.

Gaussian Mixture Models

It is used to predict a numerical value based on the relationship between variables

Regression

An example of this is: Predicting house prices based on factors like location, size, and age.

Regression

The algorithm tries to fit a line (or curve) that best describes the relationship between variables to predict continuous values.

Regression

Common Regression Algorithms

Linear Regression, Polynomial Regression, Ridge and Lasso Regression, Decision Trees and Random Forests, Support Vector Regression

Models the relationship between dependent and independent variables as a straight line.

Linear Regression
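
For one independent variable, the best-fitting straight line under ordinary least squares has a closed form: slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). The sketch below uses that formula on made-up house sizes and prices, and the function name is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: price is exactly 3 times the size.
sizes = [50, 80, 100, 120]
prices = [150, 240, 300, 360]

slope, intercept = fit_line(sizes, prices)
predicted = slope * 90 + intercept  # prediction for an unseen size
print(slope, intercept, predicted)
```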

Extends linear regression by fitting a polynomial equation to the data.

Polynomial Regression

Regularization techniques that prevent overfitting by adding a penalty for large coefficients.

Ridge and Lasso Regression
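
The shrinking effect of the penalty is easiest to see with one variable. The sketch below covers ridge only (lasso is not shown) and assumes centered data with an unpenalized intercept, in which case the ridge slope is Sxy / (Sxx + alpha). The data and the name `alpha` for the penalty strength are illustrative.

```python
def ridge_slope(xs, ys, alpha):
    """Slope minimizing sum((y - b*x)^2) + alpha * b^2 on centered data."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / (sxx + alpha)  # alpha = 0 gives ordinary least squares


xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]

unpenalized = ridge_slope(xs, ys, alpha=0.0)
penalized = ridge_slope(xs, ys, alpha=5.0)
print(unpenalized, penalized)
```

A larger `alpha` pulls the coefficient toward zero, which is how the penalty discourages large coefficients.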

Can also be used for regression tasks by predicting values based on tree structures.

Decision Trees and Random Forests

Uses support vector machines for regression tasks.

Support Vector Regression

Applications of Regression

Financial Forecasting, Sales Prediction, Risk Assessment, Marketing Response Modeling

It is about finding relationships between variables in a large dataset.

Association Rule Learning

An example of this is: Market basket analysis, where you discover that customers who buy bread are likely to also buy butter

Association Rule Learning

It identifies sets of items that frequently appear together in the data (like discovering frequent itemsets).

Association Rule Learning
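
Two standard measures for a rule such as bread -> butter are support (how often the items appear together) and confidence (how often butter appears given bread). The sketch below computes both on made-up transactions. It does not search for frequent itemsets the way Apriori does.

```python
# Made-up market baskets.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) from the baskets."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(rule_support, rule_confidence)
```

Bread and butter appear together in 2 of 4 baskets, and 2 of the 3 baskets with bread also contain butter.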

Applications of Association Rule Learning

Market basket analysis to improve product placement and cross-selling strategies, Recommender systems in e-commerce, Customer segmentation and behavior analysis, Web usage mining to understand navigation patterns.

It identifies data points that deviate significantly from the normal pattern.

Anomaly Detection

An example of this is: Detecting fraudulent credit card transactions.

Anomaly Detection

In anomaly detection, the algorithm learns what normal data looks like, and anything that falls far outside this pattern is flagged as what?

Anomaly
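
One simple way to "learn what normal looks like" is to take the mean and standard deviation of past values and flag new values that lie many standard deviations away. This z-score approach is only one of several possible techniques. The threshold of 3 and the transaction amounts are assumptions for illustration.

```python
import statistics


def find_anomalies(history, new_values, threshold=3.0):
    """Flag values whose z-score against the history exceeds the threshold."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return [v for v in new_values if abs(v - mean) / stdev > threshold]


# Made-up past transaction amounts considered normal.
normal_amounts = [20, 22, 19, 21, 23, 20, 18, 22]

flagged = find_anomalies(normal_amounts, [21, 24, 500])
print(flagged)
```

Only the 500 transaction is flagged. The other two values stay within three standard deviations of the historical mean.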

3 types of Anomalies

Point Anomalies, Contextual Anomalies, Collective Anomalies

A single data point that deviates significantly from the rest of the dataset. For example, a sudden spike in credit card transactions could indicate fraud.

Point Anomalies

Anomalies that are only considered unusual in a specific context. For example, a high temperature reading is normal in summer but anomalous in winter.

Contextual Anomalies

A group of data points that collectively deviate from the expected pattern, even if individual points are not anomalies. For instance, a sudden increase in network traffic over a short period may indicate a DDoS attack.

Collective Anomalies

Applications of Anomaly Detection

Fraud detection in banking and e-commerce, Network intrusion detection to identify malicious activity, Fault detection in manufacturing processes or machinery, Health monitoring to detect unusual patterns in patient data, Quality control in production processes.

This creates a compact representation of the data, providing a summary of key information

Summarization

An example of this is: Summarizing a dataset by showing averages, counts, and other statistics.

Summarization
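
A small sketch of this kind of statistical summary, using only the standard library. The purchase records and the field layout are made up.

```python
import statistics
from collections import Counter

# Made-up records: (region, purchase amount).
purchases = [("north", 120.0), ("south", 80.0), ("north", 100.0), ("east", 60.0)]

amounts = [amount for _, amount in purchases]
summary = {
    "count": len(amounts),
    "mean": statistics.mean(amounts),
    "min": min(amounts),
    "max": max(amounts),
    "rows_per_region": dict(Counter(region for region, _ in purchases)),
}
print(summary)
```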

This task reduces the complexity of the data by generating overviews and simplified reports.

Summarization

2 types of summarization

Extractive Summarization, Abstractive Summarization

Involves selecting key sentences or phrases directly from the original text to create a summary.

Extractive Summarization

Techniques may include ranking sentences based on their importance using algorithms like TextRank or TF-IDF.

Extractive Summarization
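
The sketch below ranks sentences by the average frequency of their words and keeps the top ones in their original order. This is plain word-frequency scoring, a simpler stand-in for TF-IDF or TextRank rather than an implementation of either. The text and the function name are illustrative.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Select the highest-scoring sentences, unchanged, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return [s for s in sentences if s in top]


text = (
    "Data mining turns raw data into useful data patterns. "
    "The weather was nice. "
    "Companies use mining tools."
)
print(extractive_summary(text, num_sentences=1))
```

Because the output consists of sentences copied from the input, the method is extractive rather than abstractive.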

Involves generating new sentences that convey the main ideas of the text, potentially using different wording than the original.

Abstractive Summarization

Often utilizes advanced machine learning models, such as transformer-based architectures (e.g., BERT, GPT).

Abstractive Summarization

Applications of Summarization

News aggregation services that provide concise articles, Summarizing research papers or reports for quick understanding, Document summarization in legal and business contexts, Enhancing user experience in chatbots by providing brief responses

This helps in decision-making (e.g., identifying risks).

Classification

It helps in segmenting customers or products for targeted marketing.

Clustering

It helps in forecasting trends and making predictions.

Regression

It helps businesses understand customer behavior.

Association Rule Learning

This improves security by identifying irregularities.

Anomaly Detection

It simplifies data interpretation by providing high-level insights.

Summarization