EXPERIMENT 1:
Select any case study on building a Data Warehouse/Data Mart. Write a
detailed problem statement and design the dimensional model (creation of
star and snowflake schemas).
Problem Statement:
A hotel chain wants to analyze room occupancy and revenue across its properties. Management needs to answer questions such as: how many rooms are occupied or vacant on a given date, which hotels and room types generate the most revenue, and how occupancy varies by region, city, and season. To support this analysis, a data mart is designed around a central occupancy fact, with hotel, room, customer, time, and city as the dimensions.

Star Schema:
A star schema is a type of data warehouse schema that consists of a central fact table connected to one or more dimension tables. The fact table contains quantitative measures for analysis, while dimension tables hold descriptive attributes that give those measures context.
In this case study, the central fact table is hotel_occupancy_fact, and it connects to the dimension tables hotel, room, customer, time, and city. Here's how the star schema is represented:

Star Schema Representation:

Fact Table:

hotel_occupancy_fact:
hotel_id (FK to hotel)
room_id (FK to room)
customer_id (FK to customer)
date (FK to time)
Measures: no_of_occupied_rooms, no_of_vacant_rooms, revenue

Dimension Tables:

hotel (hotel details: name, region, city, star rating)
room (room details: type, occupants, beds)
customer (customer details: name, address, stay type)
time (date details: day, week, month, year)
city (city details: state, country, zipcode)

Snowflake Schema:
A snowflake schema is a more normalized version of a star schema, where dimension tables are further broken down into related sub-dimensions, leading to a "snowflake" shape. In this design, the city dimension introduces that normalization: it serves as a sub-dimension of hotel, so location attributes are stored once in city instead of being repeated in every hotel row.

Fact Table:
hotel_occupancy_fact: Same as in the star schema.
Dimension Tables:
hotel (dimension) links to:
city (sub-dimension): Contains city_id, region, state, country, zipcode.
room (unchanged)
customer (unchanged)
time (unchanged)
EXPERIMENT 2:
Select any case study on building a Data Warehouse/Data Mart and design a
star schema. Implementation of all dimension tables and the fact table based on the case study.
mysql -u root -p
Enter password: ****
CREATE DATABASE hotel_occupancy;
USE hotel_occupancy;
CREATE TABLE hotel (
    hotel_id INT AUTO_INCREMENT PRIMARY KEY,
    hotel_name VARCHAR(255) NOT NULL,
    rooms INT NOT NULL,
    star_rating DECIMAL(2,1),
    region VARCHAR(255),
    city VARCHAR(255),
    state VARCHAR(255),
    country VARCHAR(255)
);
CREATE TABLE room (
    room_id INT AUTO_INCREMENT PRIMARY KEY,
    room_type VARCHAR(100) NOT NULL,
    max_occupants INT NOT NULL,
    no_of_beds INT NOT NULL,
    ac BOOLEAN DEFAULT FALSE,
    renovation_year YEAR
);
CREATE TABLE customer (
    customer_id INT AUTO_INCREMENT PRIMARY KEY,
    customer_name VARCHAR(255) NOT NULL,
    address TEXT,
    type_of_stay VARCHAR(100),
    check_in DATE NOT NULL,
    check_out DATE,
    amount_paid DECIMAL(10,2)
);
CREATE TABLE time (
    date DATE PRIMARY KEY,
    day_of_week VARCHAR(20),
    day_of_month INT,
    week INT,
    month INT,
    year INT,
    holiday BOOLEAN DEFAULT FALSE
);
CREATE TABLE hotel_occupancy_fact (
    hotel_id INT,
    room_id INT,
    customer_id INT,
    date DATE,
    no_of_occupied_rooms INT,
    no_of_vacant_rooms INT,
    revenue DECIMAL(10,2),
    FOREIGN KEY (hotel_id) REFERENCES hotel(hotel_id),
    FOREIGN KEY (room_id) REFERENCES room(room_id),
    FOREIGN KEY (customer_id) REFERENCES customer(customer_id),
    FOREIGN KEY (date) REFERENCES time(date)
);
CREATE TABLE city (
    city_id INT AUTO_INCREMENT PRIMARY KEY,
    region VARCHAR(100),
    state VARCHAR(100),
    country VARCHAR(100),
    zipcode VARCHAR(20)
);
SHOW TABLES;
DESCRIBE hotel_occupancy_fact;



EXPERIMENT 3:
Implementation of all dimension tables and fact table for any case
study. Implementation of OLAP operations: Slice, Dice & Rollup.
EXP 3 CODE
INSERT INTO hotel (hotel_name, rooms, star_rating, region, city, state, country) VALUES
 ('Seaside Resort', 100, 4.5, 'Coastal', 'Ocean City', 'Maryland', 'USA'),
 ('Mountain Inn', 50, 4.0, 'Mountain', 'Aspen', 'Colorado', 'USA'),
 ('City Center Hotel', 200, 3.8, 'Urban', 'New York', 'New York', 'USA'),
 ('Lakeside Lodge', 75, 4.2, 'Lakeside', 'Lake Tahoe', 'California', 'USA'),
 ('Desert Oasis Hotel', 120, 5.0, 'Desert', 'Palm Springs', 'California', 'USA');

INSERT INTO room (room_type, max_occupants, no_of_beds, ac, renovation_year) VALUES
 ('Single', 1, 1, TRUE, 2020),
 ('Double', 2, 2, TRUE, 2018),
 ('Suite', 4, 2, TRUE, 2019),
 ('Deluxe', 3, 2, FALSE, 2021),
 ('Family', 5, 3, TRUE, 2022);

INSERT INTO customer (customer_name, address, type_of_stay, check_in, check_out, amount_paid) VALUES
 ('Alice Johnson', '123 Main St, Ocean City, MD', 'Vacation', '2023-07-01', '2023-07-07', 1200.00),
 ('Bob Smith', '456 Elm St, Aspen, CO', 'Business', '2023-08-10', '2023-08-15', 800.00),
 ('Charlie Brown', '789 Maple St, New York, NY', 'Vacation', '2023-09-05', '2023-09-12', 1500.00),
 ('Diana Prince', '101 Pine St, Lake Tahoe, CA', 'Family', '2023-06-15', '2023-06-20', 950.00),
 ('Evan Stone', '202 Oak St, Palm Springs, CA', 'Business', '2023-07-25', '2023-07-30', 1100.00);

-- Includes all five stay dates referenced by the fact table below
INSERT INTO time (date, day_of_week, day_of_month, week, month, year, holiday) VALUES
 ('2023-06-15', 'Thursday', 15, 24, 6, 2023, FALSE),
 ('2023-07-01', 'Saturday', 1, 26, 7, 2023, FALSE),
 ('2023-07-02', 'Sunday', 2, 26, 7, 2023, FALSE),
 ('2023-07-25', 'Tuesday', 25, 30, 7, 2023, FALSE),
 ('2023-08-10', 'Thursday', 10, 32, 8, 2023, FALSE),
 ('2023-08-15', 'Tuesday', 15, 33, 8, 2023, FALSE),
 ('2023-09-05', 'Tuesday', 5, 36, 9, 2023, FALSE);


INSERT INTO hotel_occupancy_fact (hotel_id, room_id, customer_id, date, no_of_occupied_rooms, no_of_vacant_rooms, revenue) VALUES
 (1, 1, 1, '2023-07-01', 1, 99, 1200.00),  -- Seaside Resort, Room 1, Alice Johnson
 (1, 2, 1, '2023-07-02', 1, 99, 1200.00),  -- Seaside Resort, Room 2, Alice Johnson
 (2, 1, 2, '2023-08-10', 1, 49, 800.00),   -- Mountain Inn, Room 1, Bob Smith
 (3, 3, 3, '2023-09-05', 1, 199, 1500.00), -- City Center Hotel, Room 3, Charlie Brown
 (4, 4, 4, '2023-06-15', 1, 74, 950.00),   -- Lakeside Lodge, Room 4, Diana Prince
 (5, 5, 5, '2023-07-25', 1, 119, 1100.00); -- Desert Oasis Hotel, Room 5, Evan Stone

INSERT INTO city (region, state, country, zipcode) VALUES
 ('Coastal', 'Maryland', 'USA', '21842'),
 ('Mountain', 'Colorado', 'USA', '81611'),
 ('Urban', 'New York', 'USA', '10001'),
 ('Lakeside', 'California', 'USA', '96150'),
 ('Desert', 'California', 'USA', '92262');

SLICE (fix a single dimension value to cut out a sub-cube, here state = 'California'):
SELECT * FROM hotel WHERE state = 'California' ORDER BY hotel_id;

ROLL UP (aggregate upward along a dimension; WITH ROLLUP adds a grand-total row):
SELECT region, SUM(rooms) AS total_rooms FROM hotel GROUP BY region WITH ROLLUP;

DRILL DOWN (move from a coarser level, state, down to a finer level, city):
SELECT * FROM hotel WHERE state = 'California' AND city = 'Palm Springs';

DICE (select a sub-cube by filtering on two or more dimensions at once):
SELECT * FROM hotel WHERE state = 'California' AND star_rating >= 4.0;

PIVOT (rotate the data: star ratings become rating-band columns, one row per region):
SELECT
    region,
    SUM(CASE WHEN star_rating >= 4.5 THEN 1 ELSE 0 END) AS Five_Star_Band,
    SUM(CASE WHEN star_rating >= 3.5 AND star_rating < 4.5 THEN 1 ELSE 0 END) AS Four_Star_Band,
    SUM(CASE WHEN star_rating < 3.5 THEN 1 ELSE 0 END) AS Three_Star_Band
FROM hotel
GROUP BY region;
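The same pivot can also be sketched in Python with pandas, assuming the hotel rows have been exported into a DataFrame (the DataFrame literal below simply mirrors the INSERT statements; the band edges match the SQL query):

import pandas as pd

# Illustrative copy of the hotel dimension (same rows as the INSERTs above)
hotel = pd.DataFrame({
    "hotel_name": ["Seaside Resort", "Mountain Inn", "City Center Hotel",
                   "Lakeside Lodge", "Desert Oasis Hotel"],
    "region": ["Coastal", "Mountain", "Urban", "Lakeside", "Desert"],
    "star_rating": [4.5, 4.0, 3.8, 4.2, 5.0],
})

# Bucket ratings into the same bands as the SQL pivot, then rotate:
# regions become rows, rating bands become columns
hotel["band"] = pd.cut(hotel["star_rating"], bins=[0, 3.5, 4.5, 5.5], right=False,
                       labels=["Three_Star_Band", "Four_Star_Band", "Five_Star_Band"])
pivot = hotel.pivot_table(index="region", columns="band", values="hotel_name",
                          aggfunc="count", fill_value=0, observed=False)
print(pivot)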

EXPERIMENT 4:
Write about any one classification algorithm and implement a classifier algorithm using the open source tool Weka.

Step 1: Open the Weka Explorer. Under the Preprocess tab, click "Open file...", go to the Weka folder, then the data folder, and select the diabetes dataset. Then, under Filter, choose AttributeSelection (Choose > filters > supervised > attribute > AttributeSelection) and press Apply; only the main attributes of the dataset are retained.
Step 2: Go to the Classify tab and choose the Logistic classifier (Choose > classifiers > functions > Logistic).
Step 3: Under Test options, select Percentage split and set it to 80%. Then open "More options...", change "Output predictions" to Plain text, and click OK.
Step 4: Select (Nom) class in the dropdown just below "More options...".
Step 5: Click Start.

The predictions appear at the end of the output as a 2x2 confusion matrix, with the evaluation summary just above it.

The logistic classifier algorithm is a supervised learning algorithm used for binary classification tasks. It is based on the logistic function (also known as the sigmoid function), which outputs values between 0 and 1. This makes it ideal for predicting the probability of an event occurring, such as whether an email is spam or not, or if a customer will make a purchase.

In logistic regression, the algorithm takes a set of input features (independent variables) and applies a linear combination of these features, just as in linear regression. Instead of predicting a continuous value, however, it passes the result z of the linear combination through the logistic (sigmoid) function:
σ(z) = 1 / (1 + e^(-z))
This transforms the output into a probability between 0 and 1. The output is then classified into one of two categories, usually with a threshold of 0.5: if the probability is greater than or equal to 0.5, the algorithm predicts class 1 (the positive class); otherwise it predicts class 0 (the negative class).

Logistic regression models are trained by optimizing a cost function called the "log-loss" or "cross-entropy loss," which measures the difference between the predicted probabilities and the actual class labels. The parameters of the model are updated using optimization techniques like gradient descent, which aims to minimize the error between the predicted and actual values. Despite its simplicity, the logistic classifier is widely used in various fields such as finance, healthcare, and marketing for classification tasks.
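As a minimal sketch of this training loop outside Weka (assuming only NumPy and a tiny made-up one-feature dataset), the snippet below fits a logistic model by gradient descent on the log-loss:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical dataset: one feature, binary label
X = np.array([[0.5], [1.5], [2.0], [3.0], [3.5], [4.5]])
y = np.array([0, 0, 0, 1, 1, 1])

# Add a bias column of ones so the intercept is learned as a weight
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
w = np.zeros(Xb.shape[1])

lr = 0.1  # learning rate
for _ in range(5000):
    p = sigmoid(Xb @ w)             # predicted probabilities
    grad = Xb.T @ (p - y) / len(y)  # gradient of the mean log-loss
    w -= lr * grad                  # gradient descent step

print("weights:", w)
prob = sigmoid(np.array([1.0, 2.4]) @ w)   # P(class = 1 | x = 2.4)
print("probability:", prob, "-> predicted class:", int(prob >= 0.5))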


EXPERIMENT 5:
Write about any one association rule mining algorithm and implement association mining algorithms using the open source tool Weka.

Step 1: Open the Weka Explorer. Under the Preprocess tab, click "Open file...", go to the Weka folder, then the data folder, and select the breast-cancer dataset. Then, under Filter, choose AttributeSelection (Choose > filters > supervised > attribute > AttributeSelection) and press Apply; only the main attributes of the dataset are retained.
Step 2: In the attribute list, select the class attribute.
Step 3: Go to the Associate tab.
Step 4: Click Choose, select Apriori, set numRules to 3, and click Start.

Apriori Algorithm in Data Mining (DWM - Data Warehousing and Mining)
The Apriori algorithm is a classic algorithm used in association rule mining. It helps find frequent itemsets in large datasets and generate association rules that highlight how frequently items co-occur. Apriori is particularly useful in market basket analysis, where it can uncover relationships between different products frequently bought together.
Key Terminology:
Itemset: A set of items (e.g., {Milk, Bread}).
Support: The frequency of an itemset in the dataset. For example, if {Milk, Bread} appears in 3 out of 10 transactions, its support is 30%.
Support(X) = (Frequency of transactions containing X) / (Total number of transactions)
Confidence: A measure of the reliability of an association rule. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X.
Confidence(X → Y) = Support(X ∪ Y) / Support(X)
For example, if {Milk} appears in 5 transactions and {Milk, Bread} appears in 3 of them, Confidence(Milk → Bread) = 3/5 = 60%.
Association Rule: A rule that implies a relationship between two itemsets, such as:
X → Y (If X, then Y with confidence and support)
Steps of the Apriori Algorithm:
Generate Candidate Itemsets:
Start with the candidate itemsets of size 1 (single items).
Filter out itemsets whose support is below the minimum support threshold.
Generate Frequent Itemsets:
Extend the frequent itemsets by combining them to create larger itemsets (size 2, 3, etc.).
Prune any itemsets that are not frequent.
Generate Association Rules:
Once frequent itemsets are identified, generate association rules using the minimum confidence threshold.
Repeat:
Continue the process iteratively, generating larger itemsets until no more frequent itemsets can be formed (see the sketch below).
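A minimal pure-Python sketch of these steps, assuming a hypothetical set of market-basket transactions and a 50% minimum support threshold:

# Hypothetical transactions (market baskets)
transactions = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Bread"},
    {"Bread", "Butter"},
    {"Milk", "Bread", "Butter"},
    {"Milk", "Eggs"},
]
min_support = 0.5  # an itemset must appear in at least 50% of transactions

def support(itemset):
    # Support(X) = (transactions containing X) / (total transactions)
    return sum(itemset <= t for t in transactions) / len(transactions)

# Frequent 1-itemsets: count single items, prune by minimum support
items = {item for t in transactions for item in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Join frequent (k-1)-itemsets into k-itemsets, prune, and repeat
k = 2
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level, itemsets in enumerate(frequent, start=1):
    for s in itemsets:
        print(f"size {level}: {set(s)} support = {support(s):.0%}")

# Rule generation example: Confidence(Milk -> Bread) = Support({Milk, Bread}) / Support({Milk})
print("Confidence(Milk -> Bread) =",
      support(frozenset(["Milk", "Bread"])) / support(frozenset(["Milk"])))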


EXPERIMENT 6:
Write about any one clustering algorithm and implement clustering algorithms using open source tools.

Step 1: Open the Weka Explorer. Under the Preprocess tab, click "Open file...", go to the Weka folder, then the data folder, and select the diabetes dataset. Then, under Filter, choose AttributeSelection (Choose > filters > supervised > attribute > AttributeSelection) and press Apply; only the main attributes of the dataset are retained.
Step 2: In the attribute list, select the class attribute.
Step 3: Go to the Cluster tab, click Choose under Clusterer, and select SimpleKMeans. Click its name to set the number of clusters (based on the number of attributes; here, 3) and select "Use training set" as the cluster mode. Ignore the class attribute and click Start.
Step 4: Then switch the cluster mode to "Classes to clusters evaluation", again ignore the class attribute, and click Start.

K-Means Algorithm in Data Mining (DWM - Data Warehousing and Mining)
The K-Means algorithm is a widely used clustering technique in data mining. It partitions a dataset into a predefined number of K clusters. The goal is to group data points that are similar to each other into the same cluster, while maximizing the differences between clusters. K-Means is commonly used in market segmentation, pattern recognition, and image compression.
Key Terminology:
Cluster: A group of data points that share similar features.
Centroid: The center of a cluster, calculated as the mean of all points in the cluster.
Euclidean Distance: A measure of the distance between two data points. In a 2D space, it’s calculated as:
Distance = √[(x2 - x1)² + (y2 - y1)²]
Objective: Minimize the sum of squared distances between each point and its assigned cluster’s centroid.
Steps of the K-Means Algorithm:
Choose the Number of Clusters (K):
The user specifies the number of clusters, K, they want to group the data into.
Initialize Centroids:
Randomly select K points from the dataset as the initial centroids.
Assign Data Points to Nearest Centroid:
Calculate the Euclidean distance between each data point and the K centroids.
Assign each point to the closest centroid, forming K clusters.
Update Centroids:
For each cluster, recalculate the centroid as the mean of all the data points in that cluster.
The centroid is the average position of all points in the cluster.
Repeat:
Reassign points to the nearest centroid based on the updated centroids.
Update the centroids again.
Repeat this process until the centroids no longer change (convergence) or a maximum number of iterations is reached (see the sketch below).
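A minimal scikit-learn sketch of these steps on a small made-up 2-D dataset (Experiment 8 below implements the same loop from scratch in plain Python):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming three loose groups
X = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # group near (1, 1)
    [5.0, 5.2], [5.1, 4.9], [4.8, 5.0],   # group near (5, 5)
    [9.0, 1.0], [9.2, 1.1], [8.9, 0.8],   # group near (9, 1)
])

# K = 3 clusters; n_init controls how many random centroid initializations are tried
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print("labels:", km.labels_)                # cluster assignment of each point
print("centroids:\n", km.cluster_centers_)  # mean of the points in each cluster
print("inertia:", km.inertia_)              # sum of squared distances to centroids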




EXPERIMENT 7:
Implementation of any one classifier using a language like Java/Python.
# Assigning features and label variables 
Weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 
           'Overcast', 'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 
           'Overcast', 'Rainy'] 
Temp = ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 
        'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'] 
Play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 
        'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'] 
# Import LabelEncoder 
from sklearn import preprocessing 
# Create a LabelEncoder (assigns integer codes to labels alphabetically)
le = preprocessing.LabelEncoder()

# Converting string labels into numbers 
weather_encoded = le.fit_transform(Weather)  # Overcast=0, Rainy=1, Sunny=2
temp_encoded = le.fit_transform(Temp)        # Cool=0, Hot=1, Mild=2
label = le.fit_transform(Play)               # No=0, Yes=1
# Note: le is re-fit on each list; its final fit (on Play) is what lets
# inverse_transform at the end decode the prediction back to Yes/No.

print("Weather Encoded:", weather_encoded) 
print("Temp Encoded:", temp_encoded) 
print("Play Encoded:", label) 

# Combining weather and temp into single list of tuples
features = list(zip(weather_encoded, temp_encoded)) 
print("Features:", features) 

# Import Gaussian Naive Bayes model 
from sklearn.naive_bayes import GaussianNB 

# Create a Gaussian Classifier 
model = GaussianNB() 

# Train the model using the training sets 
model.fit(features, label) 

# Predict Output 
predicted = model.predict([[0, 2]])  # 0: Overcast, 2: Mild 
print("Predicted Value:", le.inverse_transform(predicted))  # Decode back to original labels


EXPERIMENT 8:
Implementation of any one clustering algorithm using a language like Java/Python.
import math

def main():
    numbers = int(input("How many numbers you want to enter: "))
    k = int(input("Enter value of k: "))

    # Get numbers
    nums = []
    for i in range(numbers):
        num = int(input(f"Enter Number {i+1}: "))
        nums.append(num)

    # Set initial values for kvals
    kvals = nums[:k]  # Take the first k elements as initial centroids
    prev_kvals = [0] * k
    steps = 1
    addition = [[0 for _ in range(numbers)] for _ in range(k)]
    groups = [[None for _ in range(numbers)] for _ in range(k)]

    # Show values entered by the user
    print("You have entered:", ', '.join(map(str, nums)))

    # While loop for iterations
    ok = True
    while ok:
        print(f"\nIteration Number: {steps}")

        # Compute the absolute distance from each centroid to each number
        for i in range(k):
            for j in range(numbers):
                addition[i][j] = abs(kvals[i] - nums[j])

        # Make groups of numbers
        for i in range(numbers):
            min_val = math.inf
            groupnum = -1
            value = nums[i]
            for j in range(k):
                if addition[j][i] < min_val:
                    min_val = addition[j][i]
                    groupnum = j
            groups[groupnum][i] = value

        # Show results of calculations
        print("\nCalculations")
        for i in range(numbers):
            for j in range(k):
                print(f"{addition[j][i]}\t", end="")
            print()

        # Show groups and get new kvals (centroids)
        print("\nGroups")
        for i in range(k):
            group_values = []
            sum_vals = 0
            count = 0
            print(f"Group {i+1}: ", end="")
            for j in range(numbers):
                if groups[i][j] is not None:
                    print(f"{groups[i][j]}\t", end="")
                    sum_vals += groups[i][j]
                    count += 1
            prev_kvals[i] = kvals[i]
            if count > 0:
                kvals[i] = sum_vals // count
            print(f"=\t{kvals[i]}")

        # Empty the groups
        groups = [[None for _ in range(numbers)] for _ in range(k)]

        # Check condition of termination
        ok = False
        for i in range(k):
            if prev_kvals[i] != kvals[i]:
                ok = True

        steps += 1

if __name__ == "__main__":
    main()




EXPERIMENT 7 (Summary): Classification using Gaussian Naive Bayes

The Gaussian Naive Bayes classifier is based on Bayes' Theorem and assumes that features follow a normal (Gaussian) distribution. It is commonly used for classification tasks where the input features are continuous. In this experiment, weather conditions and temperature data are used to predict whether to play a sport or not. The features, such as "Weather" and "Temp," are encoded into numerical form using LabelEncoder. The model is trained on this dataset, and then a prediction is made for specific input conditions. Gaussian Naive Bayes is known for its simplicity, efficiency, and effective handling of continuous features.
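A minimal sketch of the Gaussian scoring behind this classifier (NumPy only; the priors match the 9 Yes / 5 No split of the Play data, while the means and variances are made-up placeholders that would normally be estimated from the training set):

import numpy as np

def gaussian_pdf(x, mean, var):
    # Gaussian likelihood P(x | class) for one continuous feature
    return np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical per-class statistics for a single feature (e.g. encoded Temp)
stats = {
    "Yes": {"prior": 9 / 14, "mean": 1.2, "var": 0.6},
    "No":  {"prior": 5 / 14, "mean": 0.8, "var": 0.7},
}

x = 2.0  # feature value to classify
# Score each class as prior * likelihood, then pick the highest score
scores = {c: s["prior"] * gaussian_pdf(x, s["mean"], s["var"]) for c, s in stats.items()}
print(scores)
print("predicted:", max(scores, key=scores.get))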

EXPERIMENT 8 (Summary): Clustering using K-Means

The K-Means clustering algorithm partitions data into K clusters by iteratively adjusting the centroids of each cluster. The algorithm begins by selecting K random centroids, assigns data points to the nearest centroid, and recalculates centroids until convergence. This algorithm is widely used in data mining tasks like customer segmentation, pattern recognition, and image compression. In this experiment, a set of numerical data is grouped into clusters based on their distance from the centroids. Each iteration reassigns points and recalculates centroids until the centroids stabilize.