

Weather Prediction with Machine Learning




Eddy Ejembi

Author



Introduction

In today's world, accurate weather predictions are essential for planning daily activities, travel, and even long-term decision-making. Leveraging the power of Python and machine learning, I embarked on a journey to create a weather prediction project that offers insights into the future weather conditions of various cities. This project not only showcases the capabilities of machine learning but also provides a practical application of AI in our daily lives.


The Project at a Glance

This weather prediction project enables users to forecast weather conditions for specific cities and dates. It's user-friendly and accessible through a web-based interface. Users can select their desired city from a list, choose a date, and receive accurate predictions for that day's weather. The available cities for prediction include Lagos, Port Harcourt, Kano, Abuja, Ibadan, and Ota.

Demo of project


Data Acquisition

The first step in this project was to acquire the data. The data was extracted from WeatherAPI.com. The features extracted from the API were:

  • The temperature (in degrees Celsius)
  • The humidity
  • The wind speed (in km/h)
  • Precipitation (in mm)
  • Atmospheric Pressure
  • Visibility
  • Dew Point
  • Wind Gust
  • Cloud Cover (%)
  • UV Index
  • Condition

To extract these features, I specified the locations and dates I wanted to pull from the API, looping through the six cities and passing each city's latitude and longitude as parameters. I extracted data for every hour from 2022-06-01 to 2023-05-12. After extracting these features, I saved the data in a CSV file so I could perform exploratory data analysis on it before training the model. The file contained a total of 49,687 rows, one row for each city and hour extracted from the API.

import requests
import pandas as pd

# WeatherAPI.com API endpoint for historical weather data
url = "https://api.weatherapi.com/v1/history.json"

# Your WeatherAPI.com API key
api_key = "<API_KEY>"

# List of cities to obtain weather data for
cities = ["Lagos", "Port Harcourt", "Kano", "Abuja", "Ibadan", "Ota"]

# Date range for which to obtain historical weather data
start_date = "2022-06-01"
end_date = "2023-05-12"

# List to store weather data for each city
city_data = []

# Loop through cities and obtain weather data for each city and date
for city in cities:
    # Latitude and longitude for the city
    if city == "Lagos":
        lat, lon = "6.5244", "3.3792"
    elif city == "Port Harcourt":
        lat, lon = "4.8156", "7.0498"
    elif city == "Kano":
        lat, lon = "12.0022", "8.5927"
    elif city == "Abuja":
        lat, lon = "9.0579", "7.4951"
    elif city == "Ibadan":
        lat, lon = "7.3776", "3.9470"
    elif city == "Ota":
        lat, lon = "6.6804", "3.2356"

    # List to store weather data for the city
    city_data_temp = []
    
    # Loop through dates and hours and obtain weather data for each hour
    for date in pd.date_range(start=start_date, end=end_date, freq="H"):
        # Format date as string
        date_str = date.strftime("%Y-%m-%d %H:%M:%S")
        
        # Parameters for API request
        params = {"key":api_key, "q": lat + "," + lon, "dt":date_str}
        
        # Make API request
        response = requests.get(url, params=params, verify=False)
        
        # Parse JSON data
        data = response.json()
        
        # Extract relevant weather features from data
        weather_features = {
            "city": city,
            "date": date_str,
            "temp_c": data["forecast"]["forecastday"][0]["hour"][0]["temp_c"],
            "humidity": data["forecast"]["forecastday"][0]["hour"][0]["humidity"],
            "wind_kmph": data["forecast"]["forecastday"][0]["hour"][0]["wind_kph"],
            "precip_mm": data["forecast"]["forecastday"][0]["hour"][0]["precip_mm"],
            "Atmospheric Pressure": data["forecast"]["forecastday"][0]["hour"][0]["pressure_mb"],
            "Visibility": data["forecast"]["forecastday"][0]["hour"][0]["vis_km"],
            "Dew Point": data["forecast"]["forecastday"][0]["hour"][0]["dewpoint_c"],
            "Wind Gust": data["forecast"]["forecastday"][0]["hour"][0]["wind_kph"],
            "Cloud Cover (%)": data["forecast"]["forecastday"][0]["hour"][0]["cloud"],
            "UV Index": data["forecast"]["forecastday"][0]["day"].get("uv", "N/A"),
            "condition": data["forecast"]["forecastday"][0]["hour"][0]["condition"]["text"]
        }

        # Append weather features to city data list
        city_data_temp.append(weather_features)
    
    # Append city data to overall city data list
    city_data += city_data_temp

# Convert city data list to Pandas DataFrame
weather_df = pd.DataFrame(city_data)
print(weather_df.head(10))

# Save DataFrame to CSV file
weather_df.to_csv("weather_data.csv", index=False)


Data Preprocessing

After extracting and saving the data, the next step was to preprocess it and put it in a format suitable for training. The data (in csv format) was loaded as a DataFrame using Pandas.

import pandas as pd
#Read and display the data
data = pd.read_csv("weather_data.csv")

The first step in processing the data was to get general information about it and to check which weather conditions are present in the dataset, since the condition is the target feature to be predicted. The conditions available in the dataset, along with their counts, can be inspected as shown below.
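A minimal sketch of this inspection step, assuming the DataFrame loaded above (the original notebook's exact calls are not shown here):

# Overview of the dataset: column names, dtypes, and non-null counts
data.info()

# Count how many rows fall under each weather condition (the target feature)
print(data["condition"].value_counts())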


Encoding Categorical Data

I used LabelEncoder, a scikit-learn preprocessing module, to convert the categorical columns to numerical ones. This was done because machine learning models require numerical inputs, and training a model on categorical columns would yield poor model performance.

#Import the Label Encoder library
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the "city" column
data['city_encoded'] = label_encoder.fit_transform(data['city'])

# Fit and transform the "condition" column
data['condition_encoded'] = label_encoder.fit_transform(data['condition'])
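One detail worth noting: because a single LabelEncoder instance is refit on the second column, its internal mapping for the city column is overwritten. Below is a small sketch of keeping one encoder per column (my assumption, not necessarily how the original notebook handled it), so that both columns can be decoded later for display in the web app:

from sklearn.preprocessing import LabelEncoder

# One encoder per column so that inverse_transform works for both later on
city_encoder = LabelEncoder()
condition_encoder = LabelEncoder()

data['city_encoded'] = city_encoder.fit_transform(data['city'])
data['condition_encoded'] = condition_encoder.fit_transform(data['condition'])

# Later (e.g. in the web app) the encoded values can be decoded back to text
print(condition_encoder.inverse_transform(data['condition_encoded'][:5]))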


Time Transformation

To capture the nature of time-related data (months and hours), I converted the date-time column, which contains the date and time (in 24-hour format), to a datetime field using pandas and created new columns to hold the year, month, day, and hour. After that, I performed cyclical encoding on the month and hour fields.

When encoding the month feature, it's important to consider that months follow a cyclical pattern where December is followed by January, forming a circular relationship. If we encode the months as simple numerical values (e.g., 1 for January, 2 for February, and so on), it may introduce a linear relationship between the months, which could lead to incorrect interpretations by the model.

By applying sine and cosine transformations, we can capture the cyclical nature of these features in a continuous and meaningful way. The sine and cosine functions have periodic properties, which allow them to represent cyclic patterns. When encoding the month or hour, I mapped them onto the unit circle by converting them to radians and applying sine and cosine transformations. This results in two new features that capture the cyclic information: one representing the position along the circle (e.g., the sine of the angle) and the other representing the orientation (e.g., the cosine of the angle).

This cyclical encoding ensures that the model can learn the cyclical patterns and relationships in the data, enabling it to make accurate predictions. It helps prevent issues such as linear assumptions or incorrect ordering of the months or hours. I created new columns to add the cosine and sine transformation of the month and hour.

import numpy as np

# Convert date column to datetime type
data['date'] = pd.to_datetime(data['date'])

# Extract useful features (year, month, day & hour)
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month
data['day'] = data['date'].dt.day
data['hour'] = data['date'].dt.hour

# Perform cyclical encoding on month and hour features
data['month_sin'] = np.sin(2 * np.pi * data['month'] / 12)
data['month_cos'] = np.cos(2 * np.pi * data['month'] / 12)
data['hour_sin'] = np.sin(2 * np.pi * data['hour'] / 24)
data['hour_cos'] = np.cos(2 * np.pi * data['hour'] / 24)


Feature Scaling

Next, StandardScaler, a scikit-learn module, was applied to normalize the features, bringing them to a common scale. The features to scale were: ["Atmospheric Pressure", "Cloud Cover (%)", "Dew Point", "UV Index", "Visibility", "Wind Gust", "humidity", "precip_mm", "temp_c", "wind_kmph"].

Scaling the features helps ensure that they have a similar range and magnitude, which can improve the convergence and performance of the neural network. The method used to scale the data is called Standardization. This method transforms the features to have zero mean and unit variance. It subtracts the mean from each feature and divides by the standard deviation. The formula for standardization is:

X_scaled = (X - mean(X)) / std(X)

Scaling or normalizing the features helps the neural network learn more effectively and avoids issues such as unequal weight updates or the dominance of certain features due to their larger scales.

#Normalizing the data using standardization
from sklearn.preprocessing import StandardScaler

#Features to be standardized
features_to_scale = ["Atmospheric Pressure", "Cloud Cover (%)", "Dew Point", "UV Index",
                     "Visibility", "Wind Gust", "humidity", "precip_mm", "temp_c", "wind_kmph"]

#StandardScaler object
scaler = StandardScaler()

# Fit the scaler on the selected features
scaler.fit(data[features_to_scale])

# Transform the selected features using the scaler
data[features_to_scale] = scaler.transform(data[features_to_scale])

After scaling and processing the data, I created a new DataFrame from the existing one, dropped the city and condition columns, and saved it to a new CSV file.

#Create the training DataFrame without the original categorical columns
training_data = data.drop(columns=['city', 'condition'])

#Saving the new DataFrame to a CSV file
training_data.to_csv('Training_data.csv', index=False)


Model Building

After acquiring and preprocessing the data, the next step was to build the model. The framework used was TensorFlow, a deep learning framework, accessed through its Keras API. I first experimented with a deep learning architecture based on the Long Short-Term Memory (LSTM) algorithm. The reason for choosing this algorithm is that LSTM is specifically designed to handle sequential or time-dependent data: it can capture patterns and dependencies over time, which is beneficial for predicting weather conditions from historical hourly data. LSTM models can learn and retain information from previous time steps, allowing them to capture long-term dependencies and make predictions based on the temporal patterns in the data. While training the network, I used the Adaptive Moment Estimation (Adam) optimizer, with ReLU and sigmoid activation functions in different layers. I also used an early-stopping callback to halt training once the validation loss stopped improving.

#import required libraries
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, BatchNormalization, Flatten
from keras.callbacks import EarlyStopping
"""
...Some processing steps
"""
# Build the LSTM model Architecture
model_dense = Sequential()
model_dense.add(LSTM(units=64, return_sequences=True, input_shape=(1, X_train2.shape[2])))
model_dense.add(Dropout(0.2))
model_dense.add(LSTM(units=64, return_sequences=True))
model_dense.add(Dropout(0.2))
model_dense.add(LSTM(units=64, return_sequences=True))
model_dense.add(Dropout(0.2))
model_dense.add(Flatten())  # flatten the sequence output before the dense layers
model_dense.add(Dense(units=32, activation='relu'))
model_dense.add(Dense(units=1, activation='sigmoid'))

import tensorflow as tf
adam = tf.keras.optimizers.Adam(learning_rate=0.001)

model_dense.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])

# Define EarlyStopping callback
early_stopping = EarlyStopping(patience=10, monitor='val_loss', restore_best_weights=True)

history = model_dense.fit(X_train2, y_train, batch_size=32, epochs=100, validation_data=(X_test2, y_test), callbacks=[early_stopping])

After training, the deep learning model achieved an accuracy of 54% on the test data.

Not satisfied with this result, I decided to experiment with other traditional machine learning algorithms. The algorithms chosen were:

  • Logistic regression
  • XGBoost Classifier
  • Support Vector Classifier (SVC)
  • KNeighbor Classifier
  • AdaBoost Classifier
  • GridSearchCV (hyperparameter tuning with the AdaBoost Classifier as its estimator)


Logistic Regression

Using the Logistic Regression algorithm, I built another model which achieved an accuracy of 87% on the test data. The model performed quite well on the test data.


XGBoost Classifier

I then built another model with the XGBoost Classifier algorithm. This model achieved an accuracy of 100% on the test data. A perfect score like this is suspicious, and it led to the conclusion that the model had most likely overfit.
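A quick way to sanity-check a suspiciously perfect score is cross-validation on the training data. This is a sketch of such a check (not part of the original notebook), assuming the X_train and y_train arrays used above:

from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# 5-fold cross-validation gives a more honest estimate than a single train/test split
scores = cross_val_score(XGBClassifier(), X_train, y_train, cv=5, scoring='accuracy')
print("Cross-validated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))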


Support Vector Classifier

Experimenting with the Support Vector Classifier, the model trained using this algorithm achieved an accuracy of 87% on the test data. This model performed quite well too on the test data.


KNeighborsClassifier

Using the KNeighbors Classifier algorithm, the model built achieved an accuracy of 88% on the test data. The model performed well on the test data.


AdaBoost Classifier

The AdaBoost Classifier model achieved an accuracy of 64% on the test data. The model did not perform too well.

#import required libraries
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn import metrics
from mlxtend.plotting import plot_confusion_matrix

#Load machine learning algorithms
models = [LogisticRegression(), XGBClassifier(), SVC(kernel='rbf', probability=True), KNeighborsClassifier(n_neighbors=5), AdaBoostClassifier(n_estimators=100, random_state=42)]

for i in range(5):
  models[i].fit(X_train, y_train)

  print(f'{models[i]} : ')

  train_preds = models[i].predict(X_train)
  print('Training Accuracy : ', metrics.accuracy_score(y_train, train_preds))

  val_preds = models[i].predict(X_test)
  print('Validation Accuracy : ', metrics.accuracy_score(y_test, val_preds))
  print()
  print("Classification Report:")
  print(metrics.classification_report(y_test, val_preds))


GridSearchCV

The GridSearchCV model, which uses the AdaBoost Classifier as its estimator, achieved an accuracy of 64% on the test data. The model did not perform very well.

from sklearn.model_selection import GridSearchCV
parameters = {
    'learning_rate': [1, 2, 3],
    'n_estimators': [100, 500, 1000]
}
cv = GridSearchCV(models[4], param_grid=parameters, scoring='f1_micro', n_jobs=-1, verbose=3)
cv.fit(X_train, y_train)

train_preds = cv.predict(X_train)
print('Training Accuracy : ', metrics.accuracy_score(y_train, train_preds))

val_preds = cv.predict(X_test)
print('Validation Accuracy : ', metrics.accuracy_score(y_test, val_preds))
print()
print("Classification Report:")
print(metrics.classification_report(y_test, val_preds))
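After the search finishes, it is worth inspecting which hyperparameter combination won. A small follow-up snippet (my addition, assuming the fitted cv object above):

# Best hyperparameter combination found by the grid search and its cross-validated score
print("Best parameters:", cv.best_params_)
print("Best CV score:", cv.best_score_)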

After observing the performance of the models, I decided to employ another technique called ensembling, in the hope of achieving a better-performing model.

I tried two ensembling methods: bagging ensembling and voting ensembling.


Bagging (Bootstrap Aggregating) Ensembling

In classic bagging, multiple models are trained on different subsets of the training data (randomly sampled with replacement), and their predictions are averaged to produce the final prediction.

In this case, I averaged the predictions of the already-trained deep learning model, Logistic Regression model, Support Vector Classifier model, and KNeighborsClassifier model. Strictly speaking, this is closer to simple model averaging than to textbook bagging, since the models are different algorithms trained on the same data rather than copies of one algorithm trained on bootstrap samples (a sketch of classic bagging follows the code below).

#making predictions
# A single preprocessed sample row (same feature order as the training data)
sample = [[-0.342235, -0.433656, 0.854212, -0.163094, 0.327851, 0.924915, 0.264319,
           -0.245888, 1.611356, 0.924915, 4, 5, 11, 23, 0.5, -0.866025, -0.258819, 0.965926]]

log_pred = models[0].predict(sample)
svc_pred = models[2].predict(sample)
knn_pred = models[3].predict(sample)
dnn = model_dense.predict(deep_test)

# Average the individual predictions to get the ensemble output
ensemble_predictions = np.mean([log_pred, svc_pred, knn_pred, dnn], axis=0)

After this averaging, the results were reasonable and better than those of most of the individually trained models.
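For comparison, here is a minimal sketch of what classic bootstrap aggregating looks like with scikit-learn's BaggingClassifier. This was not part of the original pipeline, and the parameters are illustrative assumptions:

from sklearn.ensemble import BaggingClassifier
from sklearn import metrics

# Train 50 copies of the default base estimator (a decision tree), each on a
# bootstrap sample of the training data, and vote over their predictions
bagging = BaggingClassifier(n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)

bag_preds = bagging.predict(X_test)
print("Bagging accuracy:", metrics.accuracy_score(y_test, bag_preds))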


Voting Ensembling

Voting combines the predictions of multiple models and selects the most common prediction (for classification problems) or averages the predictions (for regression problems). This involves letting each model make a prediction and then choosing the prediction that receives the most votes.

I applied the voting ensembling method in building a new model from the already existing models. The models used were: Logistic Regression model, Support Vector Classifier model, and KNeighborsClassifier model.

After building this model, it achieved a 90% accuracy on the test data. When tested on other data it predicted correctly most times.

from sklearn.ensemble import VotingClassifier

# Create the ensemble by specifying the models and voting strategy
ensemble = VotingClassifier(
    estimators=[
        ('logistic_regression', models[0]),
        ('svc', models[2]),
        ('knn', models[3])
    ],
    voting='hard'  # Use 'hard' voting for majority voting
)

# Fit the ensemble model on the training data
ensemble.fit(X_train, y_train)

# Make predictions with the ensemble model
ensemble_predictions = ensemble.predict(X_test)

# Evaluate the ensemble predictions
accuracy = metrics.accuracy_score(y_test, ensemble_predictions)
classification_report = metrics.classification_report(y_test, ensemble_predictions)

print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report)
print("Confusion Matix: ")


Observations

The Voting Ensemble model appears to be the best-performing model so far. The other models on their own did not perform especially well, particularly the deep learning, XGBoost, AdaBoost, and GridSearchCV models. However, Logistic Regression, the Support Vector Classifier, and KNeighborsClassifier performed quite well. In conclusion, for a weather forecasting and prediction project like this one, it is advisable to use traditional machine learning algorithms or an ensemble technique rather than deep learning.

With all this, I saved the models as pickle files so they can be deployed and used on the web app.
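A minimal sketch of that saving step; the file names, and whether the scaler was pickled alongside the models, are my assumptions rather than details from the original code:

import pickle

# Persist the voting ensemble and the fitted scaler so the web app can reuse them
with open("ensemble_model.pkl", "wb") as f:
    pickle.dump(ensemble, f)

with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)

# In the web app, the objects are loaded back the same way
with open("ensemble_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)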


Deployment

After building the machine learning models, the next step was deployment. For deployment I used Streamlit. Streamlit is a free and open-source framework to rapidly build and share beautiful machine learning and data science web apps. It is a Python-based library specifically designed for machine learning engineers.

In building the web app for deployment, I extracted the forecast weather for the coming days, collecting the same features that were used to train the model. After collecting this data, I processed it into the format the model required, using the same methods as earlier, and sent it to the model to make predictions. Once the model had made its predictions for a day, I created a DataFrame containing every hour of that day along with the prediction made for it. I then divided the day into sections: Night (00:00-05:00), Morning (06:00-11:00), Afternoon (12:00-17:00), and Evening (18:00-23:00), and calculated the mean of all the numerical columns in each section to produce a general prediction for that section. Next, I decoded the city and condition values that had been encoded earlier for model training. Finally, I displayed the prediction for each section of the day, along with the DataFrame of hourly predictions, on the web app. A sketch of the sectioning step is shown below.
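A rough sketch of how such a sectioning step can be done with pandas; the column names and example values here are illustrative assumptions, not the app's exact code:

import pandas as pd

# Illustrative hourly forecast frame: one row per hour with a couple of numeric columns
day_df = pd.DataFrame({
    "hour": range(24),
    "temp_c": [26 + h * 0.1 for h in range(24)],
    "humidity": [80 - h for h in range(24)],
})

def section_of_day(hour: int) -> str:
    # Bucket each hour of the day into a named section
    if hour <= 5:
        return "Night"
    elif hour <= 11:
        return "Morning"
    elif hour <= 17:
        return "Afternoon"
    return "Evening"

day_df["section"] = day_df["hour"].apply(section_of_day)

# Average the numeric columns within each section to get one summary row per section
section_summary = day_df.groupby("section").mean(numeric_only=True)
print(section_summary)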

💡 Note: Users can only select the cities and dates available in the options. For dates, the API's free plan only allows access to a 3-day forecast; a paid plan gives access to more forecast days.

After building the web app, I deployed it on the Streamlit cloud, which is a free cloud deployment platform for Streamlit apps.


Conclusion

This weather prediction project demonstrates how Python, machine learning, and ensemble techniques can deliver precise weather forecasts. By blending tried-and-true machine learning methods and thoughtful ensembling, I created a dependable weather prediction tool. This project illustrates how AI can improve everyday decision-making.

The web app can be accessed through this URL: https://weather-prediction.streamlit.app.

Access the code here: EddyEjembi/Weather-Prediction-Project



