How to Train a Decision Tree Classifier… In SQL

SQL can now replace Python for most supervised ML tasks. Should you make the switch?

Dario Radečić

Published in

Towards Data Science

8 min read

10 hours ago

—

Photo by Resource Database on Unsplash

When it comes to machine learning, I’m an avid fan of attacking data where it lives. 90%+ of the time, that’s going to be a relational database, assuming we’re talking about supervised machine learning.

Python is amazing, but pulling dozens of GB of data whenever you want to train a model is a huge bottleneck, especially if you need to retrain them frequently. Eliminating data movement makes a lot of sense. SQL is your friend.

For this article, I’ll use an always-free Oracle Database 21c provisioned on Oracle Cloud. I’m not sure if you can translate the logic to other database vendors. Oracle works like a charm, and the database you provision won’t cost you a dime — ever.

Dataset Loading and Preprocessing

I’ll leave the Python vs. Oracle for machine learning on huge dataset comparison for some other time. Today, it’s all about getting back to basics.

I’ll use the following dataset today:

Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. University of California, Irvine, School of Information and Computer Sciences. Retrieved from https://archive.ics.uci.edu/ml/datasets/iris. Licensed under Creative Commons Attribution 4.0 license.

So download it to follow along, and make sure you have a connection established to your database instance. Tools like SQL Developer or Visual Studio Code can do that for you.

How to Create a Database Table

The following code snippet will create the iris table. The id column is mandatory, as Oracle will need it behind the scenes:

create sequence seq_iris;create table iris(
    id number default seq_iris.nextval,
    sepal_length number(2, 1),
    sepal_width number(2, 1),
    petal_length number(2, 1),
    petal_width number(2, 1),
    species varchar2(15)
);

Once the table is created, it’s time to load the data.

How to Load CSV Data Into the Table

TSMC to begin construction on 3rd fab in Arizona in mid-2025, entire year earlier than planned

TL;DR: TSMC plans to build a third fab in Arizona by mid-2025, a year earlier than planned, and may establish a new CoWoS advanced packaging

February 16, 2025

3thix partners with Avalanche on web3 gaming ad data

GamesBeat Next is almost here! GB Next is the premier event for product leaders and leadership in the gaming industry. Coming up October 28th and

October 3, 2024

HBM Revenue Poised To Cross $10B as SK hynix Predicts First Double-Digit Revenue Share

Offering some rare insight into the scale of HBM memory sales – and on its growth in the face of unprecedented demand from AI accelerator

March 28, 2024

Supercharge Your Portfolio with Future Tech Stocks!

Join us for Profitable Insights & Expert Tips!

With expert analysis, comprehensive market coverage, and actionable insights, our newsletter equips you with the knowledge & tools necessary to make informed decisions & maximize your potential returns in the dynamic world of future tech stocks.