Think before using this common option when reading large CSVs
Whether you’re a data scientist, data engineer, or programmer, reading and processing CSV data will be one of your bread-and-butter skills for years.
Most programming languages can, either natively or via a library, read and write CSV data files, and PySpark is no exception.
It provides a very useful spark.read interface. You've probably used it along with its inferSchema option many times. So often, in fact, that it almost becomes habitual.
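The habitual read usually looks something like this. It's a minimal sketch; the file name sales.csv is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-read").getOrCreate()

# The familiar one-liner: let Spark scan the data and guess each column's type
df = spark.read.csv(
    "sales.csv",       # hypothetical file path
    header=True,       # treat the first line as column names
    inferSchema=True,  # ask Spark to work out the data types for us
)
df.printSchema()
```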
If that's you, I hope to convince you in this article that, when reading large CSV files, this is usually a bad idea from a performance perspective, and I'll show you what to do instead.
Firstly, we should examine where and when inferSchema is used and why it's so popular.
The where and when is easy: inferSchema is passed explicitly as an option to the spark.read call when reading CSV files into Spark DataFrames.
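In its most explicit form, the option is set on the reader itself. This is the same read as above expressed with .option() calls, again with a hypothetical file name:

```python
# Equivalent option-based form of the CSV read
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")  # the option this article is about
    .load("sales.csv")              # hypothetical file path
)
```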
You might ask, “What about other types of files?”
The schema for Parquet and ORC data files is already stored within the files themselves, so explicit schema inference is not required.
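To illustrate, reading a Parquet file needs no inference option at all. A minimal sketch, assuming a hypothetical sales.parquet file:

```python
# Parquet stores its schema in the file metadata, so Spark reads the
# column names and types directly rather than inferring them
df = spark.read.parquet("sales.parquet")  # hypothetical file path
df.printSchema()  # schema comes from the file's metadata, not a data scan
```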