Think before using this common option when reading large CSVs
Whether you’re a data scientist, data engineer, or programmer, reading and processing CSV data will be one of your bread-and-butter skills for years.
Most programming languages can, either natively or via a library, read and write CSV data files, and PySpark is no exception.
It provides a very useful spark.read interface. You've probably used it along with its inferSchema option many times. So often, in fact, that it almost becomes habitual.
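The habitual read usually looks something like this. It's a minimal sketch; the file name sales.csv is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-read").getOrCreate()

# The familiar one-liner: let Spark scan the data and guess each column's type
df = spark.read.csv(
    "sales.csv",       # hypothetical file path
    header=True,       # treat the first line as column names
    inferSchema=True,  # ask Spark to work out the data types for us
)
df.printSchema()
```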
If that's you, I hope to convince you in this article that, when reading large CSV files, this is usually a bad idea from a performance perspective, and I'll show you what to do instead.
Firstly, we should examine where and when inferSchema is used and why it's so popular.
The where and when is easy: inferSchema is passed explicitly as an option to the spark.read call when reading CSV files into Spark DataFrames.
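In its most explicit form, the option is set on the reader itself. This is the same read as above expressed with .option() calls, again with a hypothetical file name:

```python
# Equivalent option-based form of the CSV read
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")  # the option this article is about
    .load("sales.csv")              # hypothetical file path
)
```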
You might ask, “What about other types of files?”
The schema for Parquet and ORC data files is already stored within the files themselves, so explicit schema inference is not required.
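To illustrate, reading a Parquet file needs no inference option at all. A minimal sketch, assuming a hypothetical sales.parquet file:

```python
# Parquet stores its schema in the file metadata, so Spark reads the
# column names and types directly rather than inferring them
df = spark.read.parquet("sales.parquet")  # hypothetical file path
df.printSchema()  # schema comes from the file's metadata, not a data scan
```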