Parsing Excel formulas in PySpark

This post shows a couple of examples of how you can use the Spark Excel library in PySpark to parse Excel formulas:

Example 1

	from pyspark.sql import SparkSession
	from pyspark.sql.functions import col

	# Create a SparkSession
	spark = SparkSession.builder.appName("Excel Formulas").getOrCreate()

	# Read an Excel file into a DataFrame
	# (the option is "header" in recent spark-excel releases; older ones call it "useHeader")
	df = (
	    spark.read.format("com.crealytics.spark.excel")
	    .option("header", "true")
	    .option("treatEmptyValuesAsNulls", "true")
	    .option("inferSchema", "true")
	    .option("addColorColumns", "False")
	    .load("path/to/excel/file.xlsx")
	)

	# Define a function that keeps only the rows whose value starts with "="
	def parse_formulas(col_name):
	    formulas = df.select(col_name).filter(col(col_name).rlike("^=.*"))
	    return formulas

	# Use the function to parse formulas in column "A"
	formulas_col_A = parse_formulas("A")
	formulas_col_A.show()

In this example, the code uses the “com.crealytics.spark.excel” library to read an Excel file into a DataFrame. It then defines a function, “parse_formulas”, that keeps only the rows of a given column whose value is a formula (i.e. starts with “=”) and returns them as a new DataFrame. Finally, it calls the function on column “A” of the Excel file and displays the result.
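
The pattern used by the filter can be tried out locally before running it on a cluster. The sketch below uses Python's re module rather than Spark, on the assumption that for a simple pattern like ^=.* re.search behaves the same way as rlike (both look for a match anywhere in the string, and ^ pins it to the start). The helper is_formula and the sample cell values are hypothetical, and the sketch assumes the column holds the formula text as strings.

```python
import re

# Same pattern as in Example 1: "=" as the first character, then anything.
FORMULA_PATTERN = re.compile(r"^=.*")


def is_formula(value):
    """Return True if value is a string that starts with "="."""
    # Null cells (None) and non-string cells are never treated as formulas.
    if not isinstance(value, str):
        return False
    return FORMULA_PATTERN.search(value) is not None


# Hypothetical cell values, as they might appear in column "A".
cells = ["=SUM(A1:A3)", "42", "total", "=a1+b1", " =A1", None]
formulas = [c for c in cells if is_formula(c)]
print(formulas)
```

Note that " =A1" is rejected because of the leading space, so values may need trimming first if the sheet contains stray whitespace.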

Example 2

	from pyspark.sql import SparkSession
	from pyspark.sql.functions import col

	# Create a SparkSession
	spark = SparkSession.builder.appName("Excel Formulas").getOrCreate()

	# Read the Excel file
	df = spark.read.format("com.crealytics.spark.excel")\
	.option("header", "true")\
	.option("treatEmptyValuesAsNulls", "true")\
	.option("inferSchema", "true")\
	.option("addColorColumns", "False")\
	.load("path/to/your/excel/file.xlsx")

	# Add a Boolean column that is true where the value looks like a formula
	df = df.withColumn("formulas", col("formula_column").rlike("^=[A-Z]+"))

	# Show the result
	df.select("formulas").show()

In this example, the statement df.withColumn("formulas", col("formula_column").rlike("^=[A-Z]+")) adds a new column called “formulas”. Because rlike() returns a Boolean, the column holds true for rows where “formula_column” matches the regular expression “^=[A-Z]+” and false otherwise; it does not hold the formula text itself. In the pattern, ^= matches “=” as the first character of the value, and [A-Z]+ matches one or more uppercase letters right after it. The show() call at the end of the snippet displays the result.
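
The effect of ^=[A-Z]+ is easiest to see on a few sample strings. The sketch below, in plain Python with the re module rather than Spark, returns the run of uppercase letters that follows the “=” for each value, or None when the pattern does not match. The helper leading_name, the added capture group and the sample values are hypothetical. In PySpark, the built-in regexp_extract function with a capture group plays a similar role when the matched text is wanted rather than a true/false flag.

```python
import re

# Same pattern as in Example 2, with a capture group around the letters.
NAME_PATTERN = re.compile(r"^=([A-Z]+)")


def leading_name(value):
    """Return the uppercase letters right after "=", or None if no match."""
    if not isinstance(value, str):
        return None
    match = NAME_PATTERN.search(value)
    return match.group(1) if match else None


# Hypothetical cell values.
samples = ["=SUM(A1:A3)", "=A1+B1", "=sum(a1:a3)", "=1+2", "SUM(A1)"]
for s in samples:
    print(f"{s!r}: {leading_name(s)}")
```

Lowercase formulas such as "=sum(a1:a3)" and purely numeric ones such as "=1+2" do not match, so this pattern is stricter than the ^=.* used in Example 1.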

You can adjust this code to match your Excel sheet and column names. You can also change the regular expression if your formulas follow a different format.
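
As an illustration of adjusting the regular expression, the sketch below compares a few alternative patterns on sample strings. The pattern names, the patterns themselves and the helper matching_patterns are assumptions made for this example, not part of the library. It is written in plain Python; rlike uses Java regular expressions, which, as far as I know, accept the same syntax for these particular patterns.

```python
import re

# Hypothetical alternative patterns; pick the one that fits your sheet.
PATTERNS = {
    "any formula": r"^=",                  # anything starting with "="
    "function call": r"^=[A-Za-z]+\(",     # "=" + letters + "("
    "SUM, any case": r"(?i)^=SUM\(",       # SUM( in upper or lower case
}


def matching_patterns(value):
    """Return the names of all patterns that match value."""
    return [name for name, pat in PATTERNS.items() if re.search(pat, value)]


for sample in ["=sum(A1:A3)", "=VLOOKUP(A1,B:C,2)", "=A1+B1", "total"]:
    print(sample, "->", matching_patterns(sample))
```

Any of these pattern strings could be passed to rlike() in place of the ones used in the examples above.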

Irtaza Hassan

irtaza@gmail.com
