Parsing Excel formulas in PySpark

This blog shows a couple of examples of how you can use the Spark Excel Library in PySaprk to parse Excel formulas:

Example 1
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Create a SparkSession
spark = SparkSession.builder.appName("Excel Formulas").getOrCreate()
# Read an Excel file into a DataFrame
df = spark.read.format("com.crealytics.spark.excel").option("useHeader", "true").option("treatEmptyValuesAsNulls", "true").option("inferSchema", "true").option("addColorColumns", "False").load("path/to/excel/file.xlsx")
# Define a function to parse formulas
def parse_formulas(col_name):
formulas = df.select(col_name).filter(col(col_name).rlike("^=.*"))
return formulas
# Use the function to parse formulas in column "A"
formulas_col_A = parse_formulas("A")
formulas_col_A.show()

In this example, the code uses the “com.crealytics.spark.excel” library to read an Excel file into a DataFrame. Then it defines a function “parse_formulas” to filter out the rows in a given column that contain formulas (i.e. start with “=”) and returns those rows as a new DataFrame. Finally, it uses the function to parse formulas in column “A” of the Excel file and display the results.

Example 2
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Create a SparkSession
spark = SparkSession.builder.appName("Excel Formulas").getOrCreate()
# Read the Excel file
df = spark.read.format("com.crealytics.spark.excel")\
.option("header", "true")\
.option("treatEmptyValuesAsNulls", "true")\
.option("inferSchema", "true")\
.option("addColorColumns", "False")\
.load("path/to/your/excel/file.xlsx")
# Create a new column that extracts the formulas from the specified column
df = df.withColumn("formulas", col("formula_column").rlike("^=[A-Z]+"))
# Show the result
df.select("formulas").show()

In this example, the df.withColumn("formulas", col("formula_column").rlike("^=[A-Z]+")) the statement creates a new column called “formulas” that contains only the formulas from the specified column, “formula_column”, using the rlike() function to match the regular expression “^=[A-Z]+”. The ^= is used to match the first character of the formula, which is “=” and the [A-Z]+ is used to match any uppercase letters that come after the “=”. The show() method at the end of the code snippet displays the result.

You can adjust this code to match your excel sheet and columns accordingly. Also, you can change the regular expression to match your formula if it has a different format.

You May Also Like

About the Author: Irtaza

Leave a Reply

Your email address will not be published. Required fields are marked *

Bitnami