Spark has become a standard for performing analysis on huge amounts of data due to its distributed nature. SparkSQL evolved as a necessary component of Spark due to the need for working with structured data.
There are many times when there is a need to query data in Spark with SQL commands. Doing so isn’t complicated at all, and in fact you can use the same SQL syntax you would use in a relational database as long as you execute a few Spark-specific commands first. Here’s what it looks like:
First we’ll need to create our Spark Session and application:
In [16]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName(“example”).getOrCreate()
Next we can load a csv. In the code below we’ll create a dataframe named “df”, use the Databricks CSV loader, specify that the data has a header and uses a comma for a delimiter, and give the path to the file. I like to use this version of the read command because it allows the user to specify the delimiter being used (not all delimited files use commas).
In [17]:
df = spark.read.format(“com.databricks.spark.csv”).option(“header”, “true”).option(“delimiter”,”,”).load(“/Users/randalking/Desktop/Mac/Superstore.csv”)
In [18]:
df.show(10)
+——+——–+———-+————–+————–+———+——–+————–+——-+———-+————-+——————+——–+——-+—————-+—————-+——————–+——————–+—————–+——————-+———+
|Row ID|Order ID|Order Date|Order Priority|Order Quantity| Sales|Discount| Ship Mode| Profit|Unit Price|Shipping Cost| Customer Name|Province| Region|Customer Segment|Product Category|Product Sub-Category| Product Name|Product Container|Product Base Margin|Ship Date|
+——+——–+———-+————–+————–+———+——–+————–+——-+———-+————-+——————+——–+——-+—————-+—————-+——————–+——————–+—————–+——————-+———+
| 1| 3| 10/13/10| Low| 6| 261.54| 0.04| Regular Air|-213.25| 38.94| 35|Muhammed MacIntyre| Nunavut|Nunavut| Small Business| Office Supplies|Storage & Organiz…|Eldon Base for st…| Large Box| 0.8| 10/20/10|
| 49| 293| 10/1/12| High| 49| 10123.02| 0.07|Delivery Truck| 457.81| 208.16| 68.02| Barry French| Nunavut|Nunavut| Consumer| Office Supplies| Appliances|”1.7 Cubic Foot C…| Jumbo Drum| 0.58| 10/2/12|
| 50| 293| 10/1/12| High| 27| 244.57| 0.01| Regular Air| 46.71| 8.69| 2.99| Barry French| Nunavut|Nunavut| Consumer| Office Supplies|Binders and Binde…|Cardinal Slant-D®…| Small Box| 0.39| 10/3/12|
| 80| 483| 7/10/11| High| 30|4965.7595| 0.08| Regular Air|1198.97| 195.99| 3.99| Clay Rozendal| Nunavut|Nunavut| Corporate| Technology|Telephones and Co…| R380| Small Box| 0.58| 7/12/11|
| 85| 515| 8/28/10| Not Specified| 19| 394.27| 0.08| Regular Air| 30.94| 21.78| 5.94| Carlos Soltero| Nunavut|Nunavut| Consumer| Office Supplies| Appliances|Holmes HEPA Air P…| Medium Box| 0.5| 8/30/10|
| 86| 515| 8/28/10| Not Specified| 21| 146.69| 0.05| Regular Air| 4.43| 6.64| 4.95| Carlos Soltero| Nunavut|Nunavut| Consumer| Furniture| Office Furnishings|G.E. Longer-Life …| Small Pack| 0.37| 8/30/10|
| 97| 613| 6/17/11| High| 12| 93.54| 0.03| Regular Air| -54.04| 7.3| 7.72| Carl Jackson| Nunavut|Nunavut| Corporate| Office Supplies|Binders and Binde…|Angle-D Binders w…| Small Box| 0.38| 6/17/11|
| 98| 613| 6/17/11| High| 22| 905.08| 0.09| Regular Air| 127.7| 42.76| 6.22| Carl Jackson| Nunavut|Nunavut| Corporate| Office Supplies|Storage & Organiz…|SAFCO Mobile Desk…| Small Box| null| 6/18/11|
| 103| 643| 3/24/11| High| 21| 2781.82| 0.07| Express Air|-695.26| 138.14| 35| Monica Federle| Nunavut|Nunavut| Corporate| Office Supplies|Storage & Organiz…|SAFCO Commercial …| Large Box| null| 3/25/11|
| 107| 678| 2/26/10| Low| 44| 228.41| 0.07| Regular Air|-226.36| 4.98| 8.33| Dorothy Badders| Nunavut|Nunavut| Home Office| Office Supplies| Paper| Xerox 198| Small Box| 0.38| 2/26/10|
+——+——–+———-+————–+————–+———+——–+————–+——-+———-+————-+——————+——–+——-+—————-+—————-+——————–+——————–+—————–+——————-+———+
only showing top 10 rows
In order to use SQL commands in SparkSQL we’ll need to register the data as a table:
In [19]:
df.registerTempTable(“Superstore”)
Finally, we can use SQL commands by invoking the spark.sql function and our command in quotes as an argument:
In [20]:
result = spark.sql(“SELECT * FROM Superstore LIMIT 5”).show()
+——+——–+———-+————–+————–+———+——–+————–+——-+———-+————-+——————+——–+——-+—————-+—————-+——————–+——————–+—————–+——————-+———+
|Row ID|Order ID|Order Date|Order Priority|Order Quantity| Sales|Discount| Ship Mode| Profit|Unit Price|Shipping Cost| Customer Name|Province| Region|Customer Segment|Product Category|Product Sub-Category| Product Name|Product Container|Product Base Margin|Ship Date|
+——+——–+———-+————–+————–+———+——–+————–+——-+———-+————-+——————+——–+——-+—————-+—————-+——————–+——————–+—————–+——————-+———+
| 1| 3| 10/13/10| Low| 6| 261.54| 0.04| Regular Air|-213.25| 38.94| 35|Muhammed MacIntyre| Nunavut|Nunavut| Small Business| Office Supplies|Storage & Organiz…|Eldon Base for st…| Large Box| 0.8| 10/20/10|
| 49| 293| 10/1/12| High| 49| 10123.02| 0.07|Delivery Truck| 457.81| 208.16| 68.02| Barry French| Nunavut|Nunavut| Consumer| Office Supplies| Appliances|”1.7 Cubic Foot C…| Jumbo Drum| 0.58| 10/2/12|
| 50| 293| 10/1/12| High| 27| 244.57| 0.01| Regular Air| 46.71| 8.69| 2.99| Barry French| Nunavut|Nunavut| Consumer| Office Supplies|Binders and Binde…|Cardinal Slant-D®…| Small Box| 0.39| 10/3/12|
| 80| 483| 7/10/11| High| 30|4965.7595| 0.08| Regular Air|1198.97| 195.99| 3.99| Clay Rozendal| Nunavut|Nunavut| Corporate| Technology|Telephones and Co…| R380| Small Box| 0.58| 7/12/11|
| 85| 515| 8/28/10| Not Specified| 19| 394.27| 0.08| Regular Air| 30.94| 21.78| 5.94| Carlos Soltero| Nunavut|Nunavut| Consumer| Office Supplies| Appliances|Holmes HEPA Air P…| Medium Box| 0.5| 8/30/10|
+——+——–+———-+————–+————–+———+——–+————–+——-+———-+————-+——————+——–+——-+—————-+—————-+——————–+——————–+—————–+——————-+———+