Spark DataFrames support complex data types such as arrays. This code snippet shows how to check whether a specific value exists in an array column using the array_contains function.
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType
from pyspark.sql.functions import array_contains

appName = "PySpark Example - array_contains"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Sample data
data = [(1, ['apple', 'pear', 'kiwi']), (2, ['apple']), (3, ['pear', 'berry'])]

# Schema
schema = StructType([StructField("ID", IntegerType(), True),
                     StructField("Tags", ArrayType(StringType()), True)])

# Create Spark DataFrame from the in-memory data
df = spark.createDataFrame(data, schema)
print(df.schema)
df.show()

# Show only the records whose Tags column contains 'apple'
df.where(array_contains('Tags', 'apple')).show()

# Show only the records whose Tags column does not contain 'apple'
df.where(~array_contains('Tags', 'apple')).show()

spark.stop()
The code snippet constructs a Spark DataFrame using data in memory. The schema looks like the following:
StructType(List(StructField(ID,IntegerType,true),StructField(Tags,ArrayType(StringType,true),true)))
The output:
+---+-------------------+
| ID| Tags|
+---+-------------------+
| 1|[apple, pear, kiwi]|
| 2| [apple]|
| 3| [pear, berry]|
+---+-------------------+
+---+-------------------+
| ID| Tags|
+---+-------------------+
| 1|[apple, pear, kiwi]|
| 2| [apple]|
+---+-------------------+
+---+-------------+
| ID| Tags|
+---+-------------+
| 3|[pear, berry]|
+---+-------------+
The second result shows the records whose Tags array column contains 'apple'; the third shows the records that do not.
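The same predicate can also be expressed in Spark SQL, since array_contains is available as a built-in SQL function. The sketch below (assuming a local Spark installation; the view name fruits is arbitrary) registers the DataFrame as a temporary view and runs the equivalent queries:

from pyspark.sql import SparkSession

# Create a local Spark session for the example
spark = SparkSession.builder \
    .appName("array_contains - SQL variant") \
    .master("local") \
    .getOrCreate()

# Same sample data as above; the schema is given as a DDL string this time
df = spark.createDataFrame(
    [(1, ['apple', 'pear', 'kiwi']), (2, ['apple']), (3, ['pear', 'berry'])],
    "ID INT, Tags ARRAY<STRING>")

# Register a temporary view so the DataFrame can be queried with SQL
df.createOrReplaceTempView("fruits")

# array_contains works as a SQL function too
with_apple = spark.sql(
    "SELECT ID FROM fruits WHERE array_contains(Tags, 'apple')")
without_apple = spark.sql(
    "SELECT ID FROM fruits WHERE NOT array_contains(Tags, 'apple')")

with_apple_ids = sorted(row.ID for row in with_apple.collect())
without_apple_ids = sorted(row.ID for row in without_apple.collect())
print(with_apple_ids, without_apple_ids)

spark.stop()

This produces the same partitioning of rows as the DataFrame API calls above: IDs 1 and 2 contain 'apple', ID 3 does not.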