In a Spark DataFrame, we can use where or filter to filter out unwanted records. where is an alias for filter. Filter conditions can be written either as SQL-style expression strings or as column expressions (built with Spark SQL functions) that evaluate to true or false. We can use & (and) or | (or) to combine multiple conditions in one filter.
Code snippet
The following script shows how to use filters.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

appName = "PySpark DataFrame - where or filter"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()
spark.sparkContext.setLogLevel('WARN')

data = [{"a": "100", "b": "200"},
        {"a": "1000", "b": "2000"}]
df = spark.createDataFrame(data)
df.show()

# SQL-style condition passed as a string
df.where("a > 100").show()
# Column expression using the DataFrame attribute syntax
df.filter(df.a > 100).show()
# Multiple conditions combined with &; each condition is wrapped in parentheses
df.where((df.b > 100) & (df.a == 1000)).show()
# The same conditions built with the col function
df.where((F.col('b') > 100) & (F.col('a') == 1000)).show()
Output of df.show():
+----+----+
| a| b|
+----+----+
| 100| 200|
|1000|2000|
+----+----+
Note: the four filter statements all return the same result; only the second row (a = 1000, b = 2000) satisfies each condition.
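The script above only combines conditions with &. As a minimal sketch reusing the same df, | (or) and ~ (not) work the same way; each condition still needs its own parentheses because Python's bitwise operators bind more tightly than the comparison operators.

# Rows where a equals 100 OR b is greater than 1000
df.where((df.a == 100) | (df.b > 1000)).show()
# Negate a condition with ~
df.filter(~(F.col('a') == 100)).show()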
Filter out null or None values
Refer to Filter Spark DataFrame Columns with None or Null Values.
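As a quick illustration (assuming a DataFrame whose column 'a' may contain nulls), the isNull and isNotNull column methods can be used with where or filter in the same way:

# Keep only rows where column 'a' is not null
df.filter(F.col('a').isNotNull()).show()
# Keep only rows where column 'a' is null
df.where(F.col('a').isNull()).show()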