Remove Special Characters from Column in PySpark DataFrame


Kontext | 8/19/2022

Code description

Spark SQL function `regexp_replace` can be used to remove special characters from a string column in a Spark DataFrame. Depending on the definition of "special characters", the regular expression can vary. For instance, `[^0-9a-zA-Z_\-]+` matches any run of characters that are not alphanumeric, hyphen (-), or underscore (_); the regular expression `[@\+\#\$\%\^\!]+` matches only those explicitly listed special characters.

This code snippet replaces special characters with an empty string.

Output:

    +---+--------------------------+
    |id |str                       |
    +---+--------------------------+
    |1  |ABCDEDF!@#$%%^123456qwerty|
    |2  |ABCDE!!!                  |
    +---+--------------------------+
    
    +---+-------------------+
    | id|       replaced_str|
    +---+-------------------+
    |  1|ABCDEDF123456qwerty|
    |  2|              ABCDE|
    +---+-------------------+  
    

Code snippet

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_replace
    
    app_name = "PySpark regexp_replace Example"
    master = "local"
    
    spark = SparkSession.builder \
        .appName(app_name) \
        .master(master) \
        .getOrCreate()
    
    spark.sparkContext.setLogLevel("WARN")
    
    data = [[1, 'ABCDEDF!@#$%%^123456qwerty'],
            [2, 'ABCDE!!!']
            ]
    
    df = spark.createDataFrame(data, ['id', 'str'])
    
    df.show(truncate=False)
    
    # Use a raw string so the backslash in the pattern is passed through as-is
    df = df.select("id", regexp_replace("str", r"[^0-9a-zA-Z_\-]+", ""
                                        ).alias('replaced_str'))
    
    df.show()
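As a quick sanity check (not part of the original snippet), both regular expressions can be tried with Python's built-in `re` module, whose `re.sub` behaves like `regexp_replace` applied to a single string:

```python
import re

# Pattern 1: remove everything that is NOT alphanumeric, hyphen, or underscore
keep_allowed = r"[^0-9a-zA-Z_\-]+"
# Pattern 2: remove only the explicitly listed special characters
strip_listed = r"[@\+\#\$\%\^\!]+"

sample = "ABCDEDF!@#$%%^123456qwerty"

print(re.sub(keep_allowed, "", sample))  # ABCDEDF123456qwerty
print(re.sub(strip_listed, "", sample))  # ABCDEDF123456qwerty
```

For this sample both patterns produce the same result, since all of its special characters happen to be in the listed set; for other inputs the two patterns can behave differently (e.g. a space is removed by the first pattern but kept by the second).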
    
