Remove Special Characters from Column in PySpark DataFrame


Kontext | 8/19/2022

Code description

Spark SQL function `regexp_replace` can be used to remove special characters from a string column in a Spark DataFrame. Depending on the definition of "special characters", the regular expression can vary. For instance, `[^0-9a-zA-Z_\-]+` matches any run of characters that are not alphanumeric, hyphen (-), or underscore (_); the regular expression `[@\+\#\$\%\^\!]+` matches only those explicitly listed special characters.

This code snippet replaces special characters with an empty string.

Output:

    +---+--------------------------+
    |id |str                       |
    +---+--------------------------+
    |1  |ABCDEDF!@#$%%^123456qwerty|
    |2  |ABCDE!!!                  |
    +---+--------------------------+
    
    +---+-------------------+
    | id|       replaced_str|
    +---+-------------------+
    |  1|ABCDEDF123456qwerty|
    |  2|              ABCDE|
    +---+-------------------+  
    

Code snippet

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_replace
    
    app_name = "PySpark regexp_replace Example"
    master = "local"
    
    spark = SparkSession.builder \
        .appName(app_name) \
        .master(master) \
        .getOrCreate()
    
    spark.sparkContext.setLogLevel("WARN")
    
    data = [[1, 'ABCDEDF!@#$%%^123456qwerty'],
            [2, 'ABCDE!!!']
            ]
    
    df = spark.createDataFrame(data, ['id', 'str'])
    
    df.show(truncate=False)
    
    # Use a raw string so the backslash in the pattern is passed through as-is
    df = df.select("id", regexp_replace("str", r"[^0-9a-zA-Z_\-]+", ""
                                        ).alias('replaced_str'))
    
    df.show()
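As a quick sanity check (not part of the original snippet), both regular expressions can be tried with Python's built-in `re` module, whose `re.sub` behaves like `regexp_replace` applied to a single string:

```python
import re

# Pattern 1: remove everything that is NOT alphanumeric, hyphen, or underscore
keep_allowed = r"[^0-9a-zA-Z_\-]+"
# Pattern 2: remove only the explicitly listed special characters
strip_listed = r"[@\+\#\$\%\^\!]+"

sample = "ABCDEDF!@#$%%^123456qwerty"

print(re.sub(keep_allowed, "", sample))  # ABCDEDF123456qwerty
print(re.sub(strip_listed, "", sample))  # ABCDEDF123456qwerty
```

For this sample both patterns produce the same result, since all of its special characters happen to be in the listed set; for other inputs the two patterns can behave differently (e.g. a space is removed by the first pattern but kept by the second).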
    
