PySpark DataFrame - percent_rank() Function

Code description

Spark SQL provides the PERCENT_RANK window function (see Spark SQL - PERCENT_RANK Window Function). This code snippet computes the same percentile ranking (relative ranking) directly with the PySpark DataFrame percent_rank API instead of Spark SQL.

Output:

    +-------+-----+------------------+
    |Student|Score|      percent_rank|
    +-------+-----+------------------+
    |    101|   56|               0.0|
    |    109|   66|0.1111111111111111|
    |    103|   70|0.2222222222222222|
    |    110|   73|0.3333333333333333|
    |    107|   75|0.4444444444444444|
    |    102|   78|0.5555555555555556|
    |    108|   81|0.6666666666666666|
    |    104|   93|0.7777777777777778|
    |    105|   95|0.8888888888888888|
    |    106|   95|0.8888888888888888|
    +-------+-----+------------------+
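The values in the table can be checked by hand: percent rank is (rank - 1) / (n - 1), where rank is the row's 1-based rank in ascending score order (ties share the lowest rank) and n is the number of rows in the window. The following is a minimal plain-Python sketch of that formula, not Spark's implementation; the helper name `percent_rank_sketch` is hypothetical and only used for illustration.

```python
def percent_rank_sketch(scores):
    """Return the percent rank of each score, in input order.

    Hypothetical helper: applies (rank - 1) / (n - 1), where tied
    values share the rank of their first position in sorted order.
    """
    n = len(scores)
    if n <= 1:
        # With a single row there is nothing to compare against.
        return [0.0] * n
    ordered = sorted(scores)
    # list.index returns the first matching position, which equals rank - 1,
    # so tied values receive the same result.
    return [ordered.index(s) / (n - 1) for s in scores]


# Scores from the output table above, in ascending order.
scores = [56, 66, 70, 73, 75, 78, 81, 93, 95, 95]
ranks = percent_rank_sketch(scores)
for score, pr in zip(scores, ranks):
    print(score, pr)
```

With 10 rows the denominator is 9, so the lowest score maps to 0.0 and the two students tied at 95 both map to 8/9. No row reaches 1.0 here because the highest score is shared.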

Code snippet

    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import percent_rank
    
    app_name = "PySpark percent_rank Window Function"
    master = "local"
    
    spark = (SparkSession.builder
             .appName(app_name)
             .master(master)
             .getOrCreate())
    
    spark.sparkContext.setLogLevel("WARN")
    
    data = [
        [101, 56],
        [102, 78],
        [103, 70],
        [104, 93],
        [105, 95],
        [106, 95],
        [107, 75],
        [108, 81],
        [109, 66],
        [110, 73]]
    
    df = spark.createDataFrame(data, ['Student', 'Score'])
    
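    # Rank all rows by Score. No partitionBy is used, so Spark moves every row
    # into a single partition; that is fine for this small example.
    # The explicit frame (unbounded preceding to current row) is the frame
    # ranking functions use anyway, so it can be omitted.
    window = Window.orderBy("Score").rowsBetween(
        Window.unboundedPreceding, Window.currentRow)
    df = df.withColumn('percent_rank', percent_rank().over(window))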
    
    df.show()
