PySpark DataFrame - percent_rank() Function

Kontext, 8/18/2022

Code description

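In Spark SQL, the `PERCENT_RANK` window function ([Spark SQL - PERCENT\_RANK Window Function](https://kontext.tech/article/842/spark-sql-percent-rank-window-function)) calculates the percentile ranking (relative ranking) of each row within a window. This code snippet implements the same ranking directly with the PySpark DataFrame `percent_rank` API instead of Spark SQL.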

Output:

    +-------+-----+------------------+
    |Student|Score|      percent_rank|
    +-------+-----+------------------+
    |    101|   56|               0.0|
    |    109|   66|0.1111111111111111|
    |    103|   70|0.2222222222222222|
    |    110|   73|0.3333333333333333|
    |    107|   75|0.4444444444444444|
    |    102|   78|0.5555555555555556|
    |    108|   81|0.6666666666666666|
    |    104|   93|0.7777777777777778|
    |    105|   95|0.8888888888888888|
    |    106|   95|0.8888888888888888|
    +-------+-----+------------------+  
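
The values in this output follow from the usual definition of percent rank: (rank - 1) / (number of rows - 1), where tied scores share the lowest rank. As a sanity check, here is a minimal plain-Python sketch (no Spark needed) that recomputes the column from the same scores. The helper name `percent_ranks` is hypothetical and not part of PySpark.

```python
def percent_ranks(scores):
    """Return (rank - 1) / (n - 1) for each score, ranked in ascending order.

    Tied scores share the smallest rank, like SQL RANK().
    Illustrative helper only; this is not a PySpark function.
    """
    n = len(scores)
    if n <= 1:
        # A single row (or no rows) gets a percent rank of 0.0
        return [0.0] * n
    ordered = sorted(scores)
    # ordered.index(s) counts the values strictly smaller than s,
    # which equals rank - 1 when ties share the lowest rank
    return [ordered.index(s) / (n - 1) for s in scores]


students = range(101, 111)
scores = [56, 78, 70, 93, 95, 95, 75, 81, 66, 73]

# Print each student with the recomputed percent rank
for student, pr in zip(students, percent_ranks(scores)):
    print(student, pr)
```

Students 105 and 106 both score 95, so they share rank 9 and both get 8/9, which matches the last two rows of the table.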
    

Code snippet

    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import percent_rank
    
    app_name = "PySpark percent_rank Window Function"
    master = "local"
    
    spark = (
        SparkSession.builder
        .appName(app_name)
        .master(master)
        .getOrCreate()
    )
    
    spark.sparkContext.setLogLevel("WARN")
    
    data = [
        [101, 56],
        [102, 78],
        [103, 70],
        [104, 93],
        [105, 95],
        [106, 95],
        [107, 75],
        [108, 81],
        [109, 66],
        [110, 73]]
    
    df = spark.createDataFrame(data, ['Student', 'Score'])
    
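    # Rank all rows by Score in ascending order. The explicit frame matches the
    # one Spark requires for ranking functions, so it could also be omitted.
    # Note: with no partitionBy, Spark moves all rows to a single partition.
    window = Window.orderBy("Score").rowsBetween(
        Window.unboundedPreceding, Window.currentRow)
    # percent_rank = (rank - 1) / (number of rows in the window - 1)
    df = df.withColumn('percent_rank', percent_rank().over(window))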
    
    df.show()
    
pyspark spark-sql
