PySpark DataFrame - inner, left and right Joins


Code description

This code snippet shows you how to perform inner, left and right joins with the DataFrame.join API.

    def join(self, other, on=None, how=None)

Here, other is the DataFrame to join with, on specifies the join column name(s) or a join expression, and how specifies the join type. The method returns a new, joined DataFrame.

Supported join types

The default join type is inner. The supported values for the how parameter are: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti.

To learn about these different join types, refer to the article Spark SQL Joins with Examples (https://kontext.tech/article/1003/spark-sql-joins-with-examples).
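For instance, the semi and anti join types return columns from the left DataFrame only. A minimal sketch using the df1 and df2 DataFrames defined in the code snippet further below:

    # Semi join: rows of df1 that have a match in df2; columns come from df1 only.
    df1.join(df2, on='id', how='left_semi').show()

    # Anti join: rows of df2 that have no match in df1 (here, only id 3).
    df2.join(df1, on='id', how='left_anti').show()
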

Join via multiple columns

If there is more than one column to join on, we can specify the on parameter as a list of column names:

    df1.join(df2, on=['id','other_column'], how='left')  
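As an illustration, here is a minimal, self-contained sketch of a multi-column left join; the left and right DataFrames here are hypothetical and not part of the code snippet below:

    # Hypothetical DataFrames, for illustration only.
    left = spark.createDataFrame(
        [(1, 'US', 'A'), (2, 'UK', 'B')], ['id', 'region', 'attr'])
    right = spark.createDataFrame(
        [(1, 'US', 100), (3, 'UK', 300)], ['id', 'region', 'attr_int'])

    # Rows pair up only when both id and region match; unmatched left rows get null.
    left.join(right, on=['id', 'region'], how='left').show()
    # +---+------+----+--------+
    # | id|region|attr|attr_int|
    # +---+------+----+--------+
    # |  1|    US|   A|     100|
    # |  2|    UK|   B|    null|
    # +---+------+----+--------+
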
    

Output from the code snippet below: first df1 and df2, then the inner, left and right join results.

    +---+----+
    | id|attr|
    +---+----+
    |  1|   A|
    |  2|   B|
    +---+----+
    
    +---+--------+
    | id|attr_int|
    +---+--------+
    |  1|     100|
    |  2|     200|
    |  3|     300|
    +---+--------+
    
    +---+----+--------+
    | id|attr|attr_int|
    +---+----+--------+
    |  1|   A|     100|
    |  2|   B|     200|
    +---+----+--------+
    
    +---+----+--------+
    | id|attr|attr_int|
    +---+----+--------+
    |  1|   A|     100|
    |  2|   B|     200|
    +---+----+--------+
    
    +---+----+--------+
    | id|attr|attr_int|
    +---+----+--------+
    |  1|   A|     100|
    |  2|   B|     200|
    |  3|null|     300|
    +---+----+--------+  
    

Code snippet

    from pyspark.sql import SparkSession
    from pyspark import SparkConf
    
    app_name = "PySpark - Joins Example"
    master = "local[8]"
    
    conf = SparkConf().setAppName(app_name).setMaster(master)

    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    
    spark.sparkContext.setLogLevel("WARN")
    
    # Create two small DataFrames to join.
    df1 = spark.createDataFrame([[1, 'A'], [2, 'B']], ['id', 'attr'])
    df1.show()
    df2 = spark.createDataFrame([[1, 100], [2, 200], [3, 300]], ['id', 'attr_int'])
    df2.show()

    # Inner join: only rows whose id exists in both DataFrames.
    df = df1.join(df2, on='id', how='inner')
    df.show()

    # Left join: all rows from df1, with nulls where df2 has no match.
    df = df1.join(df2, on='id', how='left')
    df.show()

    # Right join: all rows from df2, with nulls where df1 has no match.
    df = df1.join(df2, on='id', how='right')
    df.show()
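One thing to note: when on is a string or a list of strings, the join column appears only once in the result; when on is a Column expression, both DataFrames keep their own copy of it. A minimal sketch reusing df1 and df2 from above:

    # Joining on an expression keeps id from both DataFrames,
    # so select the columns explicitly to disambiguate.
    df = df1.join(df2, on=df1['id'] == df2['id'], how='inner')
    df.select(df1['id'], df1['attr'], df2['attr_int']).show()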