PySpark DataFrame - union, unionAll and unionByName

PySpark DataFrame provides three methods to union data together: `union`, `unionAll `and `unionByName`. The first two are like Spark SQL  UNION ALL clause which doesn't remove duplicates.  `unionAll `is the alias for `union`. We can use `distinct `method to deduplicate. The third function will use column names to resolve columns instead of positions. unionByName can also be used to merge two DataFrames with different schemas. ### Syntax of unionByName ``` DataFrame.unionByName(other, allowMissingColumns=False) ``` If `allowMissingColumns` is specified as `True`, the missing columns in both DataFrame will be added with default value `null`. This parameter is only available from Spark 3.1.0. For lower versions, the follow error may appear: > df1.unionByName(df4, allowMissingColumns=True).show(truncate=False) > TypeError: unionByName() got an unexpected keyword argument 'allowMissingColumns' ### Outputs The following are the outputs from the code snippet. ``` +---+---+---+ |c1 |c2 |c3 | +---+---+---+ |1 |A |100| |2 |B |200| +---+---+---+ +---+---+---+ |c1 |c2 |c3 | +---+---+---+ |1 |A |100| |1 |A |100| |2 |B |200| +---+---+---+ +---+---+---+ |c1 |c2 |c3 | +---+---+---+ |1 |A |100| |2 |B |200| +---+---+---+ +---+---+----+----+ |c1 |c2 |c3 |c4 | +---+---+----+----+ |1 |A |100 |null| |3 |C |null|ABC | +---+---+----+----+ ```

Kontext Kontext 0 3175 3.03 index 8/16/2022

Code description

PySpark DataFrame provides three methods to union data together: union, unionAll and unionByName. The first two are like Spark SQL  UNION ALL clause which doesn't remove duplicates.  unionAll is the alias for union. We can use distinct method to deduplicate. The third function will use column names to resolve columns instead of positions. unionByName can also be used to merge two DataFrames with different schemas. 

Syntax of unionByName

    DataFrame.unionByName(other, allowMissingColumns=False)  
    

If allowMissingColumns is specified as True, the missing columns in both DataFrame will be added with default value null. This parameter is only available from Spark 3.1.0. For lower versions, the follow error may appear:

    df1.unionByName(df4, allowMissingColumns=True).show(truncate=False)
TypeError: unionByName() got an unexpected keyword argument 'allowMissingColumns'

Outputs

The following are the outputs from the code snippet.

    +---+---+---+
    |c1 |c2 |c3 |
    +---+---+---+
    |1  |A  |100|
    |2  |B  |200|
    +---+---+---+
    
    +---+---+---+
    |c1 |c2 |c3 |
    +---+---+---+
    |1  |A  |100|
    |1  |A  |100|
    |2  |B  |200|
    +---+---+---+
    
    +---+---+---+
    |c1 |c2 |c3 |
    +---+---+---+
    |1  |A  |100|
    |2  |B  |200|
    +---+---+---+
    
    +---+---+----+----+
    |c1 |c2 |c3  |c4  |
    +---+---+----+----+
    |1  |A  |100 |null|
    |3  |C  |null|ABC |
    +---+---+----+----+  
    

Code snippet

    from pyspark.sql import SparkSession
    
    appName = "PySpark union Examples"
    master = "local"
    
    spark = SparkSession.builder         .appName(appName)         .master(master)         .getOrCreate()
    
    spark.sparkContext.setLogLevel("ERROR")
    
    # Create DataFrames
    df1 = spark.createDataFrame([[1, 'A', 100]], ['c1', 'c2', 'c3'])
    df2 = spark.createDataFrame([[1, 'A', 100], [2, 'B', 200]], ['c1', 'c2', 'c3'])
    df3 = spark.createDataFrame([[2, 200, 'B']], ['c1',  'c3', 'c2'])
    df4 = spark.createDataFrame([[3, 'C', 'ABC']], ['c1', 'c2', 'c4'])
    
    # Use union method
    df1.union(df2).distinct().show(truncate=False)
    
    # Use unionAll method
    df1.unionAll(df2).show(truncate=False)
    
    # Use unionByName method
    df1.unionByName(df3).show(truncate=False)
    df1.unionByName(df4, allowMissingColumns=True).show(truncate=False)
    
pyspark

Join the Discussion

View or add your thoughts below

Comments