PySpark DataFrame - Select Columns using select Function

Kontext, 8/11/2022

Code description

In PySpark, we can use the select function to select a subset of the columns, or all columns, from a DataFrame.

Syntax

    DataFrame.select(*cols)

This function returns a new DataFrame object that contains the columns produced by the projection expressions passed in cols. The original DataFrame is not modified.

The final select call in the code snippet on this page prints the following output:

    +---+----------------+-------+---+
    | id|customer_profile|   name|age|
    +---+----------------+-------+---+
    |  1|    {Kontext, 3}|Kontext|  3|
    |  2|      {Tech, 10}|   Tech| 10|
    +---+----------------+-------+---+  
    

Code snippet

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    
    appName = "PySpark Example - select"
    master = "local"
    
    # Create Spark session
    spark = (SparkSession.builder
             .appName(appName)
             .master(master)
             .getOrCreate())
    
    spark.sparkContext.setLogLevel("WARN")
    
    data = [{"id": 1, "customer_profile": {"name": "Kontext", "age": 3}},
            {"id": 2, "customer_profile": {"name": "Tech", "age": 10}}]
    
    customer_schema = StructType([
        StructField('name', StringType(), True),
        StructField('age', IntegerType(), True),
    ])
    df_schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("customer_profile", customer_schema, False),
    ])
    df = spark.createDataFrame(data, df_schema)
    print(df.schema)
    df.show()
    
    # Select all existing columns plus the nested name and age fields as top-level columns
    df.select('*', "customer_profile.name", "customer_profile.age").show()