[Spark] Testing new PySpark 3.0 DataFrame functions #3 (transform, overlay)

by 홍후추 2020. 7. 6.

 

2020/07/03 - [Data Science/Spark] - [Spark] Using Spark in Colab (pyspark)

 

○ transform
transform(func)
func: a function that takes a DataFrame and returns a DataFrame
Spark 3.0 makes it simple to apply such a function: pass it to transform instead of wrapping the DataFrame in a function call.
In particular, two or more functions can be chained in a single expression, which looks useful.

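from pyspark.sql import SparkSession, functions as func

# Reuse the running session (e.g. from the Colab setup post) or create one
spark = SparkSession.builder.getOrCreate()

# Prepare the test DataFrame
test_df = spark.createDataFrame([(1, 1.0), (2, 2.0)], ["int", "float"])
test_df.show()
test_df.printSchema()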

def cast_all_to_int(input_df):
    return input_df.select([func.col(col_name).cast("int") for col_name in input_df.columns])
# test_df.select([func.col(col_name).cast("int") for col_name in test_df.columns]).show()
def sort_columns_asc(input_df):
    return input_df.select(*sorted(input_df.columns))

# Before: call the function directly
as_was_df = cast_all_to_int(test_df)
as_was_df.show()
as_was_df.printSchema()

test_df.transform(cast_all_to_int).show()
test_df.transform(cast_all_to_int).printSchema()

The results are the same.

def cast_all_to_int(input_df):
    return input_df.select([func.col(col_name).cast("int") for col_name in input_df.columns])

def sort_columns_asc(input_df):
    return input_df.select(*sorted(input_df.columns))

# Previous approach: reassign after each function call
as_was_df = cast_all_to_int(test_df) 
as_was_df = sort_columns_asc(as_was_df)
as_was_df.show()
as_was_df.printSchema()

# With transform: chain the functions
test_df.transform(cast_all_to_int).transform(sort_columns_asc).show()
test_df.transform(cast_all_to_int).transform(sort_columns_asc).printSchema()

#๋™์ผํ•œ ๊ฒฐ๊ณผ

 

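○ overlay
- overlay(src, replace, pos, len=-1)
Replaces len characters of src, starting at the 1-based position pos, with replace. If len is omitted, the length of replace is used.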

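# overlay test (df is assumed to have string columns "year" and "month")
df.show()
# Replace 2 characters of "year", starting at position 3, with "month"
df.select(func.overlay("year", "month", 3, 2)).show()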
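Since the contents of df are not shown, the following pure-Python sketch models what overlay does to a single string. `overlay_str` is a hypothetical helper, not part of PySpark, it only covers pos >= 1, and the "2020" and "07" values are made-up sample data.

```python
def overlay_str(src, replace, pos, length=-1):
    """Pure-Python model of overlay for one string (sketch, pos >= 1 only)."""
    if pos < 1:
        raise ValueError("this sketch only models 1-based positions (pos >= 1)")
    if length < 0:
        # A negative length means: replace as many characters as `replace` has.
        length = len(replace)
    start = pos - 1  # convert the 1-based position to a 0-based index
    return src[:start] + replace + src[start + length:]


# Hypothetical sample values standing in for one row of "year" and "month".
year, month = "2020", "07"

print(overlay_str(year, month, 3, 2))         # 2007
print(overlay_str("SPARK_SQL", "CORE", 7))    # SPARK_CORE
```

With pos=3 and len=2, the last two characters of the year are swapped for the month, so the column expression in the test above yields values of that shape for each row.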


References

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html

 

pyspark.sql module — PySpark 3.0.0 documentation
