2020/07/03 - [Data Science/Spark] - [Spark] Using Spark in Colab (pyspark)
โ transform
transform(func)
func : ํจ์
dataframe์ ํ๋ผ๋ฏธํฐ๋ก ๋ฐ์ dataframe์ ๋ฐํํ๋ ํจ์๋ฅผ ๊ฐ๋จํ๊ฒ ์ฌ์ฉํ ์ ์๋๋ก ๋์๋ค.
ํนํ, 2๊ฐ์ด์์ ํจ์๋ฅผ ํ๋ฒ์ ์ฌ์ฉํ ์ ์๊ฒ ๋์ด ์ ์ฉํ ๊ฒ์ผ๋ก ๋ณด์ธ๋ค.
#ํ
์คํธ ๋ฐ์ดํฐํ๋ ์ ์ค๋น
test_df = spark.createDataFrame([(1, 1.0), (2, 2.0)], ["int", "float"])
test_df.show()
test_df.printSchema()
def cast_all_to_int(input_df):
    return input_df.select([func.col(col_name).cast("int") for col_name in input_df.columns])
# test_df.select([func.col(col_name).cast("int") for col_name in test_df.columns]).show()

def sort_columns_asc(input_df):
    return input_df.select(*sorted(input_df.columns))
#์ด์
as_was_df = cast_all_to_int(test_df)
as_was_df.show()
as_was_df.printSchema()
test_df.transform(cast_all_to_int).show()
test_df.transform(cast_all_to_int).printSchema()
๊ฒฐ๊ณผ๊ฐ ๊ฐ๋ค.
def cast_all_to_int(input_df):
    return input_df.select([func.col(col_name).cast("int") for col_name in input_df.columns])

def sort_columns_asc(input_df):
    return input_df.select(*sorted(input_df.columns))
# ์ด์ ๋ฐฉ์
as_was_df = cast_all_to_int(test_df)
as_was_df = sort_columns_asc(as_was_df)
as_was_df.show()
as_was_df.printSchema()
# transform์ ์ฉ
test_df.transform(cast_all_to_int).transform(sort_columns_asc).show()
test_df.transform(cast_all_to_int).transform(sort_columns_asc).printSchema()
#๋์ผํ ๊ฒฐ๊ณผ
โ overlay
- overlay(col1, col2, pos, len) : replaces len characters of col1, starting at 1-based position pos, with the contents of col2 (available since Spark 3.0)
# overlay test
df = spark.createDataFrame([("2020", "07")], ["year", "month"])  # sample data (assumed; the original df is not shown)
df.show()
df.select(func.overlay("year", "month", 3, 2)).show()
์ฐธ๊ณ
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html
๋๊ธ