[Spark] Using Spark on Colab (pyspark)

by 홍후추 2020. 7. 3.

 

 

#2020.07.06 Originally written for spark3.0 preview2, now revised (spark3.0-preview2->spark3.0)

This post is written for spark3.0 / hadoop3.2.

 

1) Open Colab

https://colab.research.google.com/

 

2) Create a new notebook

 

3) Install openjdk8

!apt-get install -y openjdk-8-jdk-headless

 

4) Download the spark3.0 (hadoop3.2) tar

!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

 

5) Extract the archive

!tar -xvf spark-3.0.0-bin-hadoop3.2.tgz
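
The download, extract, and SPARK_HOME steps all repeat the same version string, so it is easy for the downloaded file and the configured path to drift apart. A minimal sketch of deriving all three names in one place is shown here. It assumes a hypothetical helper `spark_release_paths` (not part of Spark or findspark), a placeholder mirror URL, and that the notebook's working directory is `/content`:

```python
import posixpath

def spark_release_paths(spark_version, hadoop_version, base_url, workdir="/content"):
    """Derive the archive name, download URL and SPARK_HOME for one release.

    Hypothetical helper for illustration only. base_url is whatever mirror
    directory the caller trusts; it is not validated here.
    """
    name = f"spark-{spark_version}-bin-hadoop{hadoop_version}"
    archive = name + ".tgz"
    url = f"{base_url.rstrip('/')}/spark-{spark_version}/{archive}"
    return {
        "archive": archive,                           # file passed to tar
        "url": url,                                   # file fetched by wget
        "spark_home": posixpath.join(workdir, name),  # directory created by tar
    }

# https://example.org/dist/spark is a placeholder, not a real mirror.
paths = spark_release_paths("3.0.0", "3.2", "https://example.org/dist/spark")
print(paths["archive"])
print(paths["spark_home"])
```

Changing only the two version arguments then keeps the archive name and the SPARK_HOME path consistent with each other.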

 

6) Install findspark

!pip install findspark

 

7) Set the environment variables

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"
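
If either path is wrong, the failure usually surfaces later, when the session is created, and the message may not point back to the path. A minimal sketch of a guard that checks both directories before setting anything is shown here. `set_spark_env` is a hypothetical helper introduced for illustration, not something findspark or pyspark provides:

```python
import os

def set_spark_env(java_home, spark_home, environ=None):
    """Set JAVA_HOME and SPARK_HOME, but only if both directories exist.

    Hypothetical helper; `environ` defaults to os.environ and can be
    replaced by a plain dict for testing.
    """
    environ = os.environ if environ is None else environ
    # Validate both paths first so a failure leaves the environment untouched.
    for label, path in (("JAVA_HOME", java_home), ("SPARK_HOME", spark_home)):
        if not os.path.isdir(path):
            raise FileNotFoundError(f"{label} directory not found: {path}")
    environ["JAVA_HOME"] = java_home
    environ["SPARK_HOME"] = spark_home
    return environ

# In the notebook, with the paths from this post:
# set_spark_env("/usr/lib/jvm/java-8-openjdk-amd64",
#               "/content/spark-3.0.0-bin-hadoop3.2")
```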

 

8) Initialize findspark and create a SparkSession

import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark3_test").master("local[*]").getOrCreate()
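
The `master("local[*]")` argument tells Spark to run locally with one worker thread per logical core; `local[N]` asks for N threads and plain `local` for one. Spark parses this string itself. The sketch below only mirrors that meaning for the three local forms, using a hypothetical helper `local_worker_threads` and assuming `os.cpu_count()` is a fair approximation of the core count the JVM sees:

```python
import os
import re

def local_worker_threads(master):
    """Return how many worker threads a local master URL asks for.

    Covers only "local", "local[N]" and "local[*]"; any other master
    URL (for example a cluster URL) returns None.
    """
    if master == "local":
        return 1
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        return None
    if match.group(1) == "*":
        # Approximation: the JVM may report a different core count.
        return os.cpu_count() or 1
    return int(match.group(1))

print(local_worker_threads("local[*]"))
```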

 

9) Check the spark version

spark.version
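
Reading `spark.version` shows that the session object exists, but it does not run a job. A minimal sketch of a slightly stronger check, using a hypothetical helper `smoke_test` that takes the session from the previous step and counts the rows of a tiny generated DataFrame:

```python
def smoke_test(spark, n=5):
    """Run one tiny job on an existing SparkSession.

    Hypothetical helper; returns (version string, row count) so both
    can be inspected in the notebook.
    """
    df = spark.range(n)  # DataFrame with a single `id` column: 0 .. n-1
    return spark.version, df.count()

# In the notebook, after the session has been created:
# version, rows = smoke_test(spark)
```

If the count comes back equal to `n`, the JVM, SPARK_HOME, and the Python bindings are all working together.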

 

๋‹ค์Œ ํฌ์ŠคํŠธ๋Š” spark3.0์— ๋Œ€ํ•œ ์ถ”๊ฐ€๊ธฐ๋Šฅ์— ๋Œ€ํ•œ ํ…Œ์ŠคํŠธ์˜ˆ์ •

 

 

reference

- https://colab.research.google.com/drive/1EcotODzgSnLozSH3hDuBfZr06gJXY8I0#scrollTo=zgReRGl0y23D
