How to link PyCharm with PySpark?
I'm new to Apache Spark and apparently I installed apache-spark with Homebrew on my MacBook:
Last login: Fri Jan 8 12:52:04 on console
user@MacBook-Pro-de-User-2:~$ pyspark
Python 2.7.10 (default, Jul 13 2015, 12:05:58)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/01/08 14:46:44 INFO SparkContext: Running Spark version 1.5.1
16/01/08 14:46:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/08 14:46:47 INFO SecurityManager: Changing view acls to: user
16/01/08 14:46:47 INFO SecurityManager: Changing modify acls to: user
16/01/08 14:46:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify permissions: Set(user)
16/01/08 14:46:50 INFO Slf4jLogger: Slf4jLogger started
16/01/08 14:46:50 INFO Remoting: Starting remoting
16/01/08 14:46:51 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.64:50199]
16/01/08 14:46:51 INFO Utils: Successfully started service 'sparkDriver' on port 50199.
16/01/08 14:46:51 INFO SparkEnv: Registering MapOutputTracker
16/01/08 14:46:51 INFO SparkEnv: Registering BlockManagerMaster
16/01/08 14:46:51 INFO DiskBlockManager: Created local directory at /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/blockmgr-769e6f91-f0e7-49f9-b45d-1b6382637c95
16/01/08 14:46:51 INFO MemoryStore: MemoryStore started with capacity 530.0 MB
16/01/08 14:46:52 INFO HttpFileServer: HTTP File server directory is /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/spark-8e4749ea-9ae7-4137-a0e1-52e410a8e4c5/httpd-1adcd424-c8e9-4e54-a45a-a735ade00393
16/01/08 14:46:52 INFO HttpServer: Starting HTTP Server
16/01/08 14:46:52 INFO Utils: Successfully started service 'HTTP file server' on port 50200.
16/01/08 14:46:52 INFO SparkEnv: Registering OutputCommitCoordinator
16/01/08 14:46:52 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/01/08 14:46:52 INFO SparkUI: Started SparkUI at http://192.168.1.64:4040
16/01/08 14:46:53 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/01/08 14:46:53 INFO Executor: Starting executor ID driver on host localhost
16/01/08 14:46:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50201.
16/01/08 14:46:53 INFO NettyBlockTransferService: Server created on 50201
16/01/08 14:46:53 INFO BlockManagerMaster: Trying to register BlockManager
16/01/08 14:46:53 INFO BlockManagerMasterEndpoint: Registering block manager localhost:50201 with 530.0 MB RAM, BlockManagerId(driver, localhost, 50201)
16/01/08 14:46:53 INFO BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Python version 2.7.10 (default, Jul 13 2015 12:05:58)
SparkContext available as sc, HiveContext available as sqlContext.
>>>
I want to start playing around in order to learn more about MLlib. However, I use PyCharm to write my Python scripts. The problem is: when I go to PyCharm and try to call pyspark, PyCharm cannot find the module. I tried adding the path to PyCharm as follows:
Then, following a blog post, I tried this:
import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME']="/Users/user/Apps/spark-1.5.2-bin-hadoop2.4"

# Append pyspark to Python Path
sys.path.append("/Users/user/Apps/spark-1.5.2-bin-hadoop2.4/python/pyspark")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print ("Successfully imported Spark Modules")

except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)
And still I cannot get PySpark working with PyCharm. Any idea how to "link" PyCharm with apache-pyspark?
Update:
Then I looked up the apache-spark and python paths in order to set the environment variables in PyCharm:
apache-spark path:
user@MacBook-Pro-User-2:~$ brew info apache-spark
apache-spark: stable 1.6.0, HEAD
Engine for large-scale data processing
https://spark.apache.org/
/usr/local/Cellar/apache-spark/1.5.1 (649 files, 302.9M) *
  Poured from bottle
From: https://github.com/Homebrew/homebrew/blob/master/Library/Formula/apache-spark.rb
python path:
user@MacBook-Pro-User-2:~$ brew info python
python: stable 2.7.11 (bottled), HEAD
Interpreted, interactive, object-oriented programming language
https://www.python.org
/usr/local/Cellar/python/2.7.10_2 (4,965 files, 66.9M) *
Then with the above information I tried to set the environment variables as follows:
Any idea how to correctly link PyCharm with pyspark?
Then when I run a Python script with the above configuration I get this exception:
/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/spark_examples/test_1.py
Traceback (most recent call last):
  File "/Users/user/PycharmProjects/spark_examples/test_1.py", line 1, in <module>
    from pyspark import SparkContext
ImportError: No module named pyspark
UPDATE: Then I tried the configurations proposed by @zero323:
Configuration 1:
/usr/local/Cellar/apache-spark/1.5.1/
Out:
user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1$ ls
CHANGES.txt           NOTICE      libexec/
INSTALL_RECEIPT.json  README.md
LICENSE               bin/
Configuration 2:
/usr/local/Cellar/apache-spark/1.5.1/libexec
Out:
user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1/libexec$ ls
R/        bin/    data/   examples/  python/
RELEASE   conf/   ec2/    lib/       sbin/
Create a Run configuration:
- Go to Run -> Edit Configurations
- Add a new Python configuration
- Set the Script path so it points to the script you want to execute
- Edit the Environment variables field so that it contains at least:
  - SPARK_HOME - it should point to the directory with the Spark installation. It should contain directories such as bin (with spark-submit, spark-shell, etc.) and conf (with spark-defaults.conf, spark-env.sh, etc.)
  - PYTHONPATH - it should contain $SPARK_HOME/python and, optionally, $SPARK_HOME/python/lib/py4j-some-version.src.zip if it is not otherwise available. some-version should match the Py4J version used by the given Spark installation (0.8.2.1 - 1.5, 0.9 - 1.6.0)
- Apply the settings
Add the PySpark library to the interpreter path (required for code completion):
- Go to File -> Settings -> Project Interpreter
- Open the settings for the interpreter you want to use with Spark
- Edit the interpreter paths so they include the path to $SPARK_HOME/python (and to Py4J if required)
- Save the settings
Use the newly created configuration to run your script, for example a minimal smoke test like the sketch below.
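Purely as an illustration (not part of @zero323's answer; the app name and sample data are arbitrary), a script of this shape should run and print 10 once SPARK_HOME and PYTHONPATH are set in the Run configuration:

from pyspark import SparkConf, SparkContext

# Relies on SPARK_HOME and PYTHONPATH being provided by the Run configuration.
conf = SparkConf().setMaster("local[*]").setAppName("pycharm-smoke-test")
sc = SparkContext(conf=conf)

# Trivial job: if this prints 10, PyCharm and PySpark are wired up correctly.
print(sc.parallelize(range(10)).count())

sc.stop()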
Spark 2.2.0 and later:
With SPARK-1267 merged, you should be able to simplify the process by pip-installing Spark into the environment you use for PyCharm development.
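For example (my own sketch, assuming pyspark has been installed into the project interpreter, e.g. with pip install pyspark, so no manual path handling is needed):

# With a pip-installed pyspark there is no SPARK_HOME / PYTHONPATH juggling.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("pycharm-pip-test") \
    .getOrCreate()

print(spark.range(100).count())  # should print 100
spark.stop()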
Here is how I solved this on Mac OS X.
- brew install apache-spark
- Add this to ~/.bash_profile:
export SPARK_VERSION=`ls /usr/local/Cellar/apache-spark/ | sort | tail -1`
export SPARK_HOME="/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec"
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
- Add pyspark and py4j to the content root (use the correct Spark version), as listed below:
/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip
/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip
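To confirm the content root took effect (an illustrative check of my own, not part of the original answer), import pyspark and print where it was loaded from; the printed path should point into the apache-spark installation:

# If the content root / PYTHONPATH is set up correctly, this import succeeds
# and the printed path points into /usr/local/Cellar/apache-spark/...
import pyspark
from pyspark import SparkContext

print(pyspark.__file__)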
Here is the setup that works for me.
Set up IntelliSense:
Click File -> Settings -> Project: -> Project Interpreter
Click the gear icon to the right of the Project Interpreter dropdown
Click More... from the context menu
Select the interpreter, then click the "Show Paths" icon (bottom right)
Click the + icon to add the following two paths:
\python\lib\py4j-0.9-src.zip
\bin\python\lib\pyspark.zip
Click OK, OK, OK.
Go ahead and test your new IntelliSense capabilities.
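For example (a toy snippet of my own, names arbitrary), typing the code below in the editor should now offer completion for SparkConf and SparkContext members:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[2]").setAppName("intellisense-check")
sc = SparkContext(conf=conf)
print(sc.version)  # completion should suggest version, parallelize, etc.
sc.stop()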
Configure pyspark in PyCharm (Windows):
File menu - settings - project interpreter - (gearshape) - more - (treebelowfunnel) - (+) - [add python folder from spark installation and then py4j-*.zip] - click ok
Make sure SPARK_HOME is set in the Windows environment; PyCharm will pick it up from there. To confirm:
Run menu - edit configurations - environment variables - [...] - show
Optionally, SPARK_CONF_DIR can be set in the environment variables as well.
I used the following page as a reference and was able to get pyspark/Spark 1.6.1 (installed via Homebrew) imported in PyCharm 5.
http://renien.com/blog/accessing-pyspark-pycharm/
import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME']="/usr/local/Cellar/apache-spark/1.6.1"

# Append pyspark to Python Path
sys.path.append("/usr/local/Cellar/apache-spark/1.6.1/libexec/python")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print ("Successfully imported Spark Modules")

except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)
With the above, pyspark loads, but I got a gateway error when I tried to create a SparkContext. There is some issue with the Spark from Homebrew, so I just grabbed Spark from the Spark website (download the version pre-built for Hadoop 2.6 and later) and pointed to the spark and py4j directories under it. Here is the code in PyCharm that works!
import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME']="/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6"

# Need to Explicitly point to python3 if you are using Python 3.x
os.environ['PYSPARK_PYTHON']="/usr/local/Cellar/python3/3.5.1/bin/python3"

#You might need to enter your local IP
#os.environ['SPARK_LOCAL_IP']="192.168.2.138"

#Path for pyspark and py4j
sys.path.append("/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6/python")
sys.path.append("/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print ("Successfully imported Spark Modules")

except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)

sc = SparkContext('local')
words = sc.parallelize(["scala","java","hadoop","spark","akka"])
print(words.count())
I got a lot of help from these instructions, which helped me troubleshoot in PyDev and then get it working in PyCharm - https://enahwe.wordpress.com/2015/11/25/how-to-configure-eclipse-for-developing-with-python-and-spark-on-hadoop/
I'm sure somebody has spent a few hours banging their head against their monitor trying to get this to work, so hopefully this helps save their sanity!
Check out this video.
Assume your spark python directory is: /home/user/spark/python
Assume your Py4j source is: /home/user/spark/python/lib/py4j-0.9-src.zip
Basically you add the spark python directory and the py4j directory within it to the interpreter paths. I don't have enough reputation to post a screenshot or I would.
In the video, the user creates a virtual environment within PyCharm itself; however, you can create the virtual environment outside of PyCharm, or activate a pre-existing one, then start PyCharm with it and add those paths to the virtual environment's interpreter paths from within PyCharm.
I used other methods to add Spark via the bash environment variables, which works great outside of PyCharm, but for some reason they weren't recognized within PyCharm; this method worked perfectly.
You need to set up PYTHONPATH and SPARK_HOME before launching the IDE or Python.
On Windows, edit the environment variables and add the spark python and py4j paths:
PYTHONPATH=%PYTHONPATH%;{py4j};{spark python}
On Unix:
export PYTHONPATH=${PYTHONPATH}:{py4j}:{spark/python}
I followed an online tutorial and added the env variables to .bashrc:
# add pyspark to python
export SPARK_HOME=/home/lolo/spark-1.6.1
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
Then I just grabbed the values of SPARK_HOME and PYTHONPATH to give to PyCharm:
(srz-reco)lolo@K:~$ echo $SPARK_HOME
/home/lolo/spark-1.6.1
(srz-reco)lolo@K:~$ echo $PYTHONPATH
/home/lolo/spark-1.6.1/python/lib/py4j-0.9-src.zip:/home/lolo/spark-1.6.1/python/:/home/lolo/spark-1.6.1/python/lib/py4j-0.9-src.zip:/home/lolo/spark-1.6.1/python/:/python/lib/py4j-0.8.2.1-src.zip:/python/:
Then I copied these values into the script's Run/Debug Configuration -> Environment variables.
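To double-check from inside PyCharm that the copied values actually reach the script (a small sketch of my own, not from the tutorial), run something like this under that configuration:

import os

# Print what the Run/Debug configuration actually passes to the process.
print("SPARK_HOME = " + str(os.environ.get("SPARK_HOME")))
print("PYTHONPATH = " + str(os.environ.get("PYTHONPATH")))

# If both look right, the pyspark import should succeed as well.
from pyspark import SparkContext
print("pyspark import OK")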
From the documentation:
To run Spark applications in Python, use the bin/spark-submit script located in the Spark directory. This script will load Spark's Java/Scala libraries and allow you to submit applications to a cluster. You can also use bin/pyspark to launch an interactive Python shell.
You are invoking your script directly with the CPython interpreter, which I think is what is causing the problems.
Try running your script with:
"${SPARK_HOME}"/bin/spark-submit test_1.py
If that works, you should be able to get it working in PyCharm by setting the project's interpreter to spark-submit.
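For reference, a minimal test_1.py of my own (hypothetical contents, since the question never shows the full script) that is handy for verifying the spark-submit route:

# test_1.py - run with: "${SPARK_HOME}"/bin/spark-submit test_1.py
from pyspark import SparkContext

sc = SparkContext(appName="spark-submit-check")
rdd = sc.parallelize([1, 2, 3, 4, 5])
print("sum = " + str(rdd.sum()))  # expected output: sum = 15
sc.stop()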
The simplest way is:
Go to the site-packages folder of your anaconda/python installation and copy-paste the pyspark and pyspark.egg-info folders there.
Restart PyCharm to update the index. The two folders mentioned above are found in the spark/python folder of your Spark installation. This way you will also get code completion suggestions from PyCharm.
The site-packages folder can easily be found in your python installation. In Anaconda it is under anaconda/lib/pythonx.x/site-packages.
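If you are unsure where that folder is, a generic Python snippet (my own addition, not specific to this answer) prints it for whichever interpreter you run it with:

# Prints the site-packages directory where the pyspark and pyspark.egg-info
# folders should be copied.
from distutils.sysconfig import get_python_lib
print(get_python_lib())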