How to link PyCharm with PySpark?

I'm new to Apache Spark, and I've just installed apache-spark with Homebrew on my MacBook:

    Last login: Fri Jan 8 12:52:04 on console
    user@MacBook-Pro-de-User-2:~$ pyspark
    Python 2.7.10 (default, Jul 13 2015, 12:05:58)
    [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    16/01/08 14:46:44 INFO SparkContext: Running Spark version 1.5.1
    16/01/08 14:46:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    16/01/08 14:46:47 INFO SecurityManager: Changing view acls to: user
    16/01/08 14:46:47 INFO SecurityManager: Changing modify acls to: user
    16/01/08 14:46:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify permissions: Set(user)
    16/01/08 14:46:50 INFO Slf4jLogger: Slf4jLogger started
    16/01/08 14:46:50 INFO Remoting: Starting remoting
    16/01/08 14:46:51 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.64:50199]
    16/01/08 14:46:51 INFO Utils: Successfully started service 'sparkDriver' on port 50199.
    16/01/08 14:46:51 INFO SparkEnv: Registering MapOutputTracker
    16/01/08 14:46:51 INFO SparkEnv: Registering BlockManagerMaster
    16/01/08 14:46:51 INFO DiskBlockManager: Created local directory at /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/blockmgr-769e6f91-f0e7-49f9-b45d-1b6382637c95
    16/01/08 14:46:51 INFO MemoryStore: MemoryStore started with capacity 530.0 MB
    16/01/08 14:46:52 INFO HttpFileServer: HTTP File server directory is /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/spark-8e4749ea-9ae7-4137-a0e1-52e410a8e4c5/httpd-1adcd424-c8e9-4e54-a45a-a735ade00393
    16/01/08 14:46:52 INFO HttpServer: Starting HTTP Server
    16/01/08 14:46:52 INFO Utils: Successfully started service 'HTTP file server' on port 50200.
    16/01/08 14:46:52 INFO SparkEnv: Registering OutputCommitCoordinator
    16/01/08 14:46:52 INFO Utils: Successfully started service 'SparkUI' on port 4040.
    16/01/08 14:46:52 INFO SparkUI: Started SparkUI at http://192.168.1.64:4040
    16/01/08 14:46:53 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
    16/01/08 14:46:53 INFO Executor: Starting executor ID driver on host localhost
    16/01/08 14:46:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50201.
    16/01/08 14:46:53 INFO NettyBlockTransferService: Server created on 50201
    16/01/08 14:46:53 INFO BlockManagerMaster: Trying to register BlockManager
    16/01/08 14:46:53 INFO BlockManagerMasterEndpoint: Registering block manager localhost:50201 with 530.0 MB RAM, BlockManagerId(driver, localhost, 50201)
    16/01/08 14:46:53 INFO BlockManagerMaster: Registered BlockManager
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
          /_/

    Using Python version 2.7.10 (default, Jul 13 2015 12:05:58)
    SparkContext available as sc, HiveContext available as sqlContext.
    >>>

I want to start playing around with it to learn more about MLlib. However, I use PyCharm to write my Python scripts. The problem: when I go to PyCharm and try to call pyspark, PyCharm can't find the module. I tried adding the path to PyCharm as follows:

[screenshot: cannot link pycharm with spark]

Then I tried this, from a blog post:

    import os
    import sys

    # Path for spark source folder
    os.environ['SPARK_HOME']="/Users/user/Apps/spark-1.5.2-bin-hadoop2.4"

    # Append pyspark to Python Path
    sys.path.append("/Users/user/Apps/spark-1.5.2-bin-hadoop2.4/python/pyspark")

    try:
        from pyspark import SparkContext
        from pyspark import SparkConf
        print ("Successfully imported Spark Modules")

    except ImportError as e:
        print ("Can not import Spark Modules", e)
        sys.exit(1)

Still no luck using PySpark with PyCharm. Any idea how to "link" PyCharm with apache-pyspark?

Update:

Then I looked up the apache-spark and python paths in order to set PyCharm's environment variables:

apache-spark path:

    user@MacBook-Pro-User-2:~$ brew info apache-spark
    apache-spark: stable 1.6.0, HEAD
    Engine for large-scale data processing
    https://spark.apache.org/
    /usr/local/Cellar/apache-spark/1.5.1 (649 files, 302.9M) *
      Poured from bottle
    From: https://github.com/Homebrew/homebrew/blob/master/Library/Formula/apache-spark.rb

python path:

    user@MacBook-Pro-User-2:~$ brew info python
    python: stable 2.7.11 (bottled), HEAD
    Interpreted, interactive, object-oriented programming language
    https://www.python.org
    /usr/local/Cellar/python/2.7.10_2 (4,965 files, 66.9M) *

Then with the information above I tried setting the environment variables as follows:

[screenshot: configuration 1]

Any idea how to correctly link PyCharm with pyspark?

Then, when I run a python script with the configuration above, I get this exception:

    /usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/spark_examples/test_1.py
    Traceback (most recent call last):
      File "/Users/user/PycharmProjects/spark_examples/test_1.py", line 1, in <module>
        from pyspark import SparkContext
    ImportError: No module named pyspark

UPDATE: Then I tried the configurations proposed by @zero323

Configuration 1:

 /usr/local/Cellar/apache-spark/1.5.1/ 

[screenshot: conf 1]

Output:

    user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1$ ls
    CHANGES.txt           NOTICE      libexec/
    INSTALL_RECEIPT.json  README.md
    LICENSE               bin/

Configuration 2:

 /usr/local/Cellar/apache-spark/1.5.1/libexec 


Output:

    user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1/libexec$ ls
    R/        bin/   data/  examples/  python/
    RELEASE   conf/  ec2/   lib/       sbin/

Create a Run configuration:

  1. Go to Run -> Edit Configurations
  2. Add a new Python configuration
  3. Set the Script path so it points to the script you want to execute
  4. Edit the Environment variables field so it contains at least:

    • SPARK_HOME - it should point to the directory where Spark is installed. It should contain directories such as bin (with spark-submit, spark-shell, etc.) and conf (with spark-defaults.conf, spark-env.sh, etc.).
    • PYTHONPATH - it should contain $SPARK_HOME/python and, optionally, $SPARK_HOME/python/lib/py4j-some-version.src.zip if it is not available otherwise. some-version should match the Py4J version used by the given Spark installation (0.8.2.1 for Spark 1.5, 0.9 for 1.6.0). One way to derive these two entries is sketched just after this list.


  5. Apply the settings
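
If you are unsure which two entries to paste into PYTHONPATH in step 4, the small sketch below (my own addition, assuming SPARK_HOME is already set for the run configuration) derives and prints them:

    import glob
    import os

    # Prints the $SPARK_HOME/python directory and the bundled py4j zip, if any
    spark_home = os.environ["SPARK_HOME"]
    print(os.path.join(spark_home, "python"))
    matches = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
    print(matches[0] if matches else "no py4j zip found under $SPARK_HOME/python/lib")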

Add the PySpark library to the interpreter path (required for code completion):

  1. Go to File -> Settings -> Project Interpreter
  2. Open the settings for the interpreter you want to use with Spark
  3. Edit the interpreter paths so they contain the path to $SPARK_HOME/python (and to Py4J if required)
  4. Save the settings

Use the newly created configuration to run your script.
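
If everything is wired correctly, a tiny script like the one below (a sketch of my own; the master setting and the numbers are arbitrary) should run under that configuration and print a result instead of "ImportError: No module named pyspark":

    # A hypothetical check script to point the run configuration at
    import os
    from pyspark import SparkConf, SparkContext

    print("SPARK_HOME = " + str(os.environ.get("SPARK_HOME")))
    sc = SparkContext(conf=SparkConf().setMaster("local[*]").setAppName("run-config-check"))
    print(sc.parallelize(range(100)).map(lambda x: x * x).sum())
    sc.stop()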

Spark 2.2.0 and later

With SPARK-1267 merged, you should be able to simplify this process by installing Spark in the environment you use for PyCharm development.
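
For example, after pip install pyspark in the interpreter PyCharm uses (pyspark has been on PyPI since Spark 2.2.0), a minimal sketch like the following should run without any path tweaking; the app name and master are arbitrary choices of mine:

    # Assumes `pip install pyspark` was run in the interpreter PyCharm uses (Spark 2.2.0+)
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").appName("pycharm-pip-test").getOrCreate()
    print(spark.range(1000).count())  # should print 1000
    spark.stop()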

Here's how I solved this on Mac OS X.

  1. brew install apache-spark
  2. Add this to ~/.bash_profile (a rough Python equivalent is sketched after this list):

     export SPARK_VERSION=`ls /usr/local/Cellar/apache-spark/ | sort | tail -1`
     export SPARK_HOME="/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec"
     export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
     export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
  3. Add pyspark and py4j to the content root (use the correct Spark version):

      /usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip
      /usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip

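If PyCharm does not inherit your shell environment, a rough Python equivalent of the .bash_profile lines above can go at the top of a script; this is a sketch that assumes the Homebrew layout and a bundled py4j zip:

    import glob
    import os
    import sys

    # Pick the newest Homebrew-installed Spark, mirroring `ls ... | sort | tail -1`
    versions = sorted(glob.glob("/usr/local/Cellar/apache-spark/*"))
    spark_home = os.path.join(versions[-1], "libexec")
    os.environ["SPARK_HOME"] = spark_home

    # The same two additions the exports above make to PYTHONPATH
    sys.path.insert(0, os.path.join(spark_home, "python"))
    sys.path.insert(0, glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0])

    from pyspark import SparkContext  # should now import cleanly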

Here's the setup that works for me.

Set up Intellisense:

  1. Click File -> Settings -> Project: -> Project Interpreter

  2. Click the gear icon to the right of the Project Interpreter dropdown

  3. Click More... from the context menu

  4. Choose the interpreter, then click the "Show paths" icon (bottom right)

  5. Click the + icon to add the following two paths:

    \python\lib\py4j-0.9-src.zip

    \bin\python\lib\pyspark.zip

  6. Click OK, OK, OK

Go ahead and test your new Intellisense capabilities.
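
A quick way to exercise it is a throwaway script like this (a sketch; the app name and master are arbitrary):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("completion-test").setMaster("local[2]")
    sc = SparkContext(conf=conf)
    print(sc.parallelize([1, 2, 3, 4, 5]).map(lambda x: x * 10).collect())  # [10, 20, 30, 40, 50]
    sc.stop()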

Configuring pyspark in pycharm (Windows)

    File menu - settings - project interpreter - (gearshape) - more - (treebelowfunnel) - (+) - [add python folder from spark installation and then py4j-*.zip] - click ok

Make sure SPARK_HOME is set in the Windows environment; pycharm will pick it up from there. To confirm:

 Run menu - edit configurations - environment variables - [...] - show 

Optionally, SPARK_CONF_DIR can also be set in the environment variables.
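
To confirm from code that PyCharm really sees what was set at the OS level, a two-line check is enough (SPARK_CONF_DIR prints None unless you set it):

    import os

    print(os.environ.get("SPARK_HOME"))
    print(os.environ.get("SPARK_CONF_DIR"))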

I used the page below as a reference and was able to get pyspark/Spark 1.6.1 (installed via homebrew) imported in PyCharm 5.

http://renien.com/blog/accessing-pyspark-pycharm/

    import os
    import sys

    # Path for spark source folder
    os.environ['SPARK_HOME']="/usr/local/Cellar/apache-spark/1.6.1"

    # Append pyspark to Python Path
    sys.path.append("/usr/local/Cellar/apache-spark/1.6.1/libexec/python")

    try:
        from pyspark import SparkContext
        from pyspark import SparkConf
        print ("Successfully imported Spark Modules")

    except ImportError as e:
        print ("Can not import Spark Modules", e)
        sys.exit(1)

With the above, pyspark loads, but I get a gateway error when I try to create a SparkContext. There's some issue with the Spark from homebrew, so I just grabbed Spark from the Spark website (download the build pre-built for Hadoop 2.6 and later) and pointed at the spark and py4j directories under it. Here's the code in pycharm that works!

    import os
    import sys

    # Path for spark source folder
    os.environ['SPARK_HOME']="/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6"

    # Need to Explicitly point to python3 if you are using Python 3.x
    os.environ['PYSPARK_PYTHON']="/usr/local/Cellar/python3/3.5.1/bin/python3"

    #You might need to enter your local IP
    #os.environ['SPARK_LOCAL_IP']="192.168.2.138"

    #Path for pyspark and py4j
    sys.path.append("/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6/python")
    sys.path.append("/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip")

    try:
        from pyspark import SparkContext
        from pyspark import SparkConf
        print ("Successfully imported Spark Modules")
    except ImportError as e:
        print ("Can not import Spark Modules", e)
        sys.exit(1)

    sc = SparkContext('local')
    words = sc.parallelize(["scala","java","hadoop","spark","akka"])
    print(words.count())

I got a lot of help from these instructions, which helped me troubleshoot in PyDev and then get it working in PyCharm - https://enahwe.wordpress.com/2015/11/25/how-to-configure-eclipse-for-developing-with-python-and-spark-on-hadoop/

I'm sure somebody has spent a few hours banging their head against their monitor trying to get this to work, so hopefully this helps save their sanity!

Check out this video.

Assume your spark python directory is: /home/user/spark/python

Assume your Py4j source is: /home/user/spark/python/lib/py4j-0.9-src.zip

Basically you add the spark python directory and the py4j directory inside it to the interpreter paths. I don't have enough reputation to post a screenshot or I would.

In the video, the user creates a virtual environment inside pycharm itself; however, you can create the virtual environment outside of pycharm, or activate a pre-existing one, then start pycharm with it and add those paths to the virtual environment's interpreter paths from within pycharm.

I used other methods of adding spark via the bash environment variables, which works great outside of pycharm, but for some reason they weren't recognized inside pycharm; this method worked perfectly.
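
For reference, the same two additions can also be made programmatically at the top of a script instead of in the interpreter paths dialog; this sketch simply reuses the example locations above:

    import sys

    # The example locations from above; adjust them to your own Spark install
    sys.path.append("/home/user/spark/python")
    sys.path.append("/home/user/spark/python/lib/py4j-0.9-src.zip")

    import pyspark
    print(pyspark.__file__)  # shows which copy of pyspark was picked up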

You need to set up PYTHONPATH and SPARK_HOME before you launch the IDE or Python.

On Windows, edit the environment variables and add the spark python and py4j directories to

 PYTHONPATH=%PYTHONPATH%;{py4j};{spark python} 

On Unix:

    export PYTHONPATH=${PYTHONPATH}:{py4j}:{spark/python}
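
Note that the separator differs between the two platforms (; on Windows, : on Unix); if you ever build PYTHONPATH from Python instead, os.pathsep gives you the right one. A small illustrative sketch with placeholder paths of my own:

    import os

    # Placeholder locations standing in for {py4j} and {spark python}
    entries = ["/opt/spark/python/lib/py4j-0.9-src.zip", "/opt/spark/python"]
    print(os.pathsep.join(entries))  # joined with ':' on Unix and ';' on Windows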

I followed an online tutorial and added the env variables to .bashrc:

    # add pyspark to python
    export SPARK_HOME=/home/lolo/spark-1.6.1
    export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
    export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH

Then I simply echoed the values of SPARK_HOME and PYTHONPATH to use in pycharm:

    (srz-reco)lolo@K:~$ echo $SPARK_HOME
    /home/lolo/spark-1.6.1
    (srz-reco)lolo@K:~$ echo $PYTHONPATH
    /home/lolo/spark-1.6.1/python/lib/py4j-0.9-src.zip:/home/lolo/spark-1.6.1/python/:/home/lolo/spark-1.6.1/python/lib/py4j-0.9-src.zip:/home/lolo/spark-1.6.1/python/:/python/lib/py4j-0.8.2.1-src.zip:/python/:

Then I copied these values into the script's Run/Debug Configuration -> Environment variables.

From the documentation:

To run Spark applications in Python, use the bin/spark-submit script located in the Spark directory. This script will load Spark's Java/Scala libraries and allow you to submit applications to a cluster. You can also use bin/pyspark to launch an interactive Python shell.

You are invoking your script directly with the CPython interpreter, which I believe is what's causing the problem.

Try running your script with:

 "${SPARK_HOME}"/bin/spark-submit test_1.py 

If that works, you should be able to get it working in PyCharm by setting the project's interpreter to spark-submit.
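
The contents of test_1.py are not shown in the question; a minimal stand-in that spark-submit can run would look like this (purely an illustrative assumption):

    # test_1.py -- minimal stand-in; spark-submit supplies the pyspark modules itself
    from pyspark import SparkConf, SparkContext

    sc = SparkContext(conf=SparkConf().setAppName("test_1"))
    print(sc.parallelize(["scala", "java", "hadoop", "spark", "akka"]).count())
    sc.stop()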

The easiest way is:

Go to the site-packages folder of your anaconda/python installation and copy-paste the pyspark and pyspark.egg-info folders there.

Restart pycharm to update the index. The two folders mentioned above can be found in the spark/python folder of your Spark installation. This way you'll also get code completion suggestions from pycharm.

site-packages is easy to find in your python installation. In anaconda it's under anaconda/lib/pythonx.x/site-packages.
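
If you're not sure where site-packages lives, Python can tell you (a quick sketch; note that site.getsitepackages() is not available inside old-style virtualenvs):

    import site
    import sysconfig

    print(site.getsitepackages())            # e.g. the anaconda/lib/pythonx.x/site-packages folder
    print(sysconfig.get_paths()["purelib"])  # the same directory for the current interpreter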