MatchError while accessing vector column in Spark 2.0

I am trying to create an LDA model on a JSON file.

Creating a Spark session and reading the JSON file:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .master("local")
      .appName("my-spark-app")
      .config("spark.some.config.option", "config-value")
      .getOrCreate()

    val df = spark.read.json("dbfs:/mnt/JSON6/JSON/sampleDoc.txt")

Displaying df should show the DataFrame:

    display(df)

Tokenize the text

    import org.apache.spark.ml.feature.RegexTokenizer

    // Set params for RegexTokenizer
    val tokenizer = new RegexTokenizer()
      .setPattern("[\\W_]+")
      .setMinTokenLength(4) // Filter away tokens with length < 4
      .setInputCol("text")
      .setOutputCol("tokens")

    // Tokenize document
    val tokenized_df = tokenizer.transform(df)

This should display the tokenized_df:

    display(tokenized_df)

Get the stopwords

    %sh
    wget http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words -O /tmp/stopwords

Optional: copy the stopwords to the tmp folder

    %fs cp file:/tmp/stopwords dbfs:/tmp/stopwords

Collect all the stopwords

    val stopwords = sc.textFile("/tmp/stopwords").collect()

Filter out the stopwords

    import org.apache.spark.ml.feature.StopWordsRemover

    // Set params for StopWordsRemover
    val remover = new StopWordsRemover()
      .setStopWords(stopwords) // This parameter is optional
      .setInputCol("tokens")
      .setOutputCol("filtered")

    // Create new DF with stopwords removed
    val filtered_df = remover.transform(tokenized_df)

Displaying the filtered df should verify that the stopwords were removed:

    display(filtered_df)

Vectorize the frequency of words

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.sql.Row
    import org.apache.spark.ml.feature.CountVectorizer

    // Set params for CountVectorizer
    val vectorizer = new CountVectorizer()
      .setInputCol("filtered")
      .setOutputCol("features")
      .fit(filtered_df)

Verify the vectorizer

    vectorizer.transform(filtered_df)
      .select("id", "text", "features", "filtered").show()

After this I run into an issue when fitting this vectorizer into LDA. I believe the issue is that CountVectorizer is giving a sparse vector but LDA requires a dense vector. Still trying to figure out the issue.

Here is the exception where the map is unable to convert:

    import org.apache.spark.mllib.linalg.Vector

    val ldaDF = countVectors.map {
      case Row(id: String, countVector: Vector) => (id, countVector)
    }
    display(ldaDF)

Exception:

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4083.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4083.0 (TID 15331, 10.209.240.17): scala.MatchError: [0,(1252,[13,17,18,20,30,37,45,50,51,53,63,64,96,101,108,125,174,189,214,221,224,227,238,268,291,309,328,357,362,437,441,455,492,493,511,528,561,613,619,674,764,823,839,980,1098,1143],[1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,2.0,1.0,5.0,1.0,2.0,2.0,1.0,4.0,1.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)

There is a working sample for LDA which does not throw any issue:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.sql.Row
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}

    val a = Vectors.dense(Array(1.0, 2.0, 3.0))
    val b = Vectors.dense(Array(3.0, 4.0, 5.0))
    val df = Seq((1L, a), (2L, b), (2L, a)).toDF

    val ldaDF = df.map {
      case Row(id: Long, countVector: Vector) => (id, countVector)
    }
    val model = new LDA().setK(3).run(ldaDF.javaRDD)
    display(df)

The only difference is that in the second snippet we have dense vectors.

This has nothing to do with sparsity. Since Spark 2.0.0, ML transformers no longer generate o.a.s.mllib.linalg.VectorUDT but o.a.s.ml.linalg.VectorUDT, which is mapped locally to subclasses of o.a.s.ml.linalg.Vector. These are not compatible with the old MLlib API in Spark 2.0.0.

You can convert it to the "old" type using Vectors.fromML:

    import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
    import org.apache.spark.ml.linalg.{Vectors => NewVectors}

    OldVectors.fromML(NewVectors.dense(1.0, 2.0, 3.0))
    OldVectors.fromML(NewVectors.sparse(5, Seq(0 -> 1.0, 2 -> 2.0, 4 -> 3.0)))
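The reverse direction works too; a quick sketch using the asML method that mllib vectors gained in Spark 2.0:

    import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

    // asML converts an mllib vector to its ml.linalg counterpart
    OldVectors.dense(1.0, 2.0, 3.0).asML
    OldVectors.sparse(5, Seq(0 -> 1.0, 2 -> 2.0, 4 -> 3.0)).asML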

But if you already use ML transformers, it makes much more sense to perform LDA with the ML API as well.
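For illustration, a minimal sketch that stays entirely in the ML API (assuming the vectorizer and filtered_df from the question; the topic and iteration counts are arbitrary placeholders):

    import org.apache.spark.ml.clustering.LDA

    // CountVectorizer already outputs ml.linalg vectors, so the ML LDA
    // can consume the "features" column directly -- no Row matching or
    // vector conversion required
    val countVectors = vectorizer.transform(filtered_df)

    val ldaModel = new LDA()
      .setK(10)       // arbitrary number of topics
      .setMaxIter(10) // arbitrary iteration count
      .setFeaturesCol("features")
      .fit(countVectors)

    // Inspect the top terms per topic
    ldaModel.describeTopics(maxTermsPerTopic = 5).show()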

For convenience you can use implicit conversions:

    import scala.language.implicitConversions

    object VectorConversions {
      import org.apache.spark.mllib.{linalg => mllib}
      import org.apache.spark.ml.{linalg => ml}

      implicit def toNewVector(v: mllib.Vector): ml.Vector = v.asML
      implicit def toOldVector(v: ml.Vector): mllib.Vector = mllib.Vectors.fromML(v)
    }
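A hypothetical usage sketch: with the conversions in scope, a new-style vector can be used wherever an old-style one is expected, and vice versa:

    import org.apache.spark.mllib.linalg.{Vector => OldVector}
    import org.apache.spark.ml.linalg.{Vectors => NewVectors}
    import VectorConversions._

    // toOldVector is applied implicitly by the compiler here
    val v: OldVector = NewVectors.dense(1.0, 2.0, 3.0)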

The solution is very simple, guys. Find it below:

    //import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.ml.linalg.Vector
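In other words, importing the ml.linalg type makes the pattern match bind against the vector class the ML transformer actually produces. A sketch of the corrected map from the question (assuming its countVectors DataFrame of (id, features) rows):

    import org.apache.spark.sql.Row
    import org.apache.spark.ml.linalg.Vector // new type produced by CountVectorizer

    // The same match that threw scala.MatchError with the mllib import now succeeds
    val ldaDF = countVectors.rdd.map {
      case Row(id: String, countVector: Vector) => (id, countVector)
    }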