
Spark dataframe statements work differently inside class definition

Kevin L
1#
Published 2018-02-12 17:27:41Z

I'm trying to create a Spark/Scala class for working with a calendar reference table.

I run SQL against a Hadoop database to create a Spark DataFrame:

scala> val dfCyccal = sql(sqlCyccal)
dfCyccal: org.apache.spark.sql.DataFrame = [DT_WORKDAY: date, NWKD: int ... 4 more fields]

scala> dfCyccal.printSchema
root
 |-- DT_WORKDAY: date (nullable = true)
 |-- NWKD: integer (nullable = true)
 |-- DT_PREV_WD: date (nullable = true)
 |-- DT_NEXT_WD: date (nullable = true)
 |-- DT_MNTHEND: date (nullable = true)
 |-- ACCTG_MNTH: date (nullable = true)


scala> dfCyccal.show(5)
+----------+----+----------+----------+----------+----------+
|DT_WORKDAY|NWKD|DT_PREV_WD|DT_NEXT_WD|DT_MNTHEND|ACCTG_MNTH|
+----------+----+----------+----------+----------+----------+
|2004-01-29|  20|2003-12-30|2004-02-27|2004-01-29|2004-01-01|
|2004-01-30|   1|2003-12-31|2004-03-02|2004-02-27|2004-02-01|
|2004-02-02|   2|2004-01-02|2004-03-03|2004-02-27|2004-02-01|
|2004-02-03|   3|2004-01-05|2004-03-04|2004-02-27|2004-02-01|
|2004-02-04|   4|2004-01-06|2004-03-05|2004-02-27|2004-02-01|
+----------+----+----------+----------+----------+----------+
only showing top 5 rows

I then set reference constants for the extract:

scala> val MIN_DT_WORKDAY : java.sql.Date = dfCyccal.agg(min('DT_WORKDAY)).first().getDate(0)
MIN_DT_WORKDAY: java.sql.Date = 2004-01-29

scala> val MAX_DT_WORKDAY : java.sql.Date = dfCyccal.agg(max('DT_WORKDAY)).first().getDate(0)
MAX_DT_WORKDAY: java.sql.Date = 2020-12-01

Problem is, when I try to encapsulate this in a class definition, I get a different result:

class CYCCAL(parameters for SQL) {
  ...
  val dfCyccal = sql(sqlCyccal).persist;

  val MIN_DT_WORKDAY : java.sql.Date = dfCyccal.agg(min('DT_WORKDAY)).first().getDate(0)
  val MAX_DT_WORKDAY : java.sql.Date = dfCyccal.agg(max('DT_WORKDAY)).first().getDate(0)
}; // end of CYCCAL

The class body fails to compile with:

<console>:143: error: not found: value min
val MIN_DT_WORKDAY : java.sql.Date = dfCyccal.agg(min('DT_WORKDAY)).first().getDate(0)
                                                  ^
<console>:144: error: not found: value max
val MAX_DT_WORKDAY : java.sql.Date = dfCyccal.agg(max('DT_WORKDAY)).first().getDate(0)
                                                  ^

How does the Class setup change the operations on the DataFrame?

user9351164
2#
Replied 2018-02-12 17:30:19Z

They work the same. spark-shell just imports many objects by default, including:

import org.apache.spark.sql.functions._

which is missing in your own code.
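
For reference, here is a minimal sketch of the class with the imports made explicit. The SparkSession parameter named spark and the sqlCyccal constructor argument are just illustrative stand-ins for your actual setup:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{min, max} // what the shell's functions._ was providing

class CYCCAL(spark: SparkSession, sqlCyccal: String) {
  import spark.implicits._ // also restores the shell's implicit conversions

  val dfCyccal = spark.sql(sqlCyccal).persist()

  val MIN_DT_WORKDAY: java.sql.Date = dfCyccal.agg(min("DT_WORKDAY")).first().getDate(0)
  val MAX_DT_WORKDAY: java.sql.Date = dfCyccal.agg(max("DT_WORKDAY")).first().getDate(0)
}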

Kevin L
3#
Replied 2018-02-14 14:36:46Z

This worked. I had to add the following INSIDE the class definition:

import org.apache.spark.sql.functions.{min,max};

I also had to change the notation on the column from

val MIN_DT_WORKDAY : java.sql.Date = dfCyccal.agg(min('DT_WORKDAY)).first().getDate(0)

to

val MIN_DT_WORKDAY : java.sql.Date = dfCyccal.agg(min("DT_WORKDAY")).first().getDate(0)

The original 'DT_WORKDAY was being parsed as a Scala Symbol. spark-shell also auto-imports spark.implicits._, which supplies the implicit conversion from Symbol to Column; without that import in scope, the functions could not be applied to it.
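
To make the distinction concrete, here are the column notations sketched against the dfCyccal above. The last two forms only compile with import spark.implicits._ in scope (spark-shell imports it automatically):

import org.apache.spark.sql.functions.min

dfCyccal.agg(min("DT_WORKDAY"))           // plain string column name: only needs functions.min
dfCyccal.agg(min(dfCyccal("DT_WORKDAY"))) // explicit Column: no implicits required
dfCyccal.agg(min($"DT_WORKDAY"))          // $-interpolator, supplied by spark.implicits._
dfCyccal.agg(min('DT_WORKDAY))            // Symbol, converted to Column by spark.implicits._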
