Spark 2.3.1 #13

Open
polomarcus wants to merge 1 commit into master from 2.3.1

@polomarcus

Owner

Spark 2.2.0 to 2.3.1

Need to update Cassandra Sink

@cranberrysoft

I guess it will not work. I tried upgrading Spark to 2.3.1 and it started returning this error:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;

By the way, there is already a sink for Cassandra in DSE 6. I am wondering when (or if) they will port that solution to the Cassandra driver. The sink I mentioned lives in spark-connector-6.0.2.jar, in CassandraSourceRelation. I tested it in a standalone DSE and it works with .outputMode("update").
It is a pity that the community cannot use that solution for free...
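
For readers unfamiliar with the exception quoted above, here is a minimal sketch of what Spark is complaining about in general. It is not taken from this repository, and the thread does not confirm where exactly the sink hits the check. It assumes a local Spark session and the built-in `rate` test source; the object name and app name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Hypothetical, standalone sketch: not code from this repository.
object StreamingStartSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("streaming-start-sketch")
      .getOrCreate()

    // Built-in test source that emits (timestamp, value) rows.
    val stream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "1")
      .load()

    // A batch-style action on a streaming DataFrame is rejected with an
    // AnalysisException like the one quoted in the comment above.
    val batchAttempt = Try(stream.collect())
    batchAttempt.failed.foreach(e => println(e.getMessage))

    // The supported path: describe the output with writeStream, then start().
    val query = stream.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination(5000) // let a few micro-batches run
    query.stop()
    spark.stop()
  }
}
```

A custom sink can plausibly hit the same check if it calls batch-only APIs on the DataFrame it receives, but that is an assumption about this repo, not something established in the thread.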

@polomarcus

polomarcus commented Aug 3, 2018 via email

Owner Author

@cranberrysoft

Hi Paul,
I'd love to help you with the development of a sink for Cassandra, especially since it is not going to be included in the open-source driver, as you said. Please let me know how I can reach you if you need any help with this.

@polomarcus

Owner Author

I would have a look at the Elastic sink, which is open source, and use their implementation as inspiration.
Hopefully we just need to change an import (DataSourceV2 or something like that), but we may also have to rewrite the sink to be 2.3 compliant, which may take some time :/

We also have the foreach sink, which can be used with Cassandra. I refer to it as "unsafe" in the repo.
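
As a rough illustration of the foreach approach mentioned above, here is a minimal sketch of a `ForeachWriter` that writes each row to Cassandra through the DataStax connector. It is not the repo's implementation. The keyspace `demo`, the table `events`, its two text columns, and the class and object names are all hypothetical, and the `open` signature shown is the Spark 2.3 one.

```scala
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.SparkConf
import org.apache.spark.sql.{ForeachWriter, Row}

// Hypothetical keyspace, table and columns, for illustration only.
object ForeachSinkSketch {
  val keyspace = "demo"
  val table = "events"
  def insertCql: String =
    s"INSERT INTO $keyspace.$table (id, payload) VALUES (?, ?)"
}

// Assumes every incoming Row has two string columns: (id, payload).
class CassandraForeachWriter(conf: SparkConf) extends ForeachWriter[Row] {
  // CassandraConnector is serializable, so it can be shipped to executors.
  private val connector = CassandraConnector(conf)

  // Called for each partition of each trigger; true means "process it".
  override def open(partitionId: Long, version: Long): Boolean = true

  // Called once per row: one CQL statement per record, no batching.
  override def process(row: Row): Unit =
    connector.withSessionDo { session =>
      session.execute(ForeachSinkSketch.insertCql, row.getString(0), row.getString(1))
    }

  // Nothing to release here; the connector manages its own sessions.
  override def close(errorOrNull: Throwable): Unit = ()
}
```

It would be attached with something like `df.writeStream.foreach(new CassandraForeachWriter(spark.sparkContext.getConf)).start()`. Because `process` runs per row, each record becomes its own statement, so writes are not grouped the way a bulk save through the connector would group them.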

@cranberrysoft

cranberrysoft commented Aug 4, 2018

I also thought about the foreach sink, but it has two downsides. First, it does not support the stateful transformations that are the key feature of Structured Streaming. Second, I believe this solution is not really optimal: it uses a low-level API to save data to Cassandra and operates on one row at a time, so all the under-the-hood optimizations done by the driver are probably lost. I am pretty sure you have seen one of Russell's videos about the Cassandra driver: https://www.youtube.com/watch?v=cKIHRD6kUOc

I also tried to find inspiration in the DSE implementation. Unfortunately it is not open source, and it is Scala code, so you cannot easily decompile it ;) but I will try to dig a little to understand how it should be implemented.

@redsk

redsk commented Aug 23, 2018

Hi guys, I'm also interested in this and I'd love to help you with development. Please let me know how I can contact you for this effort. Cheers

@snowch

snowch commented Aug 28, 2018

Contributor

See also: scylladb/scylla-code-samples#67 (comment)

@polomarcus

Owner Author

Thanks for all your messages 😄

If you feel like giving it a try, the official Elastic sink can be a great source of inspiration for the Cassandra sink

Compared to what we have in the repo:

I might be able to spend some time on the issue next month.

@snowch

snowch commented Sep 20, 2018

Contributor

Looks like there is some useful stuff in here: scylladb/scylla-code-samples#68

@polomarcus

Owner Author

Thanks @snowch. Scylla does it the same way, using the DataStax connector: https://github.com/scylladb/scylla-code-samples/pull/68/files#diff-1e869081fec2d3c842a3b91688825a5eR71

I'm guessing it should be a small fix to get the project running with Spark 2.3.1 and the Cassandra sink.

@snowch

snowch commented Oct 26, 2018

Contributor

@polomarcus are you planning to implement the fix you suggested above?

4 participants