Kafka connect many to many tables in MSSQL

I'm currently looking into Kafka Connect to stream some of our databases to a data lake. To test out Kafka Connect I've setup a database with one of our project databases in. So far so good.

Next step I configured Kafka Connect with mode following properties:

"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector", "timestamp.column.name": "updated_at,created_at", "incrementing.column.name": "id", "dialect.name": "SqlServerDatabaseDialect", "validate.non.null": "false", "tasks.max": "1", "mode": "timestamp+incrementing", "topic.prefix": "mssql-jdbc-", "poll.interval.ms": "10000",

While this works for the majority of my tables where I got an ID and a created_at / updated_at field, it won't work for my tables where I solved my many-to-many relationships with a table in between and a composite key. Note that I'm using the generic JDBC configuration with a JDBC driver from Microsoft.

Is there a way to configure Kafka Connect for these special cases?

Have you tried using the query configuration options to write a join, for example? Otherwise, the documented Kafka pattern is using Kafka Streams to join your "lookup table" topics with the "data table" topics

– cricket_007
Sep 9 '18 at 15:23

query

@cricket_007 I don't quite understand. I'm under the impression that the query parameter can be used when you want to track specific queries. In my case I want to track the complete database. It could be that I'm wrong though. :-)

– Vince V.
Sep 10 '18 at 6:51

The incrementing timestamp options are also "specific queries", no I'm not sure I understand your concern

– cricket_007
Sep 10 '18 at 13:04

1 Answer
1

Instead of one connector to pull all of your tables, you may need to create multiple ones. This would be the case if you want to use different methods for fetching the data, or different ID/timestamp columns.
As @cricket_007 says, you can use the query option to pull back the results of a query—which could be a SELECT expressing your multi-table join. Even when pulling data from a single table object, the JDBC connector itself is just issuing a SELECT * from the given table, with a WHERE predicate to restrict the rows selected based on the incrementing ID/timestamp.

query

SELECT

SELECT *

WHERE

The alternative is to use log-based change data capture (CDC), and stream all changes directly from the database into Kafka.

Whether you use JDBC or log-based CDC, you can use stream processing to resolve joins in Kafka itself. An example of this is Kafka Streams or KSQL. I've written about the latter a lot here.

You might also find this article useful describing in detail your options for integrating databases with Kafka.

Disclaimer: I work for Confluent, the company behind the open-source KSQL project.

Thanks for contributing an answer to Stack Overflow!

But avoid …

To learn more, see our tips on writing great answers.

Required, but never shown

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

搜尋此網誌

Dfyjkt