Skip to main content
Version: Next

GitHub Data Connector

The GitHub Data Connector enables federated SQL queries on various GitHub resources such as files, issues, pull requests, and commits by specifying github as the selector in the from value for the dataset.

Common Configuration​

Configuration​

from​

The from field specifies the GitHub resource to query. The owner and repository name are extracted from the path (e.g., github:github.com/spiceai/spiceai/issues targets the spiceai/spiceai repository). It supports the following formats:

FormatDescription
github:github.com/<owner>/<repo>/files/<ref>Query files from a repository at a specific branch or tag
github:github.com/<owner>/<repo>/issuesQuery issues from a repository
github:github.com/<owner>/<repo>/pullsQuery pull requests from a repository
github:github.com/<owner>/<repo>/commitsQuery commits from a repository
github:github.com/<owner>/<repo>/stargazersQuery stargazers from a repository
github:github.com/<organization>/membersQuery members from an organization

name​

The dataset name. This will be used as the table name within Spice. The dataset name cannot be a reserved keyword.

params​

Personal Access Token​

Parameter NameDescription
github_tokenRequired. GitHub personal access token to use to connect to the GitHub API. Learn more.

GitHub App Installation​

GitHub Apps provide a secure and scalable way to integrate with GitHub's API, and works well when interacting with one or more GitHub organizations. Learn more.

Parameter NameDescription
github_client_idRequired. Specifies the client ID for GitHub App Installation auth mode.
github_private_keyRequired. Specifies the private key for GitHub App Installation auth mode.
github_installation_idRequired. Specifies the installation ID for GitHub App Installation auth mode.

The client ID and private key are generated when creating the GitHub app.

Getting the Installation ID

If the app is installed on a GitHub organization:

  • Visit the settings page for the organization (https://github.com/organizations/<ORG>/settings/installations)
  • Click "Configure" on the app
  • The URL of the page will be of the form https://github.com/organizations/<ORG>/settings/installations/<INSTALLATION_ID>

If the app is installed on a GitHub user:

  • Visit the settings page
  • Click "Configure" on the app
  • The URL of the page will be of the form https://github.com/settings/installations/<INSTALLATION_ID>
Limitations

With GitHub App Installation authentication, the connector's functionality depends on the permissions and scope of the GitHub App. Ensure that the app is installed on the repositories and configured with content, commits, issues and pull permissions to allow the corresponding datasets to work.

Common Parameters​

Parameter NameDescription
github_query_modeOptional. Specifies whether the connector should use the GitHub search API for improved filter performance. Defaults to auto, possible values of auto or search.

Advanced Configuration​

Rate Limiting​

When using multiple GitHub datasets sharing the same GitHub token or GitHub app credentials, it is possible to exceed GitHub's primary and secondary rate limits. To mitigate this, use the github_concurrent_connections_limit setting under runtime.source_rate_control. This connections limit applies per GitHub token and per GitHub app installation, following GitHub's rate limit policy.

Deprecated

runtime.params.github_max_concurrent_connections is deprecated. Use runtime.source_rate_control.github_concurrent_connections_limit instead.

Example Configuration:

# ... other configuration ...
runtime:
source_rate_control:
github_concurrent_connections_limit: 5 # Defaults to 10

datasets:
- from: github:github.com/spiceai/spiceai/files/v0.17.2-beta
name: spiceai.files
params:
github_token: ${secrets:GITHUB_TOKEN}
include: '**/*.txt'
acceleration:
enabled: true
- from: github:github.com/<owner>/<repo>/issues
name: spiceai.issues
params:
github_token: ${secrets:GITHUB_TOKEN}
acceleration:
enabled: true
# ... other configuration ...

The GitHub connector supports the following HTTP concurrency parameter:

Parameter NameDescription
max_concurrent_requestsMaximum number of concurrent HTTP requests to the same upstream origin. Overrides runtime.params.http_max_concurrent_requests. If both are unset, concurrency limiting is disabled.

The GitHub connector uses its own rate limiter based on GitHub API X-RateLimit-* response headers. Multiple datasets targeting the same GitHub endpoint share this rate limiter.

Filter Push Down​

GitHub queries support a github_query_mode parameter, which can be set to either auto or search for the following types:

  • Issues: Defaults to auto. Query filters are only pushed down to the GitHub API in search mode.
  • Pull Requests: Defaults to auto. Query filters are only pushed down to the GitHub API in search mode.

Commits only supports auto mode. Query with filter push down is only enabled for the committed_date column. commited_date supports exact matches, or greater/less than matches for dates provided in ISO8601 format, like WHERE committed_date > '2024-09-24'.

When set to search, Issues and Pull Requests will use the GitHub Search API for improved filter performance when querying against the columns:

  • author and state; supports exact matches, or NOT matches. For example, WHERE author = 'peasee' or WHERE author <> 'peasee'.
  • body and title; supports exact matches, or LIKE matches. For example, WHERE body LIKE '%duckdb%'.
  • updated_at, created_at, merged_at and closed_at; supports exact matches, or greater/less than matches with dates provided in ISO8601 format. For example, WHERE created_at > '2024-09-24'.

All other filters are supported when github_query_mode is set to search, but cannot be pushed down to the GitHub API for improved performance.

Limitations
  • GitHub has a limitation in the Search API where it may return more stale data than the standard API used in the default query mode.
  • GitHub has a limitation in the Search API where it only returns a maximum of 1000 results for a query. Use append mode acceleration to retrieve more results over time. See the append example for pull requests.

Examples​

Querying GitHub Files​

Limitations
  • content column is fetched only when acceleration is enabled.
  • Querying GitHub files does not support filter push down, which may result in long query times when acceleration is disabled.
  • Setting github_query_mode to search is not supported.
  • ref - Required. Specifies the GitHub branch or tag to fetch files from.
  • include - Optional. Specifies a pattern to include specific files. Supports glob patterns. If not specified, all files are included by default.
datasets:
- from: github:github.com/<owner>/<repo>/files/<ref>
name: spiceai.files
params:
github_token: ${secrets:GITHUB_TOKEN}
include: '**/*.json; **/*.yaml'
acceleration:
enabled: true

Schema​

Column NameData TypeIs Nullable
nameUtf8YES
pathUtf8YES
refUtf8NO
sizeInt64YES
shaUtf8YES
modeUtf8YES
urlUtf8YES
download_urlUtf8YES
contentUtf8YES

Example​

datasets:
- from: github:github.com/spiceai/spiceai/files/v0.17.2-beta
name: spiceai.files
params:
github_token: ${secrets:GITHUB_TOKEN}
include: '**/*.txt' # include txt files only
acceleration:
enabled: true
sql> select * from spiceai.files
+-------------+-------------+------+------------------------------------------+--------+-------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+-------------+
| name | path | size | sha | mode | url | download_url | content |
+-------------+-------------+------+------------------------------------------+--------+-------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+-------------+
| version.txt | version.txt | 12 | ee80f747038c30e776eecb2c2ae155dec9a68187 | 100644 | https://api.github.com/repos/spiceai/spiceai/git/blobs/ee80f747038c30e776eecb2c2ae155dec9a68187 | https://raw.githubusercontent.com/spiceai/spiceai/v0.17.2-beta/version.txt | 0.17.2-beta |
| | | | | | | | |
+-------------+-------------+------+------------------------------------------+--------+-------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+-------------+

Time: 0.005067 seconds. 1 rows.

Querying GitHub Issues​

Limitations
  • Querying with filters using date columns requires the use of ISO8601 formatted dates. For example, WHERE created_at > '2024-09-24'.
datasets:
- from: github:github.com/<owner>/<repo>/issues
name: spiceai.issues
params:
github_token: ${secrets:GITHUB_TOKEN}
acceleration:
enabled: true

Schema​

Column NameData TypeIs Nullable
assigneesList(Utf8)YES
authorUtf8YES
bodyUtf8YES
closed_atTimestampYES
commentsList(Struct)YES
created_atTimestampYES
idUtf8YES
labelsList(Utf8)YES
milestone_idUtf8YES
milestone_titleUtf8YES
comments_countInt64YES
numberInt64YES
stateUtf8YES
titleUtf8YES
updated_atTimestampYES
urlUtf8YES

Example​

datasets:
- from: github:github.com/spiceai/spiceai/issues
name: spiceai.issues
params:
github_token: ${secrets:GITHUB_TOKEN}
sql> select title, state, labels from spiceai.issues where title like '%duckdb%'
+-----------------------------------------------------------------------------------------------------------+--------+----------------------+
| title | state | labels |
+-----------------------------------------------------------------------------------------------------------+--------+----------------------+
| Limitation documentation duckdb accelerator about nested struct and decimal256 | CLOSED | [kind/documentation] |
| Inconsistent duckdb connector params: `params.open` and `params.duckdb_file` | CLOSED | [kind/bug] |
| federation across multiple duckdb acceleration tables. | CLOSED | [] |
| Integration tests to cover "On Conflict" behaviors for duckdb accelerator | CLOSED | [kind/task] |
| Permission denied issue while using duckdb data connector with spice using HELM for Kubernetes deployment | CLOSED | [kind/bug] |
+-----------------------------------------------------------------------------------------------------------+--------+----------------------+

Time: 0.011877542 seconds. 5 rows.

Querying GitHub Pull Requests​

Limitations
  • Querying with filters using date columns requires the use of ISO8601 formatted dates. For example, WHERE created_at > '2024-09-24'.
datasets:
- from: github:github.com/<owner>/<repo>/pulls
name: spiceai.pulls
params:
github_token: ${secrets:GITHUB_TOKEN}
# Specifies the types of comments to fetch: 'all', 'review', 'discussion', or 'none'. Defaults to 'none'.
github_include_comments: none
# Number of comments to fetch per discussion or review thread.
# Defaults to 25, and is capped at 75
github_max_comments_fetched: 50

Schema​

Column NameData TypeIs Nullable
additionsInt64YES
assigneesList(Utf8)YES
authorUtf8YES
bodyUtf8YES
changed_filesInt64YES
closed_atTimestampYES
comments_countInt64YES
commits_countInt64YES
created_atTimestampYES
deletionsInt64YES
discussionList(Struct(body: Utf8, author: Utf8, created_at: Timestamp))YES
hashesList(Utf8)YES
idUtf8YES
labelsList(Utf8)YES
merged_atTimestampYES
numberInt64YES
review_commentsList(Struct(body: Utf8, author: Utf8, created_at: Timestamp))YES
reviews_countInt64YES
stateUtf8YES
titleUtf8YES
updated_atTimestampYES
urlUtf8YES

Note: The discussion and review_comments columns are only included in the schema when the github_include_comments parameter is set accordingly.

Example​

datasets:
- from: github:github.com/spiceai/spiceai/pulls
name: spiceai.pulls
params:
github_token: ${secrets:GITHUB_TOKEN}
acceleration:
enabled: true
sql> select title, url, state from spiceai.pulls where title like '%GitHub connector%'
+---------------------------------------------------------------------+----------------------------------------------+--------+
| title | url | state |
+---------------------------------------------------------------------+----------------------------------------------+--------+
| GitHub connector: convert `labels` and `hashes` to primitive arrays | https://github.com/spiceai/spiceai/pull/2452 | MERGED |
+---------------------------------------------------------------------+----------------------------------------------+--------+

Time: 0.034996667 seconds. 1 rows.

Append Example​

datasets:
- from: github:github.com/spiceai/spiceai/pulls
name: spiceai.pulls
params:
github_token: ${secrets:GITHUB_TOKEN}
github_query_mode: search
time_column: created_at
acceleration:
enabled: true
refresh_mode: append
refresh_check_interval: 6h # check for new results every 6 hours
refresh_data_window: 90d # at initial load, load the last 90 days of pulls

Comments Example​

datasets:
- from: github:github.com/spiceai/spiceai/pulls
name: spiceai.pulls
params:
github_token: ${secrets:GITHUB_TOKEN}
github_include_comments: all
github_max_comments_fetched: 75
acceleration:
enabled: true
sql> select unnest(unnest(review_comments)) from spiceai.pulls where number = 6 limit 1;
+------------------------------------------------------------------+------------------------------------------------------------------------+--------------------------------------------------------------------+
| __unnest_placeholder(UNNEST(spiceai.pulls.review_comments)).body | __unnest_placeholder(UNNEST(spiceai.pulls.review_comments)).created_at | __unnest_placeholder(UNNEST(spiceai.pulls.review_comments)).author |
+------------------------------------------------------------------+------------------------------------------------------------------------+--------------------------------------------------------------------+
| Nitpick - extra space. | 2021-08-11T17:36:23 | haardvark |
+------------------------------------------------------------------+------------------------------------------------------------------------+--------------------------------------------------------------------+

Time: 0.034283334 seconds. 1 rows.
sql> select unnest(unnest(discussion)) from spiceai.pulls where number = 148 limit 1;
+-------------------------------------------------------------+-------------------------------------------------------------------+---------------------------------------------------------------+
| __unnest_placeholder(UNNEST(spiceai.pulls.discussion)).body | __unnest_placeholder(UNNEST(spiceai.pulls.discussion)).created_at | __unnest_placeholder(UNNEST(spiceai.pulls.discussion)).author |
+-------------------------------------------------------------+-------------------------------------------------------------------+---------------------------------------------------------------+
| Do not merge until after repo goes public. | 2021-09-06T08:00:45 | lukekim |
+-------------------------------------------------------------+-------------------------------------------------------------------+---------------------------------------------------------------+

Time: 0.036530584 seconds. 1 rows.

Querying GitHub Commits​

Limitations
  • Querying with filters using date columns requires the use of ISO8601 formatted dates. For example, WHERE committed_date > '2024-09-24'.
  • Setting github_query_mode to search is not supported.
datasets:
- from: github:github.com/<owner>/<repo>/commits
name: spiceai.commits
params:
github_token: ${secrets:GITHUB_TOKEN}

Schema​

Column NameData TypeIs Nullable
additionsInt64YES
associated_pull_request_numberInt64YES
author_emailUtf8YES
author_nameUtf8YES
changed_filesInt64YES
committed_dateTimestampYES
committer_dateTimestampYES
committer_emailUtf8YES
committer_nameUtf8YES
deletionsInt64YES
idUtf8YES
messageUtf8YES
message_bodyUtf8YES
message_head_lineUtf8YES
refUtf8YES
shaUtf8YES
statusUtf8YES

Example​

datasets:
- from: github:github.com/spiceai/spiceai/commits
name: spiceai.commits
params:
github_token: ${secrets:GITHUB_TOKEN}
acceleration:
enabled: true
sql> select sha, message_head_line from spiceai.commits limit 10
+------------------------------------------+------------------------------------------------------------------------+
| sha | message_head_line |
+------------------------------------------+------------------------------------------------------------------------+
| 2a9fab7905737e1af182e17f40aecc5c4b5dd236 | wait 2 seconds for the status to turn ready in refreshing status tes… |
| b9c210a818abeaf14d2493fde5227781f47faed8 | Update README.md - Remove bigquery from tablet of connectors (#1434) |
| d61e1af61ebf826f83703b8dd939f19e8b2ba426 | Add databricks_use_ssl parameter (#1406) |
| f1ec55c5986e3e5d57eff94197182ffebbae1045 | wording and logs change reflected on readme (#1435) |
| bfc74185584d1e048ef66c72ce3572a0b652bfd9 | Update acknowledgements (#1433) |
| 0d870f1791d456e7924b4ecbbda5f3b762db1e32 | Update helm version and use v0.13.0-alpha (#1436) |
| 12f930cbad69833077bd97ea43599a75cff985fc | Enable push-down federation by default (#1429) |
| 6e4521090aaf39664bd61d245581d34398ce77db | Add functional tests for federation push-down (#1428) |
| fa3279b7d9fcaa5e8baaa2425f69b556bb30e309 | Add LRU cache support for http-based sql queries (#1410) |
| a3f93dde9d1312bfbf14f7ae3b75bdc468289212 | Add guides and examples about error handling (#1427) |
+------------------------------------------+------------------------------------------------------------------------+

Time: 0.0065395 seconds. 10 rows.

Querying GitHub stars (Stargazers)​

Limitations
  • Querying with filters using date columns requires the use of ISO8601 formatted dates. For example, WHERE starred_at > '2024-09-24'.
  • Setting github_query_mode to search is not supported.
datasets:
- from: github:github.com/<owner>/<repo>/stargazers
name: spiceai.stargazers
params:
github_token: ${secrets:GITHUB_TOKEN}

Schema​

Column NameData TypeIs Nullable
starred_atTimestampYES
loginUtf8YES
emailUtf8YES
nameUtf8YES
companyUtf8YES
x_usernameUtf8YES
locationUtf8YES
avatar_urlUtf8YES
bioUtf8YES

Example​

datasets:
- from: github:github.com/spiceai/spiceai/stargazers
name: spiceai.stargazers
params:
github_token: ${secrets:GITHUB_TOKEN}
acceleration:
enabled: true
sql> select starred_at, login from spiceai.stargazers order by starred_at DESC limit 10
+----------------------+----------------------+
| starred_at | login |
+----------------------+----------------------+
| 2024-09-15T13:22:09Z | cisen |
| 2024-09-14T18:04:22Z | tyan-boot |
| 2024-09-13T10:38:01Z | yofriadi |
| 2024-09-13T10:01:33Z | FourSpaces |
| 2024-09-13T04:02:11Z | d4x1 |
| 2024-09-11T18:10:28Z | stephenakearns-insta |
| 2024-09-09T22:17:42Z | Lrs121 |
| 2024-09-09T19:56:26Z | jonathanfinley |
| 2024-09-09T07:02:10Z | leookun |
| 2024-09-09T03:04:27Z | royswale |
+----------------------+----------------------+

Time: 0.0088075 seconds. 10 rows.

Querying Members of a GitHub Organization​

Limitations
  • Querying with filters using date columns requires the use of ISO8601 formatted dates. For example, WHERE created_at > '2024-09-24'.
  • Setting github_query_mode to search is not supported.
datasets:
- from: github:github.com/<organization>/members
name: members
params:
github_token: ${secrets:GITHUB_TOKEN}

Schema​

Column NameData TypeIs Nullable
usernameUtf8YES
nameUtf8YES
avatar_urlUtf8YES
urlUtf8YES
emailUtf8YES
locationUtf8YES
companyUtf8YES
created_atTimestampYES
bioUtf8YES

Example​

datasets:
- from: github:github.com/apache/members
name: apache.members
params:
github_token: ${secrets:GITHUB_TOKEN}
acceleration:
enabled: true
sql> select created_at, username from apache.members order by created_at desc limit 10;
+---------------------+-------------------+
| created_at | username |
+---------------------+-------------------+
| 2023-10-09T13:14:13 | heliang666s |
| 2023-04-14T11:26:44 | cortlepp |
| 2023-02-16T08:28:58 | ChengJie1053 |
| 2023-02-11T03:51:52 | FinalT |
| 2022-11-20T12:12:56 | Yanshuming1 |
| 2022-10-10T23:29:29 | bernardodemarco |
| 2022-10-07T05:06:37 | coldgust |
| 2022-09-06T14:38:44 | No-SilverBullet |
| 2022-08-18T13:31:44 | harshithasudhakar |
| 2022-07-05T10:44:08 | bearslyricattack |
+---------------------+-------------------+

Time: 0.054390375 seconds. 10 rows.

Cookbook​