
A New Class of Applications That Learn and Adapt

· 5 min read
Phillip LeBlanc
Co-Founder and CTO of Spice AI

A new class of applications that learn and adapt is becoming possible through machine learning (ML). These applications learn from data and make decisions to achieve the application's goals. In the post Making apps that learn and adapt, Luke described how developers integrate this ability to learn and adapt as a core part of the application's logic. You can think of the component that does this as a "decision engine." This post will explore a brief history of decision engines and use-cases for this application class.

History of decision engines

The idea of building intelligent decision-making applications is not new. Developers first created these applications around the 1970s¹, and they are some of the earliest examples of using artificial intelligence to solve real-world problems.

The first applications used a class of decision engines called "expert systems." A distinguishing trait of expert systems is that they encode human expertise in rules for decision-making. Domain experts created combinations of rules that powered decision-making capabilities.

Some uses of expert systems include:

However, the resources required to build expert systems make employing them infeasible for many applications². They often need a significant investment of time and resources to capture and encode expertise into complex rule sets. These systems also do not automatically learn from experience, relying on experts to write more rules to improve decision-making.

With the advent of modern deep-learning techniques and the ability to access significantly more data, it is now possible for the computer, not only the developer, to learn and encode the rules to power a decision engine and improve them over time. The vision for Spice.ai is to make it easy for developers to build this new class of applications. So what are some use-cases for these applications?

Use cases of decision-making applications

Reduce energy costs by optimizing air conditioning

Today: The air conditioning system for an office building runs on a fixed schedule and is set to a fixed temperature during business hours, adjusting only with in-room sensor data, if at all. This behavior potentially overcools the building toward closing time, as the outside temperature drops and the building starts to empty.

With Spice.ai: The application combines time-series data from multiple sources, including time of day, day of the week, building/room occupancy, outside temperature, energy consumption, and pricing. The A/C controller application learns how to adjust the air conditioning system as rooms naturally cool toward the end of the day and occupancy decreases, with the decision engine rewarded for maintaining the desired temperature while minimizing energy consumption and cost.

Food delivery order dispatching

Today: Customers order food delivery with a mobile app. When the order is ready to be picked up from the restaurant, it is dispatched to a delivery driver by a simple heuristic that chooses the nearest available driver. As the app grows in popularity and the number of restaurants, drivers, and customers increases, the heuristic needs constant tuning, or human operators to supplement it, to handle the demand.

With Spice.ai: The application learns which driver to dispatch to minimize delivery time and maximize customer star ratings. It considers several factors from the data, including patterns in both the restaurants' and the drivers' order histories. As the number of users, drivers, and customers increases over time, the app adapts to keep up with the changing patterns and demands of the business.

Routing stock or crypto trades to the best exchange

Today: When trading stocks through a broker like Fidelity or TD Ameritrade, your broker will likely route your order to an exchange like the NYSE. And in the emerging world of crypto, you can place your trade or swap directly on a decentralized exchange (DEX) like Uniswap or PancakeSwap. In both cases, orders are likely routed either by a traditional rules-based expert system or even manually.

With Spice.ai: A smart order routing application learns from data such as pending transactions, time of day, day of the week, transaction size, and the recent history of transactions. It finds patterns to determine the optimal route or exchange for executing the transaction and getting you the best trade.

Summary

A new class of applications that can learn and adapt is made possible by integrating AI-powered decision engines. Spice.ai is a decision engine that makes it easy for developers to build these applications.

If you'd like to partner with us in creating this new generation of intelligent decision-making applications, we invite you to join us on Discord, reach out on Twitter or email us.

Phillip

Footnotes

  1. Russell, Stuart; Norvig, Peter (1995). Artificial Intelligence: A Modern Approach. Simon & Schuster. pp. 22–23. ISBN 978-0-13-103805-9.

  2. Kendal, S. L.; Creen, M. (2007). An Introduction to Knowledge Engineering. London: Springer. ISBN 978-1-84628-475-5.

Spice.ai v0.5.1-alpha

· 3 min read
Phillip LeBlanc
Co-Founder and CTO of Spice AI

Announcing the release of Spice.ai v0.5.1-alpha! 📈

This minor release builds upon v0.5-alpha, adding the ability to start training from the dashboard plus support for monitoring training runs with TensorBoard.

Highlights in v0.5.1-alpha

Start training from dashboard

A "Start Training" button has been added to the pod page on the dashboard so that you can easily start training runs from that context.

Training runs can now be started by:

  • Modifications to the Spicepod YAML file.
  • The spice train <pod name> command.
  • The "Start Training" dashboard button.
  • POST API calls to /api/v0.1/pods/{pod name}/train (see the example below).
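
For example, a training run can be started from any HTTP client. Here is a minimal Python sketch; the pod name trader and the local runtime address are placeholders for illustration:

```python
import requests

# Hypothetical pod name for illustration; replace with your own pod.
POD_NAME = "trader"

# Adjust the host/port to wherever your Spice.ai runtime is listening.
BASE_URL = "http://localhost:8000"

# Start a training run via the v0.1 API.
response = requests.post(f"{BASE_URL}/api/v0.1/pods/{POD_NAME}/train")
response.raise_for_status()
print(f"Training started for pod '{POD_NAME}': HTTP {response.status_code}")
```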

TensorBoard monitoring

TensorBoard monitoring is now supported when using the DQL (default) or the new SACD learning algorithm announced in v0.5-alpha.

When enabled, TensorBoard logs will automatically be collected, and an "Open TensorBoard" button will be shown on the pod page in the dashboard.

Logging can be enabled at the pod level with the training_loggers pod param or per training run with the CLI --training-loggers argument.

Support for VPG will be added in v0.6-alpha. The design allows for additional loggers to be added in the future. Let us know what you'd like to see!

New in this release

  • Adds a start training button on the dashboard pod page.
  • Adds TensorBoard logging and monitoring when using DQL and SACD learning algorithms.

Dependency updates

  • Updates to Tailwind 3.0.6
  • Updates to Glide Data Grid 3.2.1

Resources

Community

Spice.ai started with the vision to make AI easy for developers. We are building Spice.ai in the open and with the community. Reach out on Discord or by email to get involved. We will also be starting a community call series soon!

Understanding Q-learning: How a Reward Is All You Need

· 11 min read
Corentin Risselin
Software Engineer at Spice AI

There are two general ways to train an AI to match a given expectation: we can either give it the expected outputs (commonly named labels) for different inputs, which we call supervised learning, or we can provide a reward for each output as a score, which is reinforcement learning (RL).

Supervised learning works by tweaking all the parameters (weights in neural networks) to fit the desired outputs, expecting that given enough input/label pairs the AI will find common rules that generalize for any input.

Reinforcement learning's reward is often provided by a simple function that can score any output: we don't know what specific output would be best, but we can recognize how good the result is. In this latter statement, there are two underlying concepts we will address in this post:

  • Can we only tell if the output is good in a binary way, or do we have to quantify the output to train our AI?
  • Do we have to give a reward for every AI's output? Can we give a reward only at specific times?

Those questions are already mostly answered, and many algorithms deal with these topics. Our journey here will be to understand how we tackle those questions and end up with a beautiful formula that is at the core of modern approaches to RL:
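
Written out in standard Q-learning notation (each symbol is introduced step by step over the rest of the post), it matches the fully expanded form derived at the end (Equation 11):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$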

Equation 1. The Q estimation at the heart of many RL algorithms, also known as the Bellman equation.

Q-learning

The vast majority, if not all, of modern RL algorithms are based on the principles of Q-learning: the idea is to evaluate a 'reward expectation' for each possible action. If we can have a good evaluation, we could maximize the reward by choosing actions with the maximum evaluated rewards. The function giving this expected reward is named Q. For now, we will assume we can have a reward for any action.
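
In standard notation, with $s_t$ the state and $a_t$ the action at step $t$, this first definition simply equates the Q value with the reward:

$$Q(s_t, a_t) = r(s_t, a_t)$$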

Equation 2. Definition of the Q function.

The t indices show that the state and action aren't constant and will vary, usually with time/actions taken. On the other hand, the Q function and the reward function r are unique functions that ideally return the 'expected reward' for any (state, action) pair.

For now, we will assume we can have a reward that gives an objective and perfect evaluation of each state/action.

Figure 1. Example of reward given for different actions at a specific state. Here a simple 2D map with a goal.

Q-Table

We know that actions' outcomes (rewards) will vary depending on the current state we are in; otherwise, the problem would be trivial to solve. If the states that are relevant to our actions can be enumerated, a simple approach is to build a table with all the possible state/action pairs. There are different ways to build such a table depending on how we can interact with our environment. Eventually, we would have a good 'map' to guide us toward the best actions.

Figure 2. Example of Q-table: we can build an exhaustive table for all the possible (state, action) pairs
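
To make the idea concrete, here is a minimal sketch (not Spice.ai's implementation; the grid size and the sample transition are made up) of a Q-table with a greedy lookup:

```python
import numpy as np

n_states = 16   # e.g., cells of a small 2D grid world (placeholder size)
n_actions = 4   # e.g., up, down, left, right

# One row per state, one column per action; each cell holds the expected reward Q(s, a).
q_table = np.zeros((n_states, n_actions))

# Filling a cell after observing a reward for taking `action` in `state`:
state, action, reward = 3, 2, 1.0
q_table[state, action] = reward

# Acting greedily: pick the action with the highest expected reward for the current state.
best_action = int(np.argmax(q_table[state]))
```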

Deep Q-Learning

When the number of environment variables relevant to our actions/rewards becomes too large, the number of possible states grows quickly. It doesn't take many parameters to make the Q-table approach unfeasible. Neural networks are known to work well and efficiently in high dimensionality (with many input variables). They also generalize well, so the idea in Deep Q-Learning is to use a neural network to predict the different Q values for each action given a state.

Figure 3. A neural network can predict Q values from state information

In this case, we do not need to give the network state/action pairs but only the state, as the neural network exhaustively returns all the Q values associated with each action. Outputting all actions' Q values is a common method, as typical cases have a complex environment but a relatively small number of possible actions.
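
As a sketch of such a network (PyTorch is used here only for illustration; the layer sizes and dimensions are arbitrary):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q value per possible action."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one output per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.layers(state)

# Example: a 5-dimensional state and 4 possible actions (placeholder sizes).
net = QNetwork(state_dim=5, n_actions=4)
q_values = net(torch.randn(1, 5))          # shape: (1, 4)
best_action = int(q_values.argmax(dim=1))  # greedy action selection
```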

This method works very well. It is similar to supervised learning, with states as inputs and rewards as labels. We assumed so far that we had a reward for each action, and we chose the next action with the best reward (called a greedy policy). In many cases this is not enough: even if an action yields the best reward at a given state, it may affect the next state so that we don't optimize the reward in the long term. Also, if we can't have a reward for each action, we usually give 0 as the reward. We will then be unable to choose the right action when actions affect later states without yielding different rewards at the current state.

The sparsity of rewards or the long-term calculation of total reward (non-greedy policies) leads us to diverge from supervised learning and learn potential future rewards.

Temporal difference: TD-Learning

TD-learning is a clever way to account for potential future values without knowing them yet. TD is a model-free class of algorithms: it does not simulate future states. The main idea is to consider all the rewards of a sequence of actions to give a better value than just the reward of the next action.

We can, for instance, sum all the future rewards:

Figure 4. Accumulating future rewards to assign values to each state.

Mathematically this can be written as:
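
Here $V(s_t)$ denotes the value assigned to state $s_t$ and $r_{t+k}$ the reward received $k$ steps later:

$$V(s_t) = r_t + r_{t+1} + r_{t+2} + \dots = \sum_{k=0}^{\infty} r_{t+k}$$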

Equation 3.

This is named TD(0): the simplest form of TD method, accumulating all the rewards.

Introducing policies

We could try different trajectories (sequences of actions) and retrospectively get the final reward for each action, but this has two drawbacks: the environment is usually too vast, and the sequence of actions might not even have a definite end. Also, such exhaustive methods might not be very efficient. Instead, we can evaluate the 'value' of the next state overall, like the maximum of all its possible rewards (direct reward), and add this value to the reward of a given action.

If a state can have different branches, we can select the best one, and this would be our policy, the way we choose actions. This simple form of taking the maximum is called the 'greedy' policy.

Figure 5. With a greedy policy, the values associated with a state come from the maximum value of the next state. Here, even though the lower branch gives only half the top reward directly, its overall value is greater.

This can be written down as:
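
One way to write it, with $\mathbb{E}[\cdot]$ denoting the expected value under the chosen policy:

$$V(s_t) = r_t + \mathbb{E}\left[V(s_{t+1})\right]$$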

Equation 4.

The expected value notation is defined as:
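
with $p_a$ the probability that the policy picks action $a$ in state $s_t$:

$$\mathbb{E}\left[V(s_{t+1})\right] = \sum_{a} p_a \, V(s_{t+1} \mid a_t = a)$$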

Equation 5.

For a greedy policy, all the probabilities p are set to 0 except the one associated with the highest return, which is set to 1 (in case of a tie between n actions, we assign each a probability of 1/n to get the same expected value).
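
So, for the greedy policy, the expectation collapses to a maximum over the possible actions:

$$\mathbb{E}\left[V(s_{t+1})\right] = \max_{a} V(s_{t+1} \mid a_t = a)$$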

Equation 6.

Relation with Q function

The expected reward can be replaced by the Q function we used earlier, which can now be denoted as specific to our chosen policy (named π):
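
Using the same structure as Equation 4, but with Q values in place of state values:

$$Q_\pi(s_t, a_t) = r_t + \mathbb{E}_\pi\left[Q_\pi(s_{t+1}, a_{t+1})\right]$$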

Equation 7.

TD-0

We previously discussed the problem of not being able to go through all the states exhaustively and that the evaluation of the Q value from a neural network could help. We want to use the TD method to have a better value estimation that will consider potential future rewards.

The TD(0) method is elegant because we can, in fact, use only the next state's expected value instead of all future ones. The idea is that with successive evaluations, we build a chain of dependencies, as each state's value depends on the next one's.
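
In symbols, each value is bootstrapped from the immediate reward and the value of the state that follows:

$$V(s_t) = r_t + V(s_{t+1})$$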

Equation 8.

Figure 6. Iterative propagation of state values following the TD(0) method.

We can see that the greedy policy would work even with null rewards in the trajectory. We can make our greedy policy explicit, going back to using the Q value instead of the state value V:
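
With a greedy choice over the next state's actions, this reads:

$$Q(s_t, a_t) = r_t + \max_{a} Q(s_{t+1}, a)$$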

Equation 9.

TD-lambda

We need to fix a problem: if a trajectory grows too long or never ends, a state value can potentially grow indefinitely. To counter that, we can add a discount factor (originally named lambda, usually referred to as gamma in Q-learning) for the next state's value:
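
Writing the discount factor as $\gamma$ (with $0 \le \gamma < 1$):

$$Q(s_t, a_t) = r_t + \gamma \max_{a} Q(s_{t+1}, a)$$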

Equation 10.

Notice that we simplify the reward notation for clarity.

To avoid exploding values, this discount has to be between 0 and 1 (strictly below 1). We can think of it as giving more importance to the direct reward than to future ones. As the contribution of later rewards decreases, the chain of actions can grow without the calculated value growing. If the reward has an upper limit, the value will also be bounded.

The sparsity of rewards is also addressed: giving only a positive reward after many non-rewarding steps will create smooth values for the intermediate states. Any reward, positive or negative, will diffuse its value to the neighboring states.

Figure 7. TD(0) value propagation can allow for a smooth value distribution over the states that helps build efficient behavior.

Q-Learning algorithm

Finally, as we train a neural network to estimate the Q function, we need to update its target with successive iterations. We cannot fully trust the estimator (a neural network here) to give the correct value, so we introduce a learning rate to update the target smoothly.
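
Writing the learning rate as $\alpha$, the target from Equation 10 is blended into the current estimate:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$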

Equation 11. Fully explained Bellman equation.

That is it! We now understand all the parts of this formula. Over multiple training steps with different states, the training should find a good average Q function. While training, the estimator uses its own output to train itself (commonly referred to as bootstrapping): it is as if it is chasing itself. Bootstrapping can lead to instability in the training process. There are many additional methods to help against such instability.
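
As a concrete illustration (a tabular toy example, not Spice.ai's engine; the sizes, hyperparameters, and transition below are made up), a single training step applies exactly that update:

```python
import numpy as np

n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.95          # learning rate and discount factor (placeholder values)
q = np.zeros((n_states, n_actions))

def q_update(state, action, reward, next_state):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = reward + gamma * np.max(q[next_state])
    q[state, action] += alpha * (target - q[state, action])

# Example transition observed from the environment (made-up numbers).
q_update(state=3, action=2, reward=1.0, next_state=7)
```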

From the rewards we give, sparse or not, binary or fine-grained, we obtain a smooth space of values for all our states/actions, so the AI can follow a greedy policy to the best outcome.

This way of training is not a silver bullet, and there is no guarantee that the AI will find a correlation between the information given as the state and the returned reward.

Conclusion

We can see how our rewards are used to train an AI's policies using Q-learning. By understanding the many iterations required and the bootstrapping issues, we can help our AI by carefully providing relevant state information and rewards:

  • There needs to be a correlation between the state information and the reward: the simpler the relationship, the easier/faster the AI will find it.
  • Sparse and binary rewards make the training problem long and arduous. Giving more information through the reward can tremendously increase the speed/accuracy of the learned Q-estimator.
  • The longer the chain of actions, the more complex the Q-value will be to estimate.

We didn't cover here how the AI's algorithm can explore different actions in an environment. Spice.ai's technology focuses exclusively on off-policy training, where we only have past data and cannot interact with the environment. RL is a vast topic that is currently growing quickly. Robotics is a fantastic field of application; many other areas are yet to be explored with such technology. We hope to push the technology and its fields of application forward with our platform.

If you'd like to partner with us on the mission of making new applications by leveraging RL, we invite you to discuss with us on Discord, reach out on Twitter or email us.

I hope you enjoyed this post and learned new things.

Corentin

Spice.ai v0.5-alpha

· 3 min read
Phillip LeBlanc
Co-Founder and CTO of Spice AI

We are excited to announce the release of Spice.ai v0.5-alpha! 🥇

Highlights include a new learning algorithm called "Soft Actor-Critic" (SAC), fixes to the behavior of spice upgrade, and a more consistent authoring experience for reward functions.

If you are new to Spice.ai, check out the getting started guide and star spiceai/spiceai on GitHub.

Highlights in v0.5-alpha

Soft Actor-Critic (Discrete) (SAC) Learning Algorithm

The addition of the Soft Actor-Critic (Discrete) (SAC) learning algorithm is a significant improvement to the power of the AI engine. It is not set as the default algorithm yet, so to start using it, pass the --learning-algorithm sacd parameter to spice train. We'd love to get your feedback on how it's working!

Consistent reward authoring experience

With the addition of reward function files, which allow you to edit your reward function in a Python file, the behavior of starting a new training session upon editing the reward function code was lost. With this release, that behavior is restored.

In addition, there is a breaking change to the variables used to access the observation state and interpretations. This change was made to better reflect the purpose of the variables and to make them easier to work with in Python:

| Previous (Type) | New (Type) |
| --- | --- |
| prev_state (SimpleNamespace) | current_state (dict) |
| prev_state.interpretations (list) | current_state_interpretations (list) |
| new_state (SimpleNamespace) | next_state (dict) |
| new_state.interpretations (list) | next_state_interpretations (list) |
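
For illustration only, here is a sketch of how the renamed variables might be used inside a reward calculation; the surrounding file structure should follow the Spice.ai reward function docs, and the temperature observation key below is a made-up example:

```python
# Sketch only: how the renamed variables are used in a reward calculation.
# The "temperature" key is a hypothetical observation field.

def compute_reward(current_state: dict, next_state: dict) -> float:
    # Reward moving the observed temperature closer to a desired setpoint.
    desired = 21.0
    before = abs(current_state["temperature"] - desired)
    after = abs(next_state["temperature"] - desired)
    return before - after  # positive if the new state is closer to the target
```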

Improved spice upgrade behavior

The Spice.ai CLI will no longer recommend "upgrading" to an older version. An issue was also fixed where trying to upgrade the Spice.ai CLI using spice upgrade on Linux would return an error.

New in this release

  • Adds a new learning algorithm called "Soft Actor-Critic" (SAC).
  • Updates the reward function parameters for the YAML code blocks from prev_state and new_state to current_state and next_state to be consistent with the reward function files.
  • Fixes an issue where editing a reward function file would not automatically trigger training.
  • Fixes the normalization of values for the Deep-Q Learning algorithm to handle larger values.
  • Fixes an issue where the Spice.ai CLI would not upgrade on Linux with the spice upgrade command.
  • Fixes an issue where the Spice.ai CLI would recommend an "upgrade" to an older version.

Resources

Community

Spice.ai started with the vision to make AI easy for developers. We are building Spice.ai in the open and with the community. Reach out on Discord or by email to get involved. We will also be starting a community call series soon!

AI needs AI-ready data

· 6 min read
Phillip LeBlanc
Co-Founder and CTO of Spice AI

A significant challenge when developing an app powered by AI is providing the machine learning (ML) engine with data in a format that it can use to learn. To do that, you need to normalize the numerical data, one-hot encode categorical data, and decide what to do with incomplete data - among other things.

This data handling is often challenging! For example, to learn from Bitcoin price data, it is better to normalize the prices to a range between -1 and 1. Values too close to 0 are also a problem because of the lack of precision in floating-point representations (usually below 1e-5).

As a developer, if you are new to AI and machine learning, a great talk that explains the basics is Machine Learning Zero to Hero. Spice.ai makes the process of getting the data into an AI-ready format easy by doing it for you!

What is AI-ready data?

You write code with if statements and functions, but your machine only understands 1s and 0s. When you write code, you leverage tools, like a compiler, to translate that human-readable code into a machine-readable format.

Similarly, data for AI needs to be translated or "compiled" to be understood by the ML engine. You may have heard of tensors before; they are simply another word for a multi-dimensional array and they are the language of ML engines. All inputs to and all outputs from the engine are in tensors. You could use the following techniques when converting (or "compiling") source data to a tensor.

  1. Normalization/standardization of the numerical input data. Many of the inputs and outputs in machine learning are interpreted as probability distributions. Much of the math that powers machine learning, such as softmax, tanh, sigmoid, etc., is meant to work in the [-1, 1] range.

Figure 1. Normalizing Bitcoin price data.

  2. Conversion of categorical data into numerical data. For categorical data (i.e., colors such as "red," "blue," or "green"), you can achieve this through a technique called "one-hot encoding." In one-hot encoding, each possible value in the category appears as a column. The values in the column are assigned a binary value of 1 or 0 depending on whether the value exists or not. (Both of these first two techniques are illustrated in the code sketch after this list.)

Figure 2. A visualization of one-hot encoding.

  3. Several advanced techniques exist for "compiling" this source data; this process is known in the AI world as "feature engineering." This article goes into more detail on feature engineering techniques if you are interested in learning more.
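
To make the first two techniques concrete, here is a small sketch using NumPy and pandas; the prices and colors are made-up sample values:

```python
import numpy as np
import pandas as pd

# 1. Normalize raw prices into the [-1, 1] range (min-max scaling).
prices = np.array([42_000.0, 47_500.0, 51_200.0, 39_800.0])  # made-up BTC prices
normalized = 2 * (prices - prices.min()) / (prices.max() - prices.min()) - 1

# 2. One-hot encode a categorical column: one binary column per possible value.
colors = pd.DataFrame({"color": ["red", "blue", "green", "red"]})
one_hot = pd.get_dummies(colors, columns=["color"])

print(normalized)
print(one_hot)
```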

There are excellent tools like Pandas, NumPy, SciPy, and others that make the process of data transformation easier. However, most of these tools are Python libraries and frameworks, which means having to learn Python if you don't know it already. Plus, when building intelligent apps (instead of just doing pure data analysis), this all needs to work on real-time data in production.

Building intelligent apps

The tools mentioned above are not designed for building real-time apps. They are often designed for analytics/data science.

In your app, you will need to do this data compilation in real time, and you can't rely on a local script to help process your data. It becomes trickier if the team responsible for the initial training of the machine learning model is not the team responsible for deploying it to production.

How data is loaded and processed for a static dataset is likely very different from how it is loaded and processed in real time while your app is live. The result is often two separate codebases, maintained by different teams, that are both responsible for doing the same thing! Ensuring that those codebases stay consistent and evolve together is another challenge to tackle.

Spice.ai helps developers build apps with real-time ML

Spice.ai handles the "compilation" of data for you.

You specify the data that your ML should learn from in a Spicepod. The Spice.ai runtime handles the logistics of gathering the data and compiling it into an AI-ready format.

It does this by using many techniques described earlier, such as normalization and one-hot encoding. And because we're continuing to evolve Spice.ai, our data compilation will only get better over time.

In addition, the design of the Spice.ai runtime naturally ensures that the data used for both the training and real-time cases is consistent. Spice.ai uses the same data components and runtime logic to produce the data. You can take this a step further and share your Spicepod with someone else, and they would be able to use the same AI-ready data for their applications.

Summary

Spice.ai handles the process of compiling your data into an AI-ready format in a way that is consistent both during the training and real-time stages of the ML engine. A Spicepod defines which data to get and where to get it. Sharing this Spicepod allows someone else to use the same AI-ready data format in their application.

Learn more and contribute

Building intelligent apps that leverage AI is still way too hard, even for advanced developers. Our mission is to make this as easy as creating a modern web page. If the vision resonates with you, join us!

Our Spice.ai Roadmap is public, and now that we have launched, the project and work are open for collaboration.

If you are interested in partnering, we'd love to talk. Try out Spice.ai, email us "hey," get in touch on Discord, or reach out on Twitter.

We are just getting started! 🚀

Phillip