With the newest version of dask-sql, it is possible to use the huge Python ML ecosystem from SQL

Photo by Christopher Burns on Unsplash

Sometimes it is hard to believe that a world before ML existed. So many modern data analyses are built on top of ML techniques, and this will continue for the foreseeable future. However, not everyone is able to benefit from these vast advances, because using ML techniques mostly involves using Python, developing code, and understanding many new technologies. Especially when Big Data and distributed systems enter the game, things get messy.

This is a problem that SQL query engines are trying to solve. They make it possible to use the complicated world of distributed systems with simple, yet…
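As a taste of what this looks like in practice, here is a minimal sketch of training and applying a scikit-learn model purely through SQL statements (the example data comes from dask's demo dataset; the exact CREATE MODEL syntax may differ between dask-sql versions):

```python
from dask.datasets import timeseries
from dask_sql import Context

c = Context()
# register a demo dask dataframe as a SQL table
c.create_table("timeseries", timeseries())

# train a scikit-learn model entirely from SQL
c.sql("""
    CREATE MODEL my_model WITH (
        model_class = 'sklearn.ensemble.GradientBoostingClassifier',
        wrap_predict = True,
        target_column = 'target'
    ) AS (
        SELECT x, y, x*y > 0 AS target
        FROM timeseries
        LIMIT 100
    )
""")

# apply the trained model to new rows
result = c.sql("""
    SELECT * FROM PREDICT(
        MODEL my_model,
        SELECT x, y FROM timeseries
    )
""")
print(result.compute())
```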


Learn how to deploy a Python SQL Engine to your k8s cluster and run complex Python functions from SQL

Photo by Fitore F on Unsplash

Even the best data does not matter if no one can access it. Therefore, it is crucial to make data accessible to everyone, without the need for them to learn complicated programming languages. This is a problem that SQL query engines try to solve. They allow everyone with some SQL knowledge to access data stored in typical big data systems.

Problem solved? Maybe not. Because in many cases, your data does not live in a nice-to-access data lake like S3. And the computations you want to apply to it are machine learning algorithms or other (typically Python-based) complex…
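To give a flavor of the second part, here is a minimal sketch of exposing a custom Python function to SQL with dask-sql's register_function (the table, data, and function are made up for illustration):

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask_sql import Context

c = Context()
df = dd.from_pandas(pd.DataFrame({"x": [1.0, 2.0, 3.0]}), npartitions=1)
c.create_table("data", df)

# any Python function -- a stand-in for arbitrarily complex (e.g. ML) logic
def my_function(x):
    return x ** 2

# make it callable from SQL; parameter and return types must be given
c.register_function(my_function, "my_function", [("x", np.float64)], np.float64)

result = c.sql("SELECT my_function(x) AS x_squared FROM data")
print(result.compute())
```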


If big is not big enough

Photo by Nathan Anderson on Unsplash

In the last post, we explored how tsfresh automatically extracts many time-series features from your input data. We also discussed two ways to speed up the feature extraction: using multiple cores on your local machine (which is already turned on by default) or distributing the calculation over a cluster of machines.

In this post, we will go one step further: what happens if your data is so large that loading it into your scheduling machine is not an option anymore? For many applications, this is not the case (and keeping everything local speeds up the development…
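As a preview of where we are heading, here is a minimal sketch using tsfresh's dask bindings, assuming your data already lives in (hypothetical) parquet files in the long format with id, kind, time, and value columns:

```python
import dask.dataframe as dd
from tsfresh.convenience.bindings import dask_feature_extraction_on_chunk
from tsfresh.feature_extraction.settings import MinimalFCParameters

# read data that never fits on a single machine directly into a dask dataframe
df = dd.read_parquet("data/*.parquet")  # hypothetical path

# tsfresh's dask binding expects the data grouped by time series id and kind
df_grouped = df.groupby(["id", "kind"])

# the feature extraction is built up lazily and runs distributed on the cluster
features = dask_feature_extraction_on_chunk(
    df_grouped,
    column_id="id",
    column_kind="kind",
    column_sort="time",
    column_value="value",
    default_fc_parameters=MinimalFCParameters(),
)
```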


Making Sense of Big Data

Sometimes, you just need to go big

Photo by Chris Liverani on Unsplash

Time Series data is everywhere today. From stock market trends to EEG measurements, from Industry 4.0 production lines to IoT sensors: temporally annotated data hides in a lot of places. It comes in many flavors, sizes, and complexities.

In this series of two posts, we will explore how we can extract features from time series using tsfresh - even when the time series data is very large and the computation takes a very long time on a single core.

But first, let’s define some common properties of time series data:

  • The data is indexed by some discrete “time” variable…
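To make this concrete, here is a minimal, hypothetical example of time series data in the long format tsfresh expects, together with a basic feature extraction call:

```python
import pandas as pd
from tsfresh import extract_features

# toy data: one row per measurement, two time series ("id" 1 and 2)
df = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2],   # which time series a row belongs to
    "time":  [0, 1, 2, 0, 1, 2],   # the discrete "time" index
    "value": [0.1, 0.5, 0.2, 1.1, 0.9, 1.4],
})

# extracts hundreds of features (mean, entropy, FFT coefficients, ...) per id
features = extract_features(df, column_id="id", column_sort="time")
```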


In this post, we will build a real-time feature extraction pipeline for time series data.

Let the data flow like the sand in this picture. Photo by George Fitzmaurice on Unsplash

Introduction

As sensors get cheaper and smaller, the amount of data collected around us increases tremendously. Finding good algorithms and techniques to dig through this flood of data is key. In this post, we will look into a very typical example: sensor data time series classification. Think of predictive maintenance (you want to classify whether the machine the sensor is monitoring will break soon), earthquake prediction (is this seismogram signal likely to come from a starting earthquake?), or Industry 4.0 applications (the sensor data can be used to decide whether the quality of the produced piece is good enough).

But…
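As a rough preview of the kind of pipeline this post builds, here is a minimal sketch that turns synthetic stand-in sensor data into feature vectors with tsfresh and trains a standard scikit-learn classifier on them:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import impute

# synthetic stand-in for real sensor data: 20 "machines" with 50 readings each;
# the second half drifts upwards, our stand-in for "about to break"
rng = np.random.default_rng(42)
frames = []
for machine_id in range(20):
    drift = 0.05 * np.arange(50) if machine_id >= 10 else 0.0
    frames.append(pd.DataFrame({
        "id": machine_id,
        "time": np.arange(50),
        "value": rng.normal(size=50) + drift,
    }))
sensor_df = pd.concat(frames)
labels = pd.Series([0] * 10 + [1] * 10)  # "will this machine break soon?"

# turn each raw time series into a fixed-length feature vector ...
X = extract_features(sensor_df, column_id="id", column_sort="time")
impute(X)  # replace NaN/inf features in place

# ... and feed the vectors into any standard classifier
clf = RandomForestClassifier(random_state=0).fit(X, labels)
```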


The result of the action we will build in this blog post.

Introduction

GitHub Actions is GitHub's new CI/CD tool and comes with some very nice benefits: a simple workflow file syntax, good documentation, very tight integration with GitHub (obviously), and, most importantly, the possibility to write and share your own actions easily.

There is a large marketplace of community-written actions, but even if you do not find what you are looking for, it is very easy to write your own. In this blog post, we will walk through the process of writing, testing, and publishing a simple GitHub action written in TypeScript from scratch. Its task will be to read in…


How to get the best of all worlds

This post will describe what an SQL query engine is and how you can use dask-sql to analyze your (big) data quickly and easily, and even call complex algorithms, such as machine learning models, from SQL.

Photo by Moritz Kindler on Unsplash

SQL rules the world

If data is the new oil, SQL is its pipeline. SQL used to be “only” the language for accessing traditional relational OLTP databases. …
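To illustrate the basic idea, here is a minimal sketch of querying a dask dataframe with plain SQL via dask-sql (the file and column names are hypothetical):

```python
import dask.dataframe as dd
from dask_sql import Context

c = Context()

# register any dask dataframe as a SQL table
df = dd.read_csv("sales.csv")  # hypothetical input file
c.create_table("sales", df)

# plain SQL, translated into lazy dask operations and executed in parallel
result = c.sql("""
    SELECT customer, SUM(amount) AS total
    FROM sales
    GROUP BY customer
""")
print(result.compute())
```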


The indispensable stock IDE picture (photo by Clément H from unsplash.com)

Public and private clouds give developers the opportunity to run and deploy their software artifacts very quickly and to scale computations, for example Machine Learning training, to more-or-less unlimited compute power. However, how to bridge the gap between local development on the developer's machine and the cluster remains a crucial open question.

This is a problem that online, browser-only IDEs are trying to solve. Running the IDE directly on the cluster (no matter if public cloud or private company datacenter) has a large number of benefits:

  • use the power of the cluster directly in your IDE, e.g. …

Nils Braun

Python Enthusiast, Data Engineer/Scientist
