Sometimes it is hard to believe that a world before ML ever existed. So many modern data analyses are built on top of ML techniques, and this will remain true for the foreseeable future. However, not everyone is able to benefit from these vast advances, because applying ML techniques usually means writing Python code and understanding many new technologies. Especially when Big Data and distributed systems enter the game, things get messy.
Even the best data does not matter if no one can access it. Therefore, it is crucial to make data accessible to everyone, without requiring them to learn complicated programming languages. This is the problem that SQL query engines try to solve: they allow everyone with some SQL knowledge to access data stored in typical big data systems.
Problem solved? Maybe not. Because in many cases, your data does not live in an easy-to-access data lake like S3. And the computations you want to apply to it are machine learning algorithms or other (typically Python-based) complex…
In the last post, we explored how
tsfresh automatically extracts many time-series features from your input data. We also discussed two possibilities to speed up the feature extraction: using multiple cores on your local machine (which is already enabled by default) or distributing the calculation over a cluster of machines.
In this post, we will go one step further: what happens if your data is so large that loading it onto your scheduling machine is no longer an option? For many applications, this is not the case (and keeping everything local speeds up the development…
Time Series data is everywhere today. From stock market trends to EEG measurements, from Industry 4.0 production lines to IoT sensors, temporally annotated data hides in a lot of places. It comes in many flavors, sizes, and complexities.
In this series of two posts, we will explore how we can extract features from time series using
tsfresh - even when the time series data is very large and the computation takes a very long time on a single core.
But first, let’s define some common properties of time series data:
As sensors get cheaper and smaller, the amount of data collected around us increases tremendously. Finding good algorithms and techniques to dig through this flood of data is key here. In this post, we will look into a very typical example: sensor data time series classification. Think of predictive maintenance (you want to classify if the machine the sensor is monitoring will break soon), earthquake prediction (is this seismogram signal likely to come from a starting earthquake?), or Industry 4.0 applications (the sensor data can be used to decide if the quality of the produced piece is good enough).
GitHub Actions is the new CI/CD tool on the block, and it comes with some very nice benefits: a simple workflow file syntax, good documentation, very nice integration with GitHub (obviously), and, most importantly, the possibility to write and share your own actions easily.
There exists a large marketplace of community-written actions, but even if you do not find what you are looking for, it is very easy to write your own. In this blog post, we will walk through the process of writing, testing, and publishing a simple GitHub Action written in TypeScript from scratch. Its task will be to read in…
This post will describe what an SQL query engine is and how you can use dask-sql to analyze your (big) data quickly and easily, and even call complex algorithms, such as machine learning, from SQL.
If data is the new oil, SQL is its pipeline. SQL used to be “only” the language for accessing traditional relational OLTP databases. …
Public and private clouds give developers the opportunity to run and deploy their software artifacts very quickly and to scale computations, for example Machine Learning training, to more-or-less unlimited capacity. How to bridge the gap between local development on a developer's own computer and the cluster, however, remains a crucial open question.
This is the problem that online, browser-only IDEs are trying to solve. Running the IDE directly on the cluster (no matter if public cloud or private company data center) has a large number of benefits
Python Enthusiast, Data Engineer/Scientist