
Tools and frameworks

Data science on Apache Spark is very data-heavy: numerical computation is performed over data that resides on multiple nodes, often with complex data types. Machine learning algorithms, by contrast, are very computation-heavy: they typically operate on simple data types and do not require much data movement.

How can performance be improved?

1. Scale up: move towards better computational hardware (GPUs, new specialized chips).
2. Scale out: increase the number of systems.
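The scale-out approach can be sketched with Python's standard library: a pool of worker processes stands in for cluster nodes, each computing over only its own partition of the data, followed by a final reduce step. (The sum-of-squares workload here is an illustrative assumption, not something from this post.)

```python
from multiprocessing import Pool

def partial_sum_of_squares(chunk):
    # Each "node" computes over only its partition of the data.
    return sum(x * x for x in chunk)

def distributed_sum_of_squares(data, n_workers=4):
    # Partition the data: one chunk per worker (round-robin striding).
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        # Map: each worker computes a partial result in parallel.
        partials = pool.map(partial_sum_of_squares, chunks)
    # Reduce: combine the partial results into the final answer.
    return sum(partials)

if __name__ == "__main__":
    data = list(range(1_000))
    print(distributed_sum_of_squares(data))
```

The map-then-reduce shape is the same one Spark uses across machines; here the "data movement" cost is just inter-process communication, which is why compute-heavy, low-data-movement workloads scale out well.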

Frameworks that scale by increasing the number of nodes:

1. Apache Spark MLlib
2. TensorFlow on Spark
3. Google Cloud Machine Learning
4. Azure Machine Learning Studio
5. Google Cloud Dataflow




