

Scala

Basics:

- Method definition: def abc(a: Type): String = { ... }
- obj.abc reads a field (getter); obj.abc = pop invokes the setter
- val x: Type = value binds a value; variables created by val are immutable (use var for mutable ones)
- class SomeClass(x: Type) — plain constructor parameters get no getter or setter, and instances must be created with new
- case class SomeClass(x: Type) — constructor parameters get getters automatically, and new is not needed
- Instance methods go on classes; "static" methods go on singleton objects
- Tuples in Scala have a fixed size, and the elements can be of different types. Creating a tuple: val t: (S, B) = (new S, new B)
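A minimal sketch of the points above (the names Plain, Point, and the field names are illustrative, not from any particular library):

```scala
object ScalaBasicsDemo {
  // Plain class: the constructor parameter `x` is not a field,
  // so there is no getter, and instances require `new`.
  class Plain(x: Int)

  // Case class: parameters become vals with getters, and `new` is
  // not needed because the companion object provides `apply`.
  // A `var` parameter additionally gets a setter.
  case class Point(x: Int, var y: Int)

  def main(args: Array[String]): Unit = {
    val p = Point(1, 2)                     // no `new` needed
    println(p.x)                            // getter on `x`
    p.y = 5                                 // setter on the `var` parameter

    val t: (Int, String) = (42, "answer")   // fixed-size, mixed-type tuple
    println(t._1 + " " + t._2)
  }
}
```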
Recent posts

Parameters to compare performance based on

For K-means, run the algorithm while varying:

- the value of K
- the number of iterations
- the number of executors
- the size of the dataset

Then use regression trees to find the parameters that affect total execution time the most:

1. total execution time (the quantity to predict)
2. number of jobs
3. average time per job
4. number of stages
5. average time per stage
6. number of shuffle reads/writes
7. number of retries
8. number of executors
9. average number of tasks per stage
10. time per task
11. thread-pool values for the executors

Repeat for MLlib, ML, TensorFlow and Dataflow, and check which has the best performance. Check values for each executor - ????
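The sweep over K and iteration count can be sketched as below. This is a hedged illustration using a toy in-memory k-means (Lloyd's iterations, first-k initialization) so it runs without a cluster; in the actual experiments the inner call would be Spark MLlib's KMeans, with the listed metrics collected from the Spark UI:

```scala
object KMeansSweep {
  type Pt = (Double, Double)

  def dist2(a: Pt, b: Pt): Double = {
    val dx = a._1 - b._1; val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Toy k-means: Lloyd's iterations with the first k points as initial centers.
  def kmeans(points: Seq[Pt], k: Int, iters: Int): Seq[Pt] = {
    var centers: IndexedSeq[Pt] = points.take(k).toIndexedSeq
    for (_ <- 1 to iters) {
      // Assign each point to its nearest center, then recompute means.
      val assigned = points.groupBy(p => centers.indices.minBy(i => dist2(p, centers(i))))
      centers = centers.indices.map { i =>
        assigned.get(i)
          .map(ps => (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size))
          .getOrElse(centers(i)) // keep an empty cluster's center unchanged
      }
    }
    centers
  }

  def main(args: Array[String]): Unit = {
    val rnd = new scala.util.Random(0)
    val data = Seq.fill(1000)((rnd.nextDouble(), rnd.nextDouble()))
    // The sweep described above: vary K and the number of iterations,
    // recording execution time for each combination.
    for (k <- Seq(2, 4, 8); iters <- Seq(5, 20)) {
      val t0 = System.nanoTime()
      kmeans(data, k, iters)
      val ms = (System.nanoTime() - t0) / 1e6
      println(f"k=$k%d  iters=$iters%d  time=$ms%.1f ms")
    }
  }
}
```

Varying executor count and dataset size would be handled the same way, as outer loops over spark-submit configurations rather than in-process parameters.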

Comparing various tools

The level of distribution: TensorFlow achieves distribution at the Graph level, via subgraph execution; the individual components of a TensorFlow graph (Tensor/Variable/Operation) cannot themselves be distributed. Spark's distribution is achieved at the RDD level, which is the foundation of Spark, so every RDD operation and every computational graph built on RDDs is distributed. TensorFlow supports asynchronous training: it arises naturally from the concurrent execution of replicated subgraphs, and synchronous training is also possible in distributed TensorFlow. Spark only supports synchronous computation, since it follows the Bulk Synchronous Parallel (BSP) model; asynchronous training in Spark MLlib is therefore rare. TensorFlow supports a parameter-server and worker structure: in distributed TensorFlow, the user can assign each device either a ps task or a worker task. I think this feature is...
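The BSP point can be made concrete with a small sketch, not from either framework but a plain-Scala illustration of synchronous supersteps: every worker computes, then all block at a barrier before anyone starts the next step, which is why per-worker asynchronous updates do not fit Spark's execution model:

```scala
import java.util.concurrent.CyclicBarrier
import java.util.concurrent.atomic.AtomicInteger

object BspSketch {
  // Runs `workers` threads for `steps` synchronous supersteps and returns
  // the total number of (worker, step) units completed.
  def run(workers: Int, steps: Int): Int = {
    val barrier = new CyclicBarrier(workers)
    val done = new AtomicInteger(0)
    val threads = (1 to workers).map { id =>
      new Thread(() => {
        for (step <- 1 to steps) {
          done.incrementAndGet() // stand-in for the local compute phase
          barrier.await()        // barrier: no worker starts step+1 early
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    done.get()
  }

  def main(args: Array[String]): Unit =
    println(s"completed units: ${run(4, 3)}") // 4 workers x 3 supersteps
}
```

In an asynchronous parameter-server setup there is no such barrier: each worker pushes gradients and pulls parameters on its own schedule.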