From:                              route@monster.com

Sent:                               Monday, September 28, 2015 1:00 PM

To:                                   hg@apeironinc.com

Subject:                          Please review this candidate for: Talend

 

This resume has been forwarded to you at the request of Monster User xapeix03

Ankit Rakha 

Last updated:  02/11/15

Job Title:  not specified

Company:  not specified

Rating:  Not Rated

Screening score:  not specified

Status:  Resume Received


Palo Alto, CA  94301
US

Mobile: 4104020406   
ankit21@cs.jhu.edu
http://www.cs.jhu.edu/~ankit21/
Contact Preference:  Mobile Phone


 

 

RESUME

  

Resume Headline: Ankit Rakha - Big Data and Analytics Practice Lead

Resume Value: 4ru9bgiexvdc2b7u   

  

 

Ankit Rakha
17 E. Sir Francis Drake Blvd. —— Suite 110 —— Larkspur, CA 94939
Phone: 410-402-0406 | GitHub page: https://github.com/ankit-rakha
E-Mail: ankit21@cs.jhu.edu
Education
o MSE in Computer Science, Johns Hopkins University, USA (Jan 2007 – April 2008)
Advisors - Prof. Mitra Basu & Edward J. Schaefer Professor S. Rao Kosaraju
Master's Thesis - "A Quantitative Study of Ensemble Characteristics of peptides that bind to MHC class I molecule"
o B.Tech in Electronics and Communication, MMET Deemed University, India (Sept 2001 – May 2005)
Publications & Patents
o Journal Articles:
Human Mutation 2011 - "Rapid and efficient human mutation detection using a bench-top next-generation
DNA sequencer" - Qian Jiang, Tychele Turner, Maria Sosa, Ankit Rakha, Stacey Arnold and Aravinda
Chakravarti
Circulation Research 2012 – “Effects of Rare and Common Blood Pressure Gene Variants on Essential
Hypertension: Results from the FBPP, CLUE and ARIC Studies” - Khanh-Dung H. Nguyen, Vasyl Pihur, Santhi K.
Ganesh, Ankit Rakha, Richard Cooper, Steven C. Hunt, Barry I. Freedman, Joe Coresh, Wen H. L. Kao, Alanna C.
Morrison, Eric Boerwinkle, Georg B. Ehret, Aravinda Chakravarti
o Patents:
A Method of detecting retained foreign bodies (RFBs) - Joint Collaboration with Johns Hopkins Bayview Medical
Center, Reference No. C10417 - Ralph Etienne-Cummings, Ankit Rakha, Bolanle Asiyanbola
o Poster Presentation:
RECOMB 2008 - “HLA class I peptides: Exploiting positional information for identification and classification” -
Mitra Basu, Ankit Rakha
ISMB 2008 - “Identification of key amino acid properties for HLA Class I peptides: A Computational Approach” -
Ankit Rakha, Mitra Basu, Rao Kosaraju
Experience
o Big Data and Analytics Practice Lead, Lilien Systems
March 2014 – Present
1. Develop and implement Big Data and Analytics solutions, including design, deployment, testing
and post-installation support.
2. Work with the Lilien Infrastructure and Network Solution Architects to design and implement
comprehensive “soup to nuts” Analytic solutions, integrating into the customer’s infrastructure across the
data center.
3. Identify and scope the services necessary to implement solutions including the preparation of pricing
estimates, statements of work and RFP responses.
4. Set up the Vertica database for the development environment in AWS and for the production environment in
the data center.
5. The Vertica setup, including schema creation, roles, users and permissions, and offloading of data from the
Oracle warehouse, was automated using Qubell and integrated with F5 for load balancing, Tableau for
visualization and Talend for data integration.
6. Set up development and production Hadoop clusters using the Cloudera distribution.
7. Upgraded the production Vertica database from version 6.x to 7.x.
8. Designed and implemented the regular data roll-out from Vertica to the Hadoop cluster. We keep only three
months of data in Vertica; the rest goes to Hadoop. The data in the Hadoop cluster is accessed as external
Hive tables and, on the Vertica side, as external Vertica tables.
9. Integrated R-Vertica, R-Hadoop (R-Impala & R-Spark) and Vertica-AWS in order to perform metric and
predictive analytics on game data.
10. Architected and proposed a solution for activator discovery, profile, message and lead vendor
screening using state-of-the-art machine learning algorithms on distributed computing frameworks. The
ecosystem consists of big data technologies such as Vertica, Hadoop, Distributed R and Spark.
11. Architected the Hadoop cluster for their recent initiative to create a hybrid platform using Apache
Kafka, Vertica and Hadoop. Extensively involved in creating the specifications of the Hadoop, Vertica
and distributed messaging queue nodes while designing the infrastructure layer.
12. Architected and proposed a predictive real-time analytics platform for churn prediction: identifying
first-day churners through the first-session experience, monetization alternatives for potentially churning
players in the first session, churn prediction from a customer’s last x sessions, and many more use cases.
The ecosystem consists of a seamless integration of VoltDB, Vertica and Hadoop and uses
machine learning algorithms such as linear regression, k-means, SVMs, decision trees and PageRank tailored
to the particular use cases.
o Software Engineer – Big Data Team, Serendipity (Parent company: GCE)
Jan 2013 – March 2014
1) DevOps
a) Maintained 25 cluster nodes / 1200 cores running CentOS 6.4. Administration responsibilities included
assembling the servers, installing the OS using PXE, updating system software (e.g., upgrading gcc to
the latest version from source), and installing the Vertica, Hadoop and MPI stacks along with Solr.
b) Wrote a custom build script with clean and fast options for application developers to build and deploy
the MPI project.
c) Installed and used Chef for configuration management.
d) Also provided support in migrating federal government datasets from the Vertica database to HBase.
We wrote custom tools (DataSync, DataImporter and DataProvider) for this purpose.
2) Distributed computing using MPI
Worked on creating a low-latency distributed computing system for real-time analytics. We used Open MPI
1.7.2 and Intel MPI 4.1.2 (for multi-threading support), the C programming language and HBase as working
components for this project. Introduced multiple sub-projects:
a) Statistical algorithm service – Implemented a wide range of statistical algorithms in a distributed fashion
using MPI routines such as MPI_Send, MPI_Recv, MPI_Reduce, MPI_Allreduce, MPI_Gather, MPI_Gatherv and
MPI_Bcast. Implemented algorithms include numerical histogram, categorical histogram, min-max, k-means,
regression, Pearson and Spearman correlation coefficients, line intersection, percent distinct, quantile,
anomaly detection and median.
b) Data service – Worked with other team members to create a service that uses Apache Thrift API to
read data from HBase and create a memory store.
c) Instruction topology – Created multi-threaded topologies with execution plan to recommend
visualizations to users.
d) Graph service – Represented the hierarchical data in the form of graph nodes and edges to
implement zoom-in and zoom-out capabilities while analyzing the data.
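The statistical algorithm service above combined per-rank partial results with MPI reductions. The pattern can be sketched in plain Python for illustration (the actual service was written in C against routines such as MPI_Reduce; the summary fields here are illustrative):

```python
from functools import reduce

def local_summary(chunk):
    """Per-rank partial statistics over one data shard."""
    return {"n": len(chunk), "sum": sum(chunk),
            "min": min(chunk), "max": max(chunk)}

def reduce_summaries(a, b):
    """Combine two partial summaries; associative, like an MPI_Reduce op."""
    return {"n": a["n"] + b["n"], "sum": a["sum"] + b["sum"],
            "min": min(a["min"], b["min"]), "max": max(a["max"], b["max"])}

def distributed_stats(shards):
    """Reduce all per-shard summaries into global statistics."""
    total = reduce(reduce_summaries, (local_summary(s) for s in shards))
    return {"mean": total["sum"] / total["n"],
            "min": total["min"], "max": total["max"]}
```

Because the combine step is associative, MPI can apply it in a tree across ranks, which is what makes the reduction scale.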
3) Graph Database
Evaluated two graph databases – Neo4j and Titan – for semantic analysis work. Migrated data from the
Vertica database to the graph databases so that information can be represented in the form of nodes
and edges. Used the batch-import and parallel batch inserter tools for migration of data into Neo4j
(https://github.com/jexp/batch-import) and Faunus + HBase for the Titan graph database.
o Software Engineer, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of
Medicine
May 2008 – Dec 2012
A) Software development & Computational analysis
1) Responsible for implementing, documenting and maintaining software tools, scripts and codes.
2) Worked with biologists and statisticians in a research environment to meet their dynamic programming
needs.
3) Developed and applied molecular genetic, genomic and computational methods for the
identification of human disease genes. Dealt with immense data sets and developed
new and better algorithms to analyze them, using parallel computing platforms for the development,
evaluation and application of statistical and computational methods on tens of thousands of genomic
arrays. The algorithms we used require large-scale memory access and a lot of matrix manipulation, so
we utilized grids hosted at Johns Hopkins University and the University of Maryland for data processing and
interactive simulations with very large, real-life data sets. We broke serial computations into
independent parallel tasks, as in the parallelization of for loops or Monte Carlo simulations, to perform the
analysis very fast.
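That serial-to-parallel decomposition can be sketched in Python (a minimal illustration, not the original grid code; the mapper defaults to serial map, and in practice one would pass something like multiprocessing.Pool().map to fan the independent tasks out):

```python
import random

def simulate(args):
    """One independent task: count darts landing inside the unit circle."""
    seed, n = args
    rng = random.Random(seed)  # per-task RNG keeps tasks independent
    return sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(n))

def estimate_pi(n_tasks=8, n_per_task=50_000, mapper=map):
    """Monte Carlo pi: the serial loop split into independent tasks."""
    tasks = [(seed, n_per_task) for seed in range(n_tasks)]
    hits = sum(mapper(simulate, tasks))
    return 4.0 * hits / (n_tasks * n_per_task)
```

Because each task carries its own seed and shares no state, the tasks can be distributed across cores or grid nodes without changing the result's statistical properties.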
B) IT consultation
1) Analyzed IT requirements and provided strategic guidance to research groups within Johns Hopkins
Hospital with regard to IT technology, cloud service infrastructures and distributed computing. This
included, but was not limited to, introductions to the Apache Hadoop ecosystem, scale-out storage and
high-performance enterprise file management platforms such as EMC Isilon, OneFS, GPFS and GlusterFS.
2) Organized seminars and led meetings with IT teams from Red Hat, Cloudera, Infochimps, IBM and
EMC.
3) Prepared documentation and presented progress reports to the principal investigators regarding new
distributed computing infrastructure and return on their investments.
C) Database development
1) Reviewed process designs and stored procedures to correct errors and make necessary modifications to
SQL code in the internal database system.
2) Worked with lab members to write new database procedures, functions and triggers.
D) System administration
1) Interacted and worked closely with the system administrator, LAN administrator and programmer analysts
to develop specifications for solutions that were ultimately implemented by the developer.
2) Maintained the lab’s raw data set repository, local web servers for publicly available and licensed
databases, and remote access and internal server accounts for current lab members, and archived former
lab members’ server data.
E) Web programming
1) Responsibilities included both front-end and back-end programming using different server-side
frameworks to build database driven websites, cross-browser compatibility testing, and search-engine
optimization tasks for deployed websites.
o Research Assistant, Computer Science Department, JHU
June 2007 - Aug 2007
Worked on machine learning techniques to characterize HLA peptide-binding predictions associated with
arthritis. Worked closely with collaborators and research groups to develop streamlined processes,
workflows, and database applications for bioinformatics. Prepared and maintained
documentation to support use of the workflow and database applications developed. Also created and
presented tutorials on bioinformatics, genomics and proteomics software and databases.
Skill Sets
o Programming Languages – C/C++ (excellent knowledge, several team projects of more than 25K lines of code)
o Scripting languages – Unix shell scripting (well mastered), Perl, sed & awk
o Statistical languages – MATLAB, R (used heavily for plotting)
o Data Interchange formats: JSON, Google Protocol Buffers
o Data Integration tools: Talend, Syncsort DMX-h
o Visualization tools: Tableau, Yellowfin
o CMS – Drupal (versions 6 & 7), WordPress
o Database systems – HBase, Vertica, VoltDB, Riak, PostgreSQL, MySQL, Neo4j, Titan
o Version control system – git, SVN
o Editors/IDE – Xcode, Intellij, Netbeans, AptanaStudio3, TextMate, Sublime, Notepad++
o OS - Mac OS X, Red Hat Enterprise Linux (RHEL), CentOS, Fedora Linux, Ubuntu, Windows XP, 7 & 8
o Cloud computing
Parallel & distributed computing using SGE and TORQUE, Amazon EC2 for cluster-based applications; used
Apache Hadoop, the Hadoop Distributed File System & MapReduce for large datasets and number-crunching
problems. Highly familiar with clusters hosted at the following sites:
1) Sun Grid Engine hosted by IGM, Johns Hopkins University – School of Medicine
(https://paradigm.jhmi.edu/services/hpc)
2) Sun Grid Engine hosted by Biostat department, Johns Hopkins School of Public Health
(http://www.hpscc.jhsph.edu/)
3) Terascale Open-Source Resource and QUEue Manager hosted by Johns Hopkins Physics and
Astronomy Department (https://hhpc.idies.jhu.edu/wiki/index.php/Main_Page)
4) Data Intensive Academic Grid hosted by IGS, University of Maryland – School of Medicine
(http://diagcomputing.org/)
5) Amazon Web Services - http://aws.amazon.com
o Bioinformatics Software & Databases – PLINK, Galaxy, Haploview, Beagle, SAMtools, VCFtools, tabix, snpEff,
PolyPhen, SIFT, ANNOVAR, bigWig, EVS, GATK
Projects
DATABASE & WEB-APPLICATION DEVELOPMENT
o HDRC – A collaborative research website (https://hdrcstudy.org/)
Building an early-stage collaborative research website using Drupal 7. It required hands-on development in
PHP, MySQL, HTML, JavaScript, and CSS. Performed browser testing, security audits, and stress tests. Maintained
web analytics, conversion reporting, and log analysis tools, including Google Analytics and Urchin tracking
code. Administered MySQL databases. Wrote command-line utilities and scheduled tasks using cron.
o LabWorks – Internal Database application (http://labase-web1.igm.jhmi.edu:8500/labworks/login/login.cfm)
Designed and implemented modules of a database web application using Adobe ColdFusion, Dreamweaver,
PostgreSQL & pgAdmin III. The application has six major components and supports 11 PostgreSQL databases. The
major components include enrolling new patients and their information into the database in encrypted form,
organizing their DNA samples in hierarchical order, maintaining corresponding genotypes for all the
SNPs that were genotyped, tracking the location of samples in cold storage, a file uploader for keeping
track of the protocols used in the genotyping phase, and an administration section for adding new groups
and members.
o ACLAB website – An iWeb project (https://aravindachakravartilab.org)
Designed and built a simple, clean and crisp customized website using Apple’s iWeb software (3.0.3) and
JavaScript.
o Request Tracking System – Ruby on Rails project (http://alabwiki.jhmi.edu:3000/login)
Created components of an RTS using the Ruby on Rails framework to manage requests for administrative
managers efficiently. It provides flexible role-based access control for a group / lab.
o ARIC & FEHGAS Wiki system
(http://162.129.237.10/aric/index.php/Main_Page,http://162.129.237.10/jhu/index.php/Main_Page)
Maintaining online database wikis with MySQL database at the back end to share data and information
between collaborators of an ongoing research project.
SOFTWARE DEVELOPMENT
o HSCR Exome Sequencing Project – Data analysis and text mining using map-reduce paradigm
Currently analyzing exome data for Hirschsprung disease in 304 individuals (3.3 TB of binary data) using the
Genome Analysis Toolkit framework developed by the Broad Institute. We take extensive advantage of the
map-reduce paradigm in quality control of the data. The tools being used include various walkers such as the
depth-of-coverage analyzer, SNP/indel caller and variant quality score recalibrator. The annotation part of
the project concentrates on using text mining techniques to extract information about genes, proteins and
their functional relationships from HTML web pages, text files and backend databases.
o RSNG Sequencing Project – An automated pipeline for identification of potential heterozygotes
In this project, we implemented a computationally efficient Linux program using customized Unix shell
scripts, MATLAB Distributed Computing Server and various sub-components, including the Parallel Computing,
Bioinformatics and Statistics toolboxes, that compares sequences from 11 genes across a large number of
amplicons obtained from different individuals (no. of individuals = 560) to identify heterozygous sites for
single nucleotide substitutions.
o GS Junior Sequencing Project – Rapid and efficient human mutation detection using a bench-top
next-generation DNA sequencer
Analyzed large sequencing data sets generated by a bench-top next-generation DNA sequencer. This involved
calculating the similarity between consecutive runs for the same set of patients and determining the
accuracy of the machine. Also generated a grid output for users so that they can adjust their
amplicon pooling after each run.
o Image Processing Project – A Method of detecting retained foreign bodies (RFBs)
In this research project, we developed a model that detects whether any surgical instrument was left in a
patient’s body after an operation. The work uses an SVM-based method to classify X-ray images that show
leftover surgical instruments.
o SLAM - Simultaneous Localization And Mapping (Cross Collaboration with NASA, UMD and UAF)
Worked in a collaborative environment on developing a computationally efficient sensing and modeling
method that uses images from one or more cameras on a robotic vehicle to build a world map and to
accurately localize the robot within this map.
o Bioinformatics Project - Protein-Peptide Binding Prediction based on Learning Peptide Distance Functions
Designed and implemented data mining and machine learning techniques for the classification of peptides
binding to the HLA protein using multiclass AdaBoost and Gaussian mixture models. The general approach can
be considered a semi-supervised learning problem with partial information in the form of equivalence
constraints, and it also involves learning peptide-peptide distance functions.
o Machine Learning Project - Supervised & Unsupervised Learning for Cluster Analysis
Implemented algorithms that can discover high-purity clusters in unsupervised, large, very noisy,
high-dimensional datasets where most of the data points do not cluster well. Application domains include
market-basket data.
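One simple way to realize the idea of keeping only the clusters that form well and discarding the points that do not is greedy leader clustering with a density cutoff. The sketch below is an illustrative stand-in, not the algorithms actually implemented:

```python
def dense_clusters(points, radius, min_size):
    """Greedy leader clustering that keeps only dense clusters.

    Each point joins the first cluster whose leader is within `radius`;
    otherwise it starts a new cluster. Clusters smaller than `min_size`
    are discarded, dropping the noisy points that do not cluster well.
    """
    clusters = []
    for p in points:
        for c in clusters:
            leader = c[0]
            if sum((a - b) ** 2 for a, b in zip(p, leader)) <= radius ** 2:
                c.append(p)
                break
        else:
            clusters.append([p])  # p becomes leader of a new cluster
    return [c for c in clusters if len(c) >= min_size]
```

A single pass over the data makes this cheap enough for large datasets, at the cost of sensitivity to point order; real high-purity clustering methods add refinement passes on top of such a seeding step.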
o Robotics Project - RoboCup, Robot Soccer World Cup
Vision – Designed and implemented the AdaBoost algorithm with a variation of a weighted nearest-neighbor
classifier as the weak learner for classification and detection of the soccer ball independent of its color, in
the presence of changing illumination, texture and partial occlusion.
Non-Profit & Volunteer Work
1) Lend to alleviate poverty through a non-profit organization named Kiva. Kiva works with
microfinance institutions on five continents to provide loans to people without access to traditional
banking systems. URL: http://www.kiva.org/lender/ankitrakha
2) Contributing member of a distributed computing project named SETI@HOME. SETI@home is a scientific
experiment that uses Internet-connected computers in the Search for Extraterrestrial Intelligence (SETI).
URL: http://setiathome.berkeley.edu/view_profile.php?userid=9795743
Keywords
Software & Web Applications Development, BIG DATA, Machine Learning, Statistical Analysis, Computational
Biology, Data Mining, Pattern Recognition, Artificial Intelligence, Bioinformatics, Multiple Classification &
Clustering Analysis



Experience


 

Job Title

Company

Experience

Big Data and Analytics Practice Lead

Lilien Systems

- Present

 

Additional Info


 

Desired Salary/Wage:

190.00 - 225.00 USD yr

Current Career Level:

Manager (Manager/Supervisor of Staff)

Years of relevant work experience:

2+ to 5 Years

Date of Availability:

Immediately

Work Status:

US - I require sponsorship to work in this country.

Active Security Clearance:

None

US Military Service:

Citizenship:

Other

 

 

Target Job:

Target Job Title:

Big Data and Analytics Architect

Desired Job Type:

Employee

Desired Status:

Full-Time

 

Target Company:

Company Size:

Occupation:

IT/Software Development

·  Software/System Architecture

 

Target Locations:

Selected Locations:

US-CA-Silicon Valley/Peninsula

Relocate:

Yes

Willingness to travel:

Up to 25% travel

 

Languages:

Languages

Proficiency Level

English

Fluent