3D Printing Tolerances and Joints

When printing pieces that will interlock or interact with each other, it is important to realize that a 5 mm peg will not fit a 5 mm hole. What is needed is a small offset so that the pieces fit into each other. The difference is very small, often measured in ten-thousandths of an inch (1/10,000″) or in tenths of a millimeter (0.1 mm).

These small offsets are necessary but printer dependent, and they define the lower limit on the size of the interacting pieces you can print. Below a certain threshold, the printer’s inherent resolution prevents it from reproducing the necessary detail.
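
As a worked example, here is a minimal Python sketch of the arithmetic; the 5 mm peg comes from above, while the 0.2 mm clearance is a hypothetical value you would replace with one calibrated for your own printer.

# Worked example of sizing a hole for a printed peg.
# All numbers are hypothetical; calibrate the clearance for your own printer.

PEG_DIAMETER_MM = 5.0   # nominal peg diameter
CLEARANCE_MM = 0.2      # diametral clearance added to the hole (printer dependent)

def hole_diameter(peg_diameter_mm, clearance_mm=CLEARANCE_MM):
    """Return the hole diameter needed for a peg of the given size."""
    return peg_diameter_mm + clearance_mm

print(f"Model the hole at {hole_diameter(PEG_DIAMETER_MM):.2f} mm "
      f"for a {PEG_DIAMETER_MM:.2f} mm peg")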

To calibrate your 3D printer, check out this guide: https://teachingtechyt.github.io/calibration.html

To work on 3D printing joints, check out this link: https://3dprinterly.com/how-to-3d-print-connecting-joints-interlocking-parts/

Vaporware

Data science is a new field of expertise that uses computers to leverage the information in data through machine learning, a subfield of artificial intelligence. One of the largest problems today for the uninitiated is the threat of vaporware, or the selling of the idea rather than the solution. Given the novelty of the field, and knowing there are people out there selling quick-fix remedies to complex problems, how can one best distinguish the ‘too good to be true’ from genuine state-of-the-art solutions (abbreviated ‘SOTA’ or ‘SoA’; in IT the latter can be confused with service-oriented architecture, ‘SOA’)?

For investors attempting to derisk a startup or understand its special sauce, the best way to see through possible vaporware is to look at the data being used. Consider whether it is plausible that someone could capture information about the process they are selling from the data used to build a machine learning model. In the end, artificial intelligence is simply math applied to data, so it is up to them to come up with sufficient evidence to convince you that the data contains the information. If you have doubts, ask them to show you which features are driving the model, or what the computer is looking for in an image or video to make its decision.
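
To make that question concrete, one common way to see which features drive a model is permutation importance; the sketch below uses scikit-learn on placeholder data, not any particular vendor’s model.

# Hypothetical example: rank which features actually drive a model's predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder data standing in for whatever the model was trained on.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Permutation importance: how much does held-out accuracy drop when a feature is shuffled?
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for idx in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f}")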

For companies contemplating a merger or acquisition (M&A) of a technology, in addition to understanding the data used to generate the fantastical results, we advise signing an NDA to review the source code. There is a large difference between a proof of concept (POC) or minimum viable product (MVP) and a production-ready, scalable solution. For example, many initial iterations will not have functional version control fully integrated, or a solution that can scale to tens of thousands of users without a complete rewrite or refactoring of the code. Data scientists are not data engineers or developers, and the risk companies run when derisking or auditing a startup is assuming that the POC or MVP will translate directly into an enterprise solution. If the valuation of the company relies on data, additional considerations need to be taken into account to assess the bias of that data. For example, in the healthcare industry, or when using biometric information such as facial recognition, is the information sufficiently representative of your target population, or was it only captured from western Europeans? Will the company be labeled as racist when the algorithm is found unable to distinguish between Asian faces?

For companies with digital exhaust (data generated from existing solutions) and little experience with data or analytics, how do you rank external vendors? Every vendor has their own proprietary software solution to slice and dice data, but in the end they are software developers and not data scientists, so what they are selling is likely a cookie-cutter solution wrapped in a shiny interface. The offerings range from small vendors that will not do any of the data preparation needed for machine learning, to the big players looking to take your data and then create a solution that directly competes with the POC you paid them to develop. For example, several fly-by-night vendors will advertise automated machine learning solutions that provide the best-fitting model very quickly. What they will not do is the 80% of the work required to prepare the data, or provide the efficiency that lets you leverage your dataset in a reasonable timeframe, i.e. seconds rather than weeks. On the other hand, many of the big names will only offer a non-exclusive license to any derived works from the project, such as other products developed with your data that may directly compete with your own, or will deliver results without sharing any of the source code, locking you into the vendor for the entire lifecycle of a product. A good external vendor is one that provides all work generated during a project (source code, results, etc.), makes no claim on the derived works regarding licensing or use, as they would then be collaborators rather than external vendors, and expects only payment for their services. The data is where the value lies, and there are many vendors out there wanting to double dip by taking payment for services as well as the value of the data you provide them.

There are many other things to watch out for in the new field of artificial intelligence, machine learning, and data science. If you feel uncomfortable or have questions, do not hesitate to reach out.

GPU accelerated ML Ubuntu setup

Each time I get a new workstation, I have to set up the machine learning environment with GPU acceleration. There are some tips, tricks, and stumbling blocks I have run into, and after the second time I wanted to make notes for myself. This walkthrough is for Ubuntu 20.04 LTS with an NVIDIA RTX 2080 Ti running Anaconda, CUDA 10.1, TensorFlow 2.1, and Keras (Python and RStudio), but may be helpful in other situations.

According to https://www.tensorflow.org/install and https://www.tensorflow.org/install/gpu, the following needs to be installed:

  • python: 3.5 – 3.7
  • pip: > 19.0
  • NVIDIA software:
      • NVIDIA GPU drivers: > 418.x
      • CUDA Toolkit: 10.1
      • cuDNN SDK: >= 7.6

For setup using Docker, check out my other [upcoming] post.

Prerequisites

Make sure your machine is up to date first (I like to reboot after updating):

sudo apt-get update
sudo apt-get upgrade
sudo reboot now

Ubuntu 20.04 LTS comes with most of the required dependencies installed:

sudo apt -y install build-essential
sudo apt -y install gcc g++

My NVidia drivers were installed using Ubuntu 20.04 LTS drivers:

sudo apt-get install nvidia-cuda-dev

Once installed, the CUDA toolkit should be in the /usr/local/cuda-10.1/ folder. An easy trick for keeping multiple versions of CUDA on a single machine is to create a symbolic link, /usr/local/cuda, that points to the version of CUDA you want to run.

Install cuDNN from the NVIDIA developer site and verify your CUDA install with the nvcc --version command. The cuDNN build should match the installed CUDA version, and use the Linux tar file rather than the deb package.
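
With the driver, CUDA 10.1, and cuDNN in place, a quick sanity check is to confirm that TensorFlow can actually see the GPU; the following assumes the TensorFlow 2.1 environment described above.

# Quick sanity check that TensorFlow was built with CUDA and can see the GPU.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())

gpus = tf.config.list_physical_devices('GPU')
print("GPUs visible to TensorFlow:", gpus)

if not gpus:
    print("No GPU found: re-check the driver, CUDA 10.1, and cuDNN versions.")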

OpenCV for Machine Learning Processing RTSP Video

The goal of this post is to document how to (attempt to) set up an Ubuntu environment with GPU-accelerated OpenCV and allow for real-time object recognition on Real Time Streaming Protocol (RTSP) cameras. This would provide a solution for processing a home security video feed or other video streaming sources.
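
Before the installation details, here is a minimal sketch of what that end goal looks like in Python, assuming an OpenCV build with the video backends described below; the RTSP URL, credentials, and detection step are placeholders.

# Minimal sketch: read frames from an RTSP camera and hand them to a detector.
import cv2

RTSP_URL = "rtsp://user:password@192.168.1.10:554/stream1"  # hypothetical camera

cap = cv2.VideoCapture(RTSP_URL)
if not cap.isOpened():
    raise RuntimeError("Could not open RTSP stream; check the URL and codec support")

while True:
    ok, frame = cap.read()
    if not ok:
        break  # stream dropped or ended
    # Placeholder for object recognition, e.g. a cv2.dnn model run on the frame.
    cv2.imshow("rtsp", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()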

Step 1: Installation Process

Creating the appropriate environment and setting up the software has been extremely tedious so far, which motivated me to create this post. Essentially, it comes down to the need to compile OpenCV from source with the GPU/CUDA flags turned on so it can leverage a GPU rather than depending on CPU processing, which is the default for the Ubuntu repository versions.

Dependencies

First there is a need to install video, imaging, and audio processing libraries for Ubuntu. The following dependencies have been listed as required for OpenCV in other blog posts.

sudo apt-get update
sudo apt-get upgrade

Then install the dependencies for image and video processing.

sudo apt install gcc-8 g++-8
sudo apt install build-essential cmake pkg-config unzip yasm git checkinstall
sudo apt install libjpeg-dev libpng-dev libtiff-dev
sudo apt install libavcodec-dev libavformat-dev libswscale-dev libavresample-dev
sudo apt install libgstreamer1.0-dev libgstreamer-plugins-base1.0-dev
sudo apt install libxvidcore-dev x264 libx264-dev libfaac-dev libmp3lame-dev libtheora-dev
sudo apt install libfaac-dev libmp3lame-dev libvorbis-dev
sudo apt install libopencore-amrnb-dev libopencore-amrwb-dev
sudo apt-get install libdc1394-22 libdc1394-22-dev libxine2-dev libv4l-dev v4l-utils
cd /usr/include/linux
sudo ln -s -f ../libv4l1-videodev.h videodev.h

sudo apt install python3-testresources
sudo apt-get install libtbb-dev
sudo apt-get install libatlas-base-dev gfortran
sudo apt-get install libprotobuf-dev protobuf-compiler
sudo apt-get install libgoogle-glog-dev libgflags-dev
sudo apt-get install libgphoto2-dev libeigen3-dev libhdf5-dev doxygen

Install Tesseract for OCR

sudo apt-get install libtesseract-dev

One issue is to ensure that libcublas10 is version 10.1, and not 10.0 or 10.2, so that it corresponds to your CUDA version.

sudo add-apt-repository multiverse
sudo apt-cache madison libcublas10
sudo apt-get install libcublas10=10.1.243-3 -V

I am using Anaconda and am installing to the base environment, but you can also create an Anaconda environment specifically for OpenCV. See this link for examples of using a custom Anaconda environment: https://medium.com/machine-learning-mindset/opencv-anaconda-installation-in-ubuntu-98e4707ef611

Getting and Installing

I am creating a temporary directory for installing OpenCV.

mkdir ~/tmp/
cd ~/tmp
wget -O opencv.zip https://github.com/opencv/opencv/archive/master.zip
unzip opencv.zip
rm opencv.zip
wget -O opencv_contrib.zip https://github.com/opencv/opencv_contrib/archive/master.zip
unzip opencv_contrib.zip
rm opencv_contrib.zip

Then compile OpenCV (note: you have to set CUDA_ARCH_BIN=X.X to your graphics card’s compute capability; see the NVIDIA developer website).

cd opencv-master
mkdir build
cd build
cmake -D CMAKE_BUILD_TYPE=RELEASE \
-D CMAKE_C_COMPILER=/usr/bin/gcc-8 \
-D CMAKE_INSTALL_PREFIX=/home/john/anaconda3 \
-D INSTALL_PYTHON_EXAMPLES=ON \
-D INSTALL_C_EXAMPLES=OFF \
-D WITH_TBB=ON \
-D WITH_CUDA=ON \
-D BUILD_opencv_cudacodec=OFF \
-D ENABLE_FAST_MATH=1 \
-D CUDA_FAST_MATH=1 \
-D WITH_CUBLAS=1 \
-D WITH_V4L=ON \
-D WITH_QT=OFF \
-D WITH_OPENGL=ON \
-D WITH_GSTREAMER=ON \
-D OPENCV_GENERATE_PKGCONFIG=ON \
-D OPENCV_PC_FILE_NAME=opencv.pc \
-D OPENCV_ENABLE_NONFREE=ON \
-D OPENCV_EXTRA_MODULES_PATH=~/tmp/opencv_contrib-master/modules \
-D OPENCV_PYTHON3_INSTALL_PATH=/home/john/anaconda3/lib/python3.8/site-packages \
-D PYTHON_DEFAULT_EXECUTABLE=/home/john/anaconda3/bin/python \
-D PYTHON_EXECUTABLE=/home/john/anaconda3/bin/python3 \
-D PYTHON2_EXECUTABLE=/home/john/anaconda3/bin/python2 \
-D PYTHON3_EXECUTABLE=/home/john/anaconda3/bin/python3 \
-D PYTHON_INCLUDE_DIR=/home/john/anaconda3/include/python3.8 \
-D PYTHON_PACKAGES_PATH=/home/john/anaconda3/lib/python3.8/site-packages \
-D PYTHON3_LIBRARY=/home/john/anaconda3/lib/libpython3.8.so \
-D ZLIB_LIBRARY_RELEASE=/home/john/anaconda3/lib/libz.so \
-D PNG_LIBRARY_RELEASE=/home/john/anaconda3/lib/libpng.so \
-D JPEG_LIBRARY=/home/john/anaconda3/lib/libjpeg.so \
-D TIFF_LIBRARY_RELEASE=/home/john/anaconda3/lib/libtiff.so \
-D BUILD_EXAMPLES=ON \
-D WITH_CUDNN=ON \
-D OPENCV_DNN_CUDA=ON \
-D CUDNN_LIBRARY=/usr/local/cuda/lib64/libcudnn.so.7.6.5 \
-D CUDNN_INCLUDE_DIR=/usr/local/cuda/include \
-D CUDA_ARCH_BIN=7.5 ..

After generating the make files for the build, check how many cores are available with nproc and compile with that many parallel jobs:

nproc
make -j12 VERBOSE=1

My make would often run for a while but ultimately fail at random points. A workaround I found is to clear the CMake cache file and rerun it:

rm CMakeCache.txt
make -j12 VERBOSE=1
sudo make install
sudo ldconfig

The sudo ldconfig call updates the shared library cache.

After the make and make install, OpenCV is installed, but you need to manually go to /usr/local/lib/python3.8/site-packages/cv2/python-3.8/, rename the OpenCV library file, and link it into the Anaconda library path for the base Python environment:

sudo cp cv2.cpython-38-x86_64-linux-gnu.so cv2.so
cd ~/anaconda3/lib/python3.8/site-packages/
ln -s /usr/local/lib/python3.8/site-packages/cv2/python-3.8/cv2.so cv2.so

Then test it by starting Python and importing the module.

$ python
Python 3.8.5 (default, Sep 4 2020, 07:30:14)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cv2
>>> cv2.__version__
'4.5.0-pre'
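
Beyond the version string, it is worth confirming that the CUDA support actually made it into the build. A small check, assuming the cv2 module compiled above with WITH_CUDA=ON and OPENCV_DNN_CUDA=ON:

# Check that the freshly built cv2 module was compiled with CUDA support.
import cv2

# Number of CUDA-capable devices OpenCV can use (0 means a CPU-only build or a driver issue).
print("CUDA devices:", cv2.cuda.getCudaEnabledDeviceCount())

# The build information should list CUDA and cuDNN as YES.
for line in cv2.getBuildInformation().splitlines():
    if "CUDA" in line or "cuDNN" in line:
        print(line.strip())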

Issues:

There are a couple of issues specific to my experience installing OpenCV with Anaconda. First, I had issues with libcublas10: apt installed version 10.2 rather than 10.1, which is the version that corresponds to my CUDA install. I had to downgrade the libcublas version with apt:

sudo apt-get install libcublas10=10.1.243-3 -V

After downgrading, I think I also had an issue with CUDA itself, so verify your CUDA version.

During the build and install, the process would sometimes fail; one solution I found was removing the tiff package from Anaconda (conda uninstall libtiff), but simply deleting the CMakeCache.txt file and rerunning the make install also worked.

Clustering analyses

For clustering analysis we can use PCA to identify principal components that capture the greatest degree of variation. For the R function Morpho::procSym, the first PC adjusts for size and subsequent PCs capture additional variability in the 3D landmarks.

A second approach is to use the k-means algorithm to find clusters. One question we have is how to determine the number of clusters, or the size of ‘k’, for the algorithm. This is based on the following post: https://medium.com/analytics-vidhya/how-to-determine-the-optimal-k-for-k-means-708505d204eb

Determining the number of clusters can be done with the “Elbow Method” or the “Silhouette Method”. The Elbow method is more of a decision rule, while the Silhouette method is used for validation during clustering; the two methods can be used together to gain confidence in your decision.
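
A minimal sketch of both methods using scikit-learn, with synthetic blobs standing in for the landmark PC scores, might look like this:

# Sketch of the Elbow and Silhouette methods for choosing k (synthetic data as a stand-in).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # placeholder for PC scores

for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss = km.inertia_                      # within-cluster sum of squares (Elbow method)
    sil = silhouette_score(X, km.labels_)  # mean silhouette width (Silhouette method)
    print(f"k={k}: WSS={wss:.1f}, silhouette={sil:.3f}")

# Pick k where the WSS curve bends (the "elbow") and the silhouette score is high.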

Learning bioinformatics mapping pipelines

As a statistical epidemiologist and biostatistician, I need to understand how the data I have was generated and where it came from, which can only be confirmed if the analysis is reproducible. Through experience and circumstance, I have realized that bioinformatics genetic mapping pipelines are not reproducible and are typically developed individually in an ad hoc manner, leading to reproducibility issues in the field of bioinformatics.

In an effort to address this, the Common Workflow Language (https://github.com/common-workflow-language/cwltool) has been developed and combined with Docker to allow for reproducible pipelines.

In an additional effort to make pipelines fully deterministic, there is software that hashes (MD5) the data and creates a common repository of data together with the hashes, so you can be sure the input data is exactly the same (https://guix-hpc.bordeaux.inria.fr/blog/2019/01/creating-a-reproducible-workflow-with-cwl/).
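
The hashing idea itself is simple to reproduce; here is a minimal Python sketch of verifying that an input file is byte-for-byte what a pipeline expects (the file name and expected hash are placeholders):

# Verify that an input data file is exactly the file the pipeline expects.
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

EXPECTED = "d41d8cd98f00b204e9800998ecf8427e"  # placeholder hash recorded with the pipeline
actual = md5sum("reads.fastq.gz")              # placeholder input file

if actual != EXPECTED:
    raise SystemExit(f"Input data changed: expected {EXPECTED}, got {actual}")
print("Input data verified:", actual)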

Going forward, I plan to implement bioinformatic pipelines and as a start, I have been playing with the CWL tutorial by Andrew Jesaitis (https://andrewjesaitis.com/2017/02/common-workflow-language—a-tutorial-on-making-bioinformatics-repeatable/).

 

Hierarchical Spatial Modeling

I am starting a new project developing a hierarchical deterministic spatial model for a zoonotic disease, combining human incidence, zoonotic risk factors, and an ecological model of the zoonotic lifecycle to model targeted interventions.

It is a fun side project in my free time that will rely on open source data and collaboration with disease experts. Eventually, the model will provide a platform for public health officials, the general public, and researchers, with the data visualized in a multi-platform application. More information to come.

Virtualbox Guest Additions Linux

Irritating as it is when something that should work does not, it is even more irritating when the error message is misleading. For example, today I came across an error when installing Guest Additions for VirtualBox in a Linux VM. The Guest Additions installer failed and gave the possible cause as missing generic kernel headers. However, it was not the headers, which were installed, but DKMS. To solve this, install the ‘build-essential’ package.