
Implementation of ARKit with Hand Gesture and Features Overview


The first VR-AR helmet was created in 1968. This helmet was connected to a computer and attached to the ceiling. Since then, the development industry has evolved, the computing power of processors has grown and devices have become increasingly compact. Nowadays, an AR device can be a mobile phone that has the necessary software. Already, there are applications on the market that use AR technology, which are not only entertaining but also useful in everyday life. This includes applications for measuring the shape of objects, apps for furnishing an apartment and many more.


On March 18, 2020, Apple announced its most powerful AR device to date, the new iPad Pro with a LiDAR scanner. The LiDAR scanner measures the transit time of light reflected from surfaces up to 5 meters away. It works both indoors and outdoors and operates at nanosecond speeds. Not surprisingly, it opens up colossal opportunities for augmented reality and beyond. Apps using ARKit will automatically get instant AR object placement, improved motion capture and human occlusion. These features work on devices running iOS 14 and newer releases.

At WWDC 2020, Apple presented ARKit 4, a new version of its augmented reality framework. One of the main features of this update is the Location Anchors technology, which brings higher-quality AR content to the street by allowing developers to specify a longitude and latitude for placing virtual objects. ARKit 4 then uses those coordinates and data from Apple Maps to place an AR object at a specific location and height in the real world. As a result, when a developer places a virtual object in the real world, such as a virtual sculpture in a busy square, that object is saved and displayed there, so everyone who views the spot with an Apple AR device sees it in the same place. “Location Anchors” will first launch in major cities such as Los Angeles, San Francisco, Chicago, Miami and New York, with more cities available later.

Overview concept 

Letting people interact with the AR environment is a very important capability, because it strongly influences the pace at which AR technology develops.

In today’s article, we’ll walk through the implementation of a basic augmented reality app for iOS using ARKit in conjunction with finger tracking. We will consider two approaches: the first uses a custom neural network, and the second uses the functionality built into iOS 14 and later.

Caffe Neural Network model to CoreML

Many neural networks have already been implemented and trained to solve different problems. To run a neural network in an iOS application, you can use the CoreML framework, which lets you integrate trained models into your app. The first step is to convert the model from its original format to the CoreML format. For this task there is coremltools, a Python package that can convert the .pb and .caffemodel formats. It also supports quantization of weights from 8 bits down to 1 bit.

Use the following command to install the coremltools package.

pip install coremltools

After a little digging on the internet, you can find a Caffe model for detecting hand keypoints. In this article, you can learn more about how this network works and how to run it in Python or C++.

Script to convert .caffemodel to quantized coreml

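import coremltools
from coremltools.models.neural_network import quantization_utils

# Placeholder paths: the original values were lost, so substitute your own files.
proto_file = "path/to/model.prototxt"
caffe_model = "path/to/model.caffemodel"

# Convert the Caffe model; "image" is a placeholder for the network's input layer name.
# Note: the Caffe converter is only available in older coremltools releases.
coreml_model = coremltools.converters.caffe.convert(
    (caffe_model, proto_file),
    image_input_names="image",
    image_scale=1/255.)

# Apply 8-bit weight quantization and save the result.
model_8bit = quantization_utils.quantize_weights(coreml_model, nbits=8)
model_8bit.save("hand_keypoint.mlmodel")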

The script above converts the network to the CoreML format, applies 8-bit weight quantization and saves the resulting model to the hand_keypoint.mlmodel file.

Quantization significantly reduces the size of the neural network in memory and, in some cases, reduces the image processing time, so this feature is very important and useful for developers.
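To make the size reduction concrete, here is a minimal sketch of a simplified linear 8-bit quantization scheme. It only illustrates the idea and is not the actual coremltools implementation; the function names quantize8Bit and dequantize are hypothetical. Each 32-bit weight is mapped to one byte, so storage shrinks roughly four times at the cost of a small rounding error.

```swift
/// Linearly quantizes Float weights to 8-bit integers.
/// Returns the quantized values plus the scale and minimum needed to restore them.
func quantize8Bit(_ weights: [Float]) -> (values: [UInt8], scale: Float, minValue: Float) {
    guard let minValue = weights.min(), let maxValue = weights.max(), maxValue > minValue else {
        // Empty or constant input: every weight maps to 0.
        return ([UInt8](repeating: 0, count: weights.count), 1, weights.first ?? 0)
    }
    let scale = (maxValue - minValue) / 255
    let values = weights.map { weight -> UInt8 in
        let level = ((weight - minValue) / scale).rounded()
        // Clamp to the valid byte range to guard against rounding drift.
        return UInt8(min(255, max(0, level)))
    }
    return (values, scale, minValue)
}

/// Restores approximate Float weights from their 8-bit representation.
func dequantize(_ values: [UInt8], scale: Float, minValue: Float) -> [Float] {
    return values.map { Float($0) * scale + minValue }
}

let weights: [Float] = [-1.0, -0.5, 0.0, 0.5, 1.0]
let q = quantize8Bit(weights)
let restored = dequantize(q.values, scale: q.scale, minValue: q.minValue)

// Compare storage: 4 bytes per Float versus 1 byte per quantized weight.
let originalBytes = weights.count * MemoryLayout<Float>.size
let quantizedBytes = q.values.count * MemoryLayout<UInt8>.size
print(originalBytes, quantizedBytes)  // prints "20 5"
```

The restored weights differ from the originals by at most about half a quantization step, which is why accuracy should be re-checked after quantizing a real model.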

Vision framework overview

In iOS 14, the Vision framework gained the ability to detect hand keypoints as well as keypoints of the whole body. This will interest not only developers of AR applications but developers of computer vision applications in general. Third-party developers had previously proposed algorithms for this problem, but there is now no need to look for other solutions. As a side note, the custom neural network approaches we tested were inferior in performance to the Vision solution.

The request returns 21 points per hand: 4 points for each of the 5 fingers and 1 point for the wrist. The image above represents the principle of naming key points for each finger.

Check out below how easy it is to get the key points.

import Vision

private let handPoseRequest = VNDetectHumanHandPoseRequest()

// Set the maximum number of hands to detect.
handPoseRequest.maximumHandCount = 1

// sampleBuffer is the CMSampleBuffer holding the image to analyze.
let handler = VNImageRequestHandler(cmSampleBuffer: sampleBuffer, orientation: .up, options: [:])

do {
    try handler.perform([self.handPoseRequest])
    guard let observation = self.handPoseRequest.results?.first else { return }
    // Dictionary of all recognized hand joints, keyed by joint name.
    let handPoints = try observation.recognizedPoints(.all)
    guard let indexTip = handPoints[.indexTip] else { return }
    let x = indexTip.location.x
    // Vision's origin is the lower-left corner, so flip y for a top-left origin.
    let y = 1 - indexTip.location.y
    let score = indexTip.confidence
} catch {
    print("error")
}

Executing this code gives us the position of the index fingertip on a scale from 0 to 1, along with a confidence score. To get the coordinates on the image, multiply the x and y values by the width and height of the input image, respectively. As we can see, the script above retrieves information for one point. To get information about the remaining 20, we have to read each point separately (you can find information on this here). The framework can also detect keypoints of the whole body. Such functionality will be useful for interacting with AR objects in real time.
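The conversion from normalized values to image coordinates can be sketched as follows. This is a minimal illustration in plain Swift without the Vision types; the PixelPoint type and the pixelPoint function are hypothetical names, and the 1920 by 1080 image size is an assumed example.

```swift
/// A point in pixel coordinates with a top-left origin.
struct PixelPoint: Equatable {
    let x: Double
    let y: Double
}

/// Converts a normalized point (origin in the lower-left corner,
/// both axes in 0...1) into pixel coordinates with a top-left origin.
func pixelPoint(normalizedX: Double, normalizedY: Double,
                imageWidth: Double, imageHeight: Double) -> PixelPoint {
    let x = normalizedX * imageWidth
    let y = (1 - normalizedY) * imageHeight  // flip the vertical axis
    return PixelPoint(x: x, y: y)
}

// Example: a fingertip at (0.25, 0.75) in a 1920x1080 frame.
let tip = pixelPoint(normalizedX: 0.25, normalizedY: 0.75,
                     imageWidth: 1920, imageHeight: 1080)
print(tip.x, tip.y)  // prints "480.0 270.0"
```

In a real app you would typically also discard points whose confidence is below some threshold before converting them.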

Setting up ARKit

In this section, we’ll demonstrate the basic configuration. There are a lot of ready-made demo applications. Via this link, you can check out different demos which may inspire you.

ARKit setup:

  • Add an ARKit SceneKit View to your Storyboard.
  • Set up the ARSCNView and select detection of horizontal surfaces in the configuration:

@IBOutlet weak var sceneView: ARSCNView!

// In viewDidLoad():
// Set the view's delegate
sceneView.delegate = self
// Set the scene to the view
let scene = SCNScene()
sceneView.scene = scene
// The values below were missing from the original listing; true is assumed.
sceneView.autoenablesDefaultLighting = true
sceneView.automaticallyUpdatesLighting = true

override func viewWillAppear(_ animated: Bool) {
    super.viewWillAppear(animated)
    let configuration = ARWorldTrackingConfiguration()
    configuration.planeDetection = .horizontal
    sceneView.session.run(configuration)
}
  • Add an AR object to the scene.

In order to insert an object into a scene, we will use a point on the screen, which will subsequently be transformed into the augmented reality coordinate system.

// tapLocation is a CGPoint in the coordinate system of sceneView.
let hitTestResultsPlane = sceneView.hitTest(tapLocation, types: .existingPlaneUsingExtent)
guard let hitTestResultPlane = hitTestResultsPlane.first else { return }
// The fourth column of the world transform holds the translation (position).
let translationPlane = hitTestResultPlane.worldTransform.columns.3
let bottleNodePosition = SCNVector3(translationPlane.x, translationPlane.y, translationPlane.z)

Add a 3D label to the previously obtained coordinates:

let text = SCNText(string: "Codahead", extrusionDepth: 1)
let material = SCNMaterial()
material.diffuse.contents = UIColor(red: 0/255, green: 255/255, blue: 0/255, alpha: 1.0)
text.materials = [material]

let figNode = SCNNode()
figNode.name = "codahead"
figNode.position = bottleNodePosition
figNode.scale = SCNVector3(0.01, 0.01, 0.01)
figNode.geometry = text
// Attach the node to the scene so that it is rendered.
sceneView.scene.rootNode.addChildNode(figNode)

We now have a working AR application.

Combining ARKit and Vision

The final step is to combine the two modules presented above, which you can implement in different ways. In this example, we take the current frame from the sceneView, convert it to a CMSampleBuffer and feed it to the hand point recognition algorithm.

guard let cvPixelBuffer = sceneView.session.currentFrame?.capturedImage else { return }

// Wrap the pixel buffer in a CMSampleBuffer.
var info = CMSampleTimingInfo()
info.presentationTimeStamp = CMTime.zero
info.duration = CMTime.invalid
info.decodeTimeStamp = CMTime.invalid

var formatDesc: CMFormatDescription? = nil
CMVideoFormatDescriptionCreateForImageBuffer(allocator: kCFAllocatorDefault,
                                             imageBuffer: cvPixelBuffer,
                                             formatDescriptionOut: &formatDesc)
guard let formatDescription = formatDesc else { return }

var sampleBuffer: CMSampleBuffer? = nil
CMSampleBufferCreateReadyWithImageBuffer(allocator: kCFAllocatorDefault,
                                         imageBuffer: cvPixelBuffer,
                                         formatDescription: formatDescription,
                                         sampleTiming: &info,
                                         sampleBufferOut: &sampleBuffer)

if let sampleBuffer = sampleBuffer {
    let handler = VNImageRequestHandler(cmSampleBuffer: sampleBuffer, orientation: .up, options: [:])
    do {
        try handler.perform([self.handPoseRequest])
        guard let observation = self.handPoseRequest.results?.first else { return }
        let handPoints = try observation.recognizedPoints(.all)
        // Here you can process the position of the hand and take any action.
    } catch {
        print("error")
    }
}

As a result, we get a ready-made demo algorithm for the interaction of a human hand with AR objects. You can program certain gestures, for example grabbing an object and/or rotating it. In our demo application for grabbing objects by hand, we used the Euclidean distance between the thumb and index finger: when the distance between them was small enough, the object was captured and we could manipulate it.
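The grab check described above can be sketched roughly as follows. This is not the code of the demo application: Point2D, distance and isPinching are hypothetical names, and the threshold of 0.05 in normalized units is an assumed value that would need tuning.

```swift
/// A 2D point in normalized image coordinates.
struct Point2D {
    let x: Double
    let y: Double
}

/// Euclidean distance between two points.
func distance(_ a: Point2D, _ b: Point2D) -> Double {
    let dx = a.x - b.x
    let dy = a.y - b.y
    return (dx * dx + dy * dy).squareRoot()
}

/// Returns true when the thumb tip and index tip are close enough to count as a pinch.
/// The default threshold is an assumed value, not one taken from the demo app.
func isPinching(thumbTip: Point2D, indexTip: Point2D, threshold: Double = 0.05) -> Bool {
    return distance(thumbTip, indexTip) < threshold
}

// Fingers far apart: no grab.
let apart = isPinching(thumbTip: Point2D(x: 0.40, y: 0.50),
                       indexTip: Point2D(x: 0.60, y: 0.50))
// Fingers almost touching: grab.
let pinched = isPinching(thumbTip: Point2D(x: 0.50, y: 0.50),
                         indexTip: Point2D(x: 0.52, y: 0.51))
print(apart, pinched)  // prints "false true"
```

Because normalized coordinates stretch differently along x and y on non-square images, it may be more robust to convert the points to pixel coordinates before measuring the distance.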


In this article, you learned how to use ARKit with the built-in Vision framework. Hopefully it will inspire you to create your own AR applications for the iOS platform. The main disadvantage of this solution is that you get the hand’s key points only in 2D, but this should be sufficient for most cases. The limitation can be partially overcome by determining hand rotations with trigonometric formulas and rotating objects using quaternions or Euler angles, which we may cover in the next article. The latest iPad Pro has a LiDAR sensor that can determine the depth of a scene up to 5 meters away, and you can also use it as a basis for developing your own physics in an application. In any case, the AR industry, like machine learning, has great potential in the near future.



Andrei Liudkievich

AI Dev