Computer Vision

Using the 35 fisheye cameras placed around the building, we developed a blob-tracking algorithm to track people inside it.

The input devices are Vivotek fisheye cameras, which capture the video frames used as input data for tracking, and a Kinect V2, which captures RGB and depth images. Images are acquired at 15 FPS at a resolution of 1280 x 720. The cameras are fixed to the ceiling and capture 360-degree images of the scene. These network cameras must be configured before they can be used in our project. Figure 4 shows a flow diagram of the steps executed from the moment an image is captured to the moment the virtual model is updated to reflect changes in the physical world.

Tracking through multiple cameras:

The existing model was designed for only one camera. We extended it to work with multiple cameras, which is essential for a smooth transition when a person moves from the field of view of one camera to another. We achieved this by cropping the images and applying Kalman filtering. Coordination between the Unity model and the actual video frames is achieved by calculating a scaling factor that maps each blob location into the virtual environment.
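The scaling step can be sketched as follows. This is a minimal illustration rather than the project's code: the function name, the floor dimensions, and the assumption of a plain per-axis linear scale with an origin offset are all hypothetical.

```python
def pixel_to_world(cx, cy, image_size, world_size, world_origin=(0.0, 0.0)):
    """Map a blob centroid in pixels to virtual-floor coordinates.

    Assumes (hypothetically) that the cropped frame covers an axis-aligned
    rectangle of the floor, so one scale factor per axis is enough.
    """
    img_w, img_h = image_size
    world_w, world_h = world_size
    scale_x = world_w / img_w   # world units per pixel, horizontally
    scale_y = world_h / img_h   # world units per pixel, vertically
    return (world_origin[0] + cx * scale_x,
            world_origin[1] + cy * scale_y)


# Example: a 1280 x 720 frame covering a hypothetical 16 x 9 unit floor patch.
position = pixel_to_world(640, 360, (1280, 720), (16.0, 9.0))
```

With several cameras, each camera would presumably get its own `world_origin` so that neighbouring views land in one shared coordinate frame.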

Background estimation:

Initially, the system used Gaussian background subtraction to estimate the background. We switched to the k-nearest neighbors (kNN) classification algorithm instead. We use 500 frames to estimate the background model, because the people being tracked will not always be in constant motion. The only drawback of this approach is that if a person moves some object in the scene, that object remains in the foreground for some extra time before it is removed. For people, only a prolonged interval of complete stillness will cause someone to be removed from the foreground and treated as part of the background.
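The trade-off described above can be illustrated with a toy model of a single grayscale pixel. This is only a sketch of the nearest-neighbour voting idea, not the OpenCV implementation the system relies on; the class name and the values of `k` and `max_dist` are hypothetical, and only the 500-frame history comes from the text.

```python
from collections import deque


class PixelKnnBackground:
    """Toy k-nearest-neighbour background model for ONE grayscale pixel."""

    def __init__(self, history=500, k=100, max_dist=20):
        self.samples = deque(maxlen=history)  # most recent intensities
        self.k = k                # near samples needed to call it background
        self.max_dist = max_dist  # intensity difference counted as "near"

    def apply(self, value):
        """Return True if `value` is foreground, then store it as a sample."""
        near = sum(1 for s in self.samples if abs(s - value) <= self.max_dist)
        is_foreground = near < self.k
        self.samples.append(value)
        return is_foreground


model = PixelKnnBackground()
for _ in range(500):
    model.apply(100)              # empty floor, intensity 100

# A person stops on this pixel (intensity 200) and stays completely still.
frames_as_foreground = 0
while model.apply(200):
    frames_as_foreground += 1
```

In this toy version the still person is absorbed into the background once `k` matching samples have accumulated, which mirrors why a longer history and a stricter vote keep slow-moving people in the foreground for longer.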

Kalman filtering:

To improve real-time tracking of people in the physical world, we used Kalman filtering. Our system is based on OpenCV 3.0. The transition matrix is generated by a built-in OpenCV function. The filter state consists of centroid.x, centroid.y, and centroid.z. The third state is constant in our case, as we do not consider depth variation. The centroid of each blob is calculated by finding the contours in the binary images. A contour-filtering algorithm filters out small false blobs and erases them from the track list. For the remaining tracks, the cv::KalmanFilter::predict() function predicts the next centroid location, i.e. centroid.x, centroid.y, and centroid.z. Based on the predicted location and the measured position of the centroid, the location is then corrected using cv::KalmanFilter::correct(). In the qualitative video section, we can see that each object is tracked with a consistent ID.
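The predict/correct cycle can be sketched without any dependencies. This is not the cv::KalmanFilter code used in the system; it assumes identity transition and measurement matrices and diagonal noise, in which case the filter decouples into one scalar filter per axis. The class name and the noise values are hypothetical.

```python
class CentroidKalman:
    """Kalman filter over (x, y, z), assuming identity transition and
    measurement matrices, so each axis is filtered independently."""

    def __init__(self, initial, process_var=1e-2, measurement_var=1.0):
        self.state = list(initial)           # estimated centroid
        self.var = [1.0] * len(self.state)   # per-axis error variance
        self.q = process_var                 # hypothetical process noise
        self.r = measurement_var             # hypothetical measurement noise

    def predict(self):
        # Identity transition: the state carries over, uncertainty grows by q.
        self.var = [p + self.q for p in self.var]
        return tuple(self.state)

    def correct(self, measured):
        # Blend prediction and measurement with the per-axis Kalman gain.
        for i, z in enumerate(measured):
            gain = self.var[i] / (self.var[i] + self.r)
            self.state[i] += gain * (z - self.state[i])
            self.var[i] *= (1.0 - gain)
        return tuple(self.state)


track = CentroidKalman(initial=(100.0, 50.0, 0.0))
for measured in [(102.0, 51.0, 0.0), (104.0, 52.0, 0.0), (106.0, 53.0, 0.0)]:
    track.predict()
    estimate = track.correct(measured)
```

Because the z measurement never changes, the third state stays at its initial value, which matches the constant depth state described above.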

Communication with server:

The network consists of several components. First, the blob-tracking software sends the data it gathers to a local Node installation via UNIX pipes. A connection to the data server is then established via WebSockets. The data servers listen for WebSocket connections on a predefined port; the current connection uses port 9999 at the server URL. This is an online server, and data from the local client is sent over the internet. The server can receive blob information in real time from any number of blob data sources pushing data to it at once; the only limitations are the bandwidth of the network and the ability of the machine to keep up with processing the incoming data. The information currently sent to the server has the following structure, extracted from the input data: camera ID, origin location, blob bounding-box width and height, center of the blob location, source type (Matlab / OpenCV), image width, image height, and orientation. The structure does not yet provide the z coordinate of the blob location; our plan is to extract depth information from the scene using the Kinect.
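One way to picture the message listed above is as a JSON object. The sketch below is illustrative only: the text names the fields but not their keys, types, or encoding, so every key name, the nesting, and the use of JSON itself are assumptions.

```python
import json


def build_blob_message(camera_id, origin, center, box_size, image_size,
                       orientation, source="OpenCV"):
    """Serialise one blob observation; every key name here is hypothetical."""
    return json.dumps({
        "cameraId": camera_id,
        "origin": {"x": origin[0], "y": origin[1]},
        "boundingBox": {"width": box_size[0], "height": box_size[1]},
        "center": {"x": center[0], "y": center[1]},  # no z coordinate yet
        "source": source,                            # "Matlab" or "OpenCV"
        "imageWidth": image_size[0],
        "imageHeight": image_size[1],
        "orientation": orientation,
    })


message = build_blob_message(camera_id=1, origin=(0, 0), center=(640, 360),
                             box_size=(40, 80), image_size=(1280, 720),
                             orientation=0.0)
```

Sending the image size with every blob would let the receiver compute its own scaling into the virtual model without knowing each camera's configuration in advance.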


Before any image-processing algorithm is run on the video frames, we apply image warping to crop the region of interest (the floor region), which is the main region for tracking. For this, we carried out intrinsic parameter calibration to obtain the camera matrix and distortion coefficients of the fisheye cameras. This is a one-time operation and was done with the help of checkerboard images. The parameters have to be computed for every camera, because the hardware configuration varies slightly across different cameras. We took checkerboard images covering the entire 360-degree scene captured by each fisheye camera to obtain correct intrinsic parameters.

Distortion Coefficients D for Camera 1 (Observation Room) : [ 0.00191756652, -0.0303766206, 0.0412752517, -0.0189904887 ]

Camera matrix K = [ 339.81179, -3140.46069, 529.799377,
                    0,         340.305389,  479.500061,
                    0,         0,           1 ]

These calculated parameters are used to warp the frames of the video stream in real time. We can observe that the wall edges shown in the left image are significantly improved in the right image. This modification removed the offset present in the previous approach.
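To show how a camera matrix and four distortion coefficients of this kind act on a point, the sketch below projects a 3-D point in the camera frame to pixel coordinates using the polynomial fisheye model documented for OpenCV's fisheye module. The text does not state which calibration routine was used, so treating D as (k1, k2, k3, k4) of that model is an assumption. The focal lengths and principal point are taken from K above, and the skew entry is set to zero here purely to keep the illustration simple.

```python
import math


def fisheye_project(point, K, D):
    """Project a camera-frame 3-D point to pixels with the fisheye model
    theta_d = theta * (1 + k1*theta^2 + k2*theta^4 + k3*theta^6 + k4*theta^8)."""
    x, y, z = point
    a, b = x / z, y / z
    r = math.hypot(a, b)
    if r < 1e-12:                  # on the optical axis: nothing to distort
        xd, yd = 0.0, 0.0
    else:
        theta = math.atan(r)       # angle between the ray and the optical axis
        t2 = theta * theta
        k1, k2, k3, k4 = D
        theta_d = theta * (1 + k1 * t2 + k2 * t2**2 + k3 * t2**3 + k4 * t2**4)
        xd, yd = a * theta_d / r, b * theta_d / r
    u = K[0][0] * xd + K[0][1] * yd + K[0][2]
    v = K[1][1] * yd + K[1][2]
    return u, v


D = (0.00191756652, -0.0303766206, 0.0412752517, -0.0189904887)
K = [[339.81179, 0.0, 529.799377],   # skew set to 0 for this illustration
     [0.0, 340.305389, 479.500061],
     [0.0, 0.0, 1.0]]

center_pixel = fisheye_project((0.0, 0.0, 1.0), K, D)
side_pixel = fisheye_project((1.0, 0.0, 1.0), K, D)   # ray 45 degrees off-axis
```

Undistortion inverts this mapping for every pixel, which is why the parameters only need to be estimated once per camera and can then be reused for each incoming frame.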
