SweepNet: Wide-baseline Omnidirectional Depth Estimation

Chaerin Min

Jul 1, 2021

Changhee Won, Jongbin Ryu and Jongwoo Lim

https://arxiv.org/pdf/1902.10904.pdf

Input: images from four wide-baseline fisheye cameras, each with a 220° field of view

Output: depth estimation

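Pipeline

Calibration
Spherical sweep: warp the input images onto spherical image planes (similar to a projection)
Pass the warped images through SweepNet to obtain cost maps (SweepNet plays the role of the cost function)
Average over the camera pairs (4C2) and concatenate over the N inverse depth indices
Cost aggregation with SGM (semi-global matching) to remove noise
Take the argmin to obtain the final depth (as an inverse depth index)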

1. Calibration

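X_p: the p-th corner point of the checkerboard. x_tilde_ip: the image coordinates of the projection of X_p

Build the rigid transformation from the extrinsic parameters of the k-th capture taken by the i-th camera, and multiply the corner coordinates by it so that they are expressed in the corrected pose
Feed the coordinates from the previous step into the projection function, which accounts for the intrinsic parameters, to project them to 2D
Apply the affine transformation
Because the target is a checkerboard, the correct projected position x_tilde is known. Use it to sum the squared error over all i, k, p
Solve for the intrinsic parameters, extrinsic parameters and affine matrix that minimize this error (optimization)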
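
The steps above amount to a reprojection-error objective. Below is a minimal NumPy sketch of that objective for a single camera/capture pair. It assumes an equidistant fisheye model (image radius = f × angle from the optical axis) as a stand-in for the projection function Π, which may differ from the model used in the paper. The names `rodrigues`, `project_equidistant` and `reprojection_error` are hypothetical helpers written for this illustration.

```python
import numpy as np

def rodrigues(r):
    """Rotation vector r (3,) -> rotation matrix R(r) (3, 3)."""
    angle = np.linalg.norm(r)
    if angle < 1e-12:
        return np.eye(3)
    k = r / angle
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def project_equidistant(X, f):
    """Assumed stand-in for the projection function: equidistant fisheye.

    X: (P, 3) points in the camera frame, optical axis along +z.
    Returns (P, 2) coordinates whose radius is f * (angle from the axis).
    """
    angle = np.arctan2(np.linalg.norm(X[:, :2], axis=1), X[:, 2])
    psi = np.arctan2(X[:, 1], X[:, 0])
    return np.stack([f * angle * np.cos(psi), f * angle * np.sin(psi)], axis=1)

def reprojection_error(corners, observed, r, t, f, A, c):
    """Sum of squared errors between observed and reprojected corners.

    corners: (P, 3) checkerboard corners, observed: (P, 2) detected pixels,
    r, t: extrinsics, f: assumed intrinsic, A (2, 2) and c (2,): affine part.
    """
    X_cam = corners @ rodrigues(r).T + t   # step 1: rigid transformation
    x = project_equidistant(X_cam, f)      # step 2: projection to 2D
    x = x @ A.T + c                        # step 3: affine transformation
    return np.sum((observed - x) ** 2)     # step 4: squared error
```

The last step would hand this objective, summed over all cameras and captures, to a nonlinear least-squares solver. Which solver to use is left open here.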

2. Spherical Sweeping

The normalized image plane is the plane at distance 1 from the pinhole

p = (θ, φ): the spherical coordinates of each ray. Its depth is not yet known

In xyz coordinates, ρ(p) = (cos(φ) cos(θ), sin(φ), cos(φ) sin(θ))^T
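
As a quick check of the formula for ρ(p), here is a small NumPy sketch that turns (θ, φ) into unit ray directions. The grid layout in `ray_grid` (θ across the columns, φ across the rows, full sphere) is an assumption made for illustration, and both function names are hypothetical.

```python
import numpy as np

def ray_direction(theta, phi):
    """Unit ray rho(p) for spherical coordinates p = (theta, phi)."""
    return np.stack([np.cos(phi) * np.cos(theta),
                     np.sin(phi),
                     np.cos(phi) * np.sin(theta)], axis=-1)

def ray_grid(W, H):
    """Assumed layout: theta covers [-pi, pi) over W columns,
    phi covers [-pi/2, pi/2] over H rows. Returns (H, W, 3) unit rays."""
    theta = -np.pi + 2.0 * np.pi * np.arange(W) / W
    phi = np.linspace(-np.pi / 2, np.pi / 2, H)
    tt, pp = np.meshgrid(theta, phi)
    return ray_direction(tt, pp)
```

With this convention, (θ, φ) = (0, 0) points along +x and (π/2, 0) points along +z.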

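Warp p = (θ, φ) onto a spherical image plane of size W × H

The radius of each sphere is 1/d_n, where d_n is the inverse depth at index n
Rays that fall outside the image plane after warping are ignored
S_{i,n} is built by computing the projection of every ray p for every camera i and index n, then pasting the input image I_i onto those locations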
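
The warping step can be sketched in NumPy as follows. This is an illustration under several assumptions, not the paper's implementation: an equidistant fisheye model stands in for the real projection function, (R, t) is taken to map rig coordinates into the camera frame, sampling is nearest-neighbor, and `sweep_sphere` is a hypothetical name.

```python
import numpy as np

def sweep_sphere(image, R, t, f, center, d_n, W, H, fov=np.deg2rad(220.0)):
    """Warp one fisheye image onto the sphere of radius 1/d_n.

    image: (h, w) grayscale array. R (3, 3), t (3,): assumed rig-to-camera
    transform. f, center: parameters of the assumed equidistant model.
    Returns the (H, W) spherical image and a boolean validity mask.
    """
    theta = -np.pi + 2.0 * np.pi * np.arange(W) / W
    phi = np.linspace(-np.pi / 2, np.pi / 2, H)
    tt, pp = np.meshgrid(theta, phi)
    rays = np.stack([np.cos(pp) * np.cos(tt),
                     np.sin(pp),
                     np.cos(pp) * np.sin(tt)], axis=-1)
    X = rays / d_n                    # points on the sphere of radius 1/d_n
    Xc = X @ R.T + t                  # same points in the camera frame
    # Equidistant projection: pixel radius grows linearly with the ray angle.
    ang = np.arctan2(np.linalg.norm(Xc[..., :2], axis=-1), Xc[..., 2])
    psi = np.arctan2(Xc[..., 1], Xc[..., 0])
    u = np.rint(center[0] + f * ang * np.cos(psi)).astype(int)
    v = np.rint(center[1] + f * ang * np.sin(psi)).astype(int)
    h, w = image.shape
    # Ignore rays outside the field of view or outside the image.
    valid = (ang <= fov / 2) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    out = np.zeros((H, W), dtype=image.dtype)
    out[valid] = image[v[valid], u[valid]]
    return out, valid
```

Calling this for every camera i and every inverse depth index n gives the set of spherical images S_{i,n} that the network compares pairwise.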

3. SweepNet

A network that computes the cost between S_{i,n} and S_{j,n} (it plays the role of the cost function F)

The loss is binary cross-entropy, i.e. a negative log-likelihood

The inverse depth index n is discrete for each (l, p), so the ground truth for each (l, p, n) can be written as a binary label. That is why binary cross-entropy is used

Λ(v̂, v) = −(v̂ log v + (1 − v̂) log(1 − v)).

The last layer is a sigmoid → the resulting cost map is a matrix of real values between 0 and 1
Characteristic: global (the resolution is reduced to enlarge the receptive field) <-> local patch-based (other papers)
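
For reference, the loss Λ and the final sigmoid can be written out in a few lines of NumPy. This is a sketch only: `sigmoid` and `bce` are hypothetical names, and the clipping constant `eps` is an assumption added to keep the logarithms finite.

```python
import numpy as np

def sigmoid(z):
    """Squash raw network outputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def bce(v_hat, v, eps=1e-7):
    """Binary cross-entropy between label v_hat and prediction v, elementwise."""
    v = np.clip(v, eps, 1.0 - eps)   # avoid log(0)
    return -(v_hat * np.log(v) + (1.0 - v_hat) * np.log(1.0 - v))
```

The loss is small when the prediction agrees with the binary label and grows without bound as the prediction approaches the opposite label.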

4. Averaging & Concat

Averaging: the cost maps obtained for each pair of cameras chosen from the four are averaged

Concat: the N averaged cost maps (N is the number of inverse depth indices) are concatenated
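
The averaging and concatenation can be sketched as below. `pair_cost` is a hypothetical callable standing in for SweepNet: given a camera pair (i, j) and an index n, it returns one (H, W) cost map. The sketch averages over every pair with equal weight, which is a simplification. Handling regions that a pair cannot see is not covered here.

```python
import numpy as np
from itertools import combinations

def build_cost_volume(pair_cost, num_cameras, N):
    """Average pairwise cost maps, then stack them over the N indices.

    pair_cost(i, j, n) -> (H, W) array (hypothetical stand-in for SweepNet).
    Returns a cost volume of shape (N, H, W).
    """
    pairs = list(combinations(range(num_cameras), 2))   # 6 pairs for 4 cameras
    slices = []
    for n in range(N):
        maps = [pair_cost(i, j, n) for i, j in pairs]
        slices.append(np.mean(maps, axis=0))            # averaging
    return np.stack(slices, axis=0)                     # concat over n
```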

5. Cost Aggregation

SGM (semi-global matching) is used to remove noise and refine the cost volume
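
To give a feel for how SGM smooths the cost volume, here is a sketch of the standard SGM recurrence along a single scan direction (left to right). Full SGM sums such path costs over several directions. The penalties `P1` and `P2` are illustrative values, not the ones used in the paper, and `aggregate_left_to_right` is a hypothetical name.

```python
import numpy as np

def aggregate_left_to_right(cost, P1=0.1, P2=1.0):
    """One SGM path over a (N, H, W) cost volume, scanning along +x.

    A change of one index between neighboring pixels costs P1,
    a larger jump costs P2.
    """
    N, H, W = cost.shape
    L = np.empty_like(cost, dtype=float)
    L[:, :, 0] = cost[:, :, 0]
    for x in range(1, W):
        prev = L[:, :, x - 1]                  # (N, H) path costs at x - 1
        prev_min = prev.min(axis=0)            # (H,) best cost at x - 1
        up = np.full_like(prev, np.inf)
        up[1:] = prev[:-1]                     # cost of coming from index n - 1
        down = np.full_like(prev, np.inf)
        down[:-1] = prev[1:]                   # cost of coming from index n + 1
        jump = np.broadcast_to(prev_min + P2, prev.shape)
        best = np.minimum.reduce([prev, up + P1, down + P1, jump])
        # Subtracting prev_min keeps the path cost from growing along the scan.
        L[:, :, x] = cost[:, :, x] + best - prev_min
    return L
```

In a small example where one pixel has a noisy minimum at a far-away index while its left neighbor strongly prefers index 0, the aggregated cost moves the minimum back to index 0.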

6. Obtaining the Final Prediction

For each ray p, pick the index n with the lowest cost and use it to determine the depth
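
The argmin step is short enough to write out directly. In this sketch, `inv_depths` is an assumed lookup table holding the inverse depth d_n for each index n, and `predict_depth` is a hypothetical name. An inverse depth of 0 maps to infinite depth.

```python
import numpy as np

def predict_depth(cost_volume, inv_depths):
    """Pick the lowest-cost index per ray and convert it to depth.

    cost_volume: (N, H, W), inv_depths: (N,) inverse depth per index.
    Returns the (H, W) index map and the (H, W) depth map.
    """
    idx = np.argmin(cost_volume, axis=0)
    with np.errstate(divide="ignore"):      # d_n = 0 means infinitely far
        depth = 1.0 / inv_depths[idx]
    return idx, depth
```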

Notations

X: a point in real 3D space

X_hat: the point formed on the fisheye lens

x: the point projected onto the 2D plane

Π(X; Φ) : projection function

Φ : intrinsic parameters

Θ : extrinsic parameters = (r^T, t^T)^T (r : rotation vector, t : translation vector)

{Θi} : camera poses

{Θk} : checkerboard poses

{Θi,k} : pose of checkerboard k relative to camera i

A : affine transformation function

M(Θ) : rigid transformation matrix = [R(r) t] (R(r) : rotation matrix)

i, j : camera indices

k : capture index

⋆ (star) : composition. Used to obtain the pose of camera i relative to camera j
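
Composition of rigid transformations can be illustrated with 4 × 4 homogeneous matrices, i.e. M(Θ) = [R(r) t] with the row (0, 0, 0, 1) appended. The sketch assumes that each M maps rig coordinates into a camera frame, so the pose of camera i relative to camera j is M_i composed with the inverse of M_j. That convention, and the names `rigid_matrix` and `relative_pose`, are assumptions for illustration.

```python
import numpy as np

def rigid_matrix(R, t):
    """Build the 4x4 homogeneous form of M = [R t]."""
    M = np.eye(4)
    M[:3, :3] = R
    M[:3, 3] = t
    return M

def relative_pose(M_i, M_j):
    """Transform from camera j's frame to camera i's frame (assumed convention)."""
    return M_i @ np.linalg.inv(M_j)
```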

Experiments

Datasets: realistic synthetic urban datasets and real-world datasets

Evaluation

Baselines: ZNCC, MC-CNN

Ablation study: with and without cost aggregation (SGM)

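Strengths and differentiators: the camera rig takes omnidirectional input, so it has few blind spots. However, there has been little research on depth estimation for omnidirectional wide-baseline data. This paper adapts to that kind of input by warping onto spherical images and by averaging and concatenating when computing the cost volume. It also improves performance by extracting global features in the deep network and applying cost aggregation to the cost volume.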

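Further research points: stereo and monocular depth estimation have been active research areas recently, and ideas could be borrowed from them. Can the pipeline be implemented end-to-end? What are the drawbacks of wide-angle omnidirectional cameras? What changes if the SweepNet architecture is modified? These are worth investigating and testing experimentally.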