float16 matmul is way slower than float32 matmul on CPU #24738

 Open

dchatterjee172 opened this issue on 7 Jan 2019 · 1 comment

Comments

dchatterjee172 commented on 7 Jan 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): YES
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): 1.12.0
  • Python version: 3.5.2

You can collect some of this information using our environment capture script
You can also obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Describe the current behavior
float16 matmul is way slower than float32 matmul on CPU

Code to reproduce the issue

import tensorflow as tf
import time
from datetime import timedelta

# Identical [1, 768, 768] operands in float16 and float32.
a = tf.random.normal(shape=[1, 768, 768], dtype=tf.float16)
b = tf.random.normal(shape=[1, 768, 768], dtype=tf.float16)

c = tf.random.normal(shape=[1, 768, 768], dtype=tf.float32)
d = tf.random.normal(shape=[1, 768, 768], dtype=tf.float32)

e = tf.matmul(a, b)  # float16 matmul
f = tf.matmul(c, d)  # float32 matmul

# CPU-only session (GPU disabled via device_count).
config = tf.ConfigProto(
    intra_op_parallelism_threads=24,
    inter_op_parallelism_threads=24,
    allow_soft_placement=True,
    device_count={"GPU": 0},
)

with tf.Session(config=config) as sess:
    # Alternate between the float16 and float32 matmuls and time each run.
    for i in range(100):
        if i % 2:
            print("16bit -- ", end="")
            op = e
        else:
            print("32bit -- ", end="")
            op = f
        start = time.monotonic()
        sess.run(op)
        end = time.monotonic()
        print(i, timedelta(seconds=end - start))

output

2019-01-07 16:06:19.698878: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
32bit -- 0 0:00:00.017297
16bit -- 1 0:00:00.275746
32bit -- 2 0:00:00.002908
16bit -- 3 0:00:00.261320
32bit -- 4 0:00:00.003028
16bit -- 5 0:00:00.253561
32bit -- 6 0:00:00.002849
16bit -- 7 0:00:00.256515
32bit -- 8 0:00:00.006011
16bit -- 9 0:00:00.255613
32bit -- 10 0:00:00.003996
16bit -- 11 0:00:00.242231
32bit -- 12 0:00:00.003338
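
A workaround sketch (not part of the original report, and the a32/b32/g names are just for illustration): if float32 compute on CPU is acceptable, cast the float16 operands up before the matmul and cast the result back down; the casts are cheap compared with the emulated float16 kernel. Reusing the tensors a and b defined above:

# Sketch: keep storage in float16 but do the arithmetic in float32 on CPU.
a32 = tf.cast(a, tf.float32)
b32 = tf.cast(b, tf.float32)
g = tf.cast(tf.matmul(a32, b32), tf.float16)  # same shape and dtype as e

Running sess.run(g) in the session above should take roughly as long as the float32 case plus the cast overhead.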

@jvishnuvardhan jvishnuvardhan self-assigned this on 9 Jan 2019

@jvishnuvardhan jvishnuvardhan added type:support type:others comp:ops labels on 9 Jan 2019

@jvishnuvardhan jvishnuvardhan assigned rmlarsen and unassigned jvishnuvardhan on 9 Jan 2019

naisy commented on 9 Jan 2019

The reason is simple: Intel CPUs do not support FP16 arithmetic natively, so float16 operations have to be emulated (converted to float32 and back), which makes the float16 matmul much slower.
See also:
https://stackoverflow.com/questions/49995594/half-precision-floating-point-arithmetic-on-intel-chips
https://stackoverflow.com/questions/15340781/python-numpy-data-types-performance
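
For illustration only (not from this thread): the same effect shows up in plain NumPy, since BLAS libraries ship float32/float64 GEMM kernels but no float16 one, so a float16 matmul falls back to a much slower generic path. A minimal timing sketch, assuming NumPy is installed:

import time
import numpy as np

# Compare matmul wall time for the same data in float16 and float32.
x16 = np.random.rand(768, 768).astype(np.float16)
x32 = x16.astype(np.float32)

for name, x in (("float16", x16), ("float32", x32)):
    start = time.monotonic()
    x @ x  # float16 has no BLAS GEMM kernel, so this runs a slow generic loop
    print(name, time.monotonic() - start)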

