diff --git "a/nix-build.log" "b/nix-build.log" --- "a/nix-build.log" +++ "b/nix-build.log" @@ -10,18 +10,18 @@ evaluation warning: `rev` argument of `genFlakeOutputs` is deprecated, pass `sel path = ./.; }; these 7 derivations will be built: - /nix/store/5i3gnhgvv278c7m9q3x3agksl5jab9ck-sage_attention-torch-ext.drv - /nix/store/bawig99wpvl8dvmdb3znykgir3w1nw15-sage_attention-torch-ext.drv - /nix/store/jzgwmpf18h1rrvfhclhq4say6m90j7y4-sage_attention-torch-ext.drv - /nix/store/msa3cr0rrgkm0dagqbcs67k8s169474b-sage_attention-torch-ext.drv - /nix/store/qckl1ak5l089b2sakw4h2whnd6mg16ld-sage_attention-torch-ext.drv - /nix/store/xq28asxbqp6g7x8bcz92xl849prg2899-torch-ext-bundle.drv - /nix/store/rkzh9xwk6kdgl1by4xfwmyvb5arpfqby-build-and-copy.drv -building '/nix/store/5i3gnhgvv278c7m9q3x3agksl5jab9ck-sage_attention-torch-ext.drv'... -building '/nix/store/bawig99wpvl8dvmdb3znykgir3w1nw15-sage_attention-torch-ext.drv'... -building '/nix/store/jzgwmpf18h1rrvfhclhq4say6m90j7y4-sage_attention-torch-ext.drv'... -building '/nix/store/msa3cr0rrgkm0dagqbcs67k8s169474b-sage_attention-torch-ext.drv'... -building '/nix/store/qckl1ak5l089b2sakw4h2whnd6mg16ld-sage_attention-torch-ext.drv'... + /nix/store/d04lffjyka9nfvrhmr8863813bwkcn0w-sage_attention-torch-ext.drv + /nix/store/g8li7ymzmry7mpxm9k43zkmhzrk2nxsz-sage_attention-torch-ext.drv + /nix/store/gaqkbs2b2k9x1yh88ax8vb5rnnl81xmy-sage_attention-torch-ext.drv + /nix/store/jlfw8d4bqv1cgckbkmjsam39djmgjsl1-sage_attention-torch-ext.drv + /nix/store/p79rdm4kvf0jr7vkv277nbi5mmc2lwyb-sage_attention-torch-ext.drv + /nix/store/zcfc2w942q3a6lpp77cmz64zdis9i1dz-torch-ext-bundle.drv + /nix/store/q2d20wl8cfvw82mp757i59cvq8z9wmpv-build-and-copy.drv +building '/nix/store/d04lffjyka9nfvrhmr8863813bwkcn0w-sage_attention-torch-ext.drv'... +building '/nix/store/g8li7ymzmry7mpxm9k43zkmhzrk2nxsz-sage_attention-torch-ext.drv'... +building '/nix/store/gaqkbs2b2k9x1yh88ax8vb5rnnl81xmy-sage_attention-torch-ext.drv'... 
+building '/nix/store/jlfw8d4bqv1cgckbkmjsam39djmgjsl1-sage_attention-torch-ext.drv'... +building '/nix/store/p79rdm4kvf0jr7vkv277nbi5mmc2lwyb-sage_attention-torch-ext.drv'... sage_attention-torch-ext> Sourcing get-kernel-check-hook.sh sage_attention-torch-ext> Sourcing setup-cuda-hook sage_attention-torch-ext> Sourcing get-kernel-check-hook.sh @@ -33,78 +33,77 @@ sage_attention-torch-ext> Sourcing setup-cuda-hook sage_attention-torch-ext> Sourcing get-kernel-check-hook.sh sage_attention-torch-ext> Sourcing setup-cuda-hook sage_attention-torch-ext> Running phase: unpackPhase -sage_attention-torch-ext> unpacking source archive /nix/store/zgm080lkrxljczr1rfx3aa781rzxzc4p-source -sage_attention-torch-ext> source root is source sage_attention-torch-ext> Running phase: unpackPhase -sage_attention-torch-ext> Running phase: patchPhase -sage_attention-torch-ext> unpacking source archive /nix/store/zgm080lkrxljczr1rfx3aa781rzxzc4p-source -sage_attention-torch-ext> source root is source -sage_attention-torch-ext> Running phase: updateAutotoolsGnuConfigScriptsPhase +sage_attention-torch-ext> unpacking source archive /nix/store/9ns9nazv5cyxcnpq9fw96ccinmp18kv1-source +sage_attention-torch-ext> unpacking source archive /nix/store/9ns9nazv5cyxcnpq9fw96ccinmp18kv1-source sage_attention-torch-ext> Running phase: unpackPhase +sage_attention-torch-ext> unpacking source archive /nix/store/9ns9nazv5cyxcnpq9fw96ccinmp18kv1-source +sage_attention-torch-ext> source root is source +sage_attention-torch-ext> source root is source +sage_attention-torch-ext> source root is source +sage_attention-torch-ext> Running phase: patchPhase sage_attention-torch-ext> Running phase: patchPhase -sage_attention-torch-ext> unpacking source archive /nix/store/zgm080lkrxljczr1rfx3aa781rzxzc4p-source +sage_attention-torch-ext> Running phase: patchPhase +sage_attention-torch-ext> Running phase: unpackPhase +sage_attention-torch-ext> Running phase: unpackPhase +sage_attention-torch-ext> unpacking source 
archive /nix/store/9ns9nazv5cyxcnpq9fw96ccinmp18kv1-source +sage_attention-torch-ext> Running phase: updateAutotoolsGnuConfigScriptsPhase +sage_attention-torch-ext> Running phase: updateAutotoolsGnuConfigScriptsPhase +sage_attention-torch-ext> Running phase: updateAutotoolsGnuConfigScriptsPhase +sage_attention-torch-ext> unpacking source archive /nix/store/9ns9nazv5cyxcnpq9fw96ccinmp18kv1-source +sage_attention-torch-ext> source root is source +sage_attention-torch-ext> Running phase: configurePhase +sage_attention-torch-ext> Running phase: configurePhase sage_attention-torch-ext> Running phase: configurePhase sage_attention-torch-ext> source root is source -sage_attention-torch-ext> Running phase: updateAutotoolsGnuConfigScriptsPhase -sage_attention-torch-ext> Running phase: unpackPhase -sage_attention-torch-ext> Executing setupCUDAToolkitCompilers sage_attention-torch-ext> Running phase: patchPhase +sage_attention-torch-ext> Running phase: patchPhase +sage_attention-torch-ext> Executing setupCUDAToolkitCompilers sage_attention-torch-ext> fixing cmake files... -sage_attention-torch-ext> unpacking source archive /nix/store/zgm080lkrxljczr1rfx3aa781rzxzc4p-source -sage_attention-torch-ext> Running phase: configurePhase -sage_attention-torch-ext> source root is source sage_attention-torch-ext> Running phase: updateAutotoolsGnuConfigScriptsPhase -sage_attention-torch-ext> Running phase: unpackPhase sage_attention-torch-ext> Executing setupCUDAToolkitCompilers -sage_attention-torch-ext> Running phase: patchPhase sage_attention-torch-ext> fixing cmake files... 
-sage_attention-torch-ext> unpacking source archive /nix/store/zgm080lkrxljczr1rfx3aa781rzxzc4p-source -sage_attention-torch-ext> Running phase: configurePhase -sage_attention-torch-ext> source root is source -sage_attention-torch-ext> Running phase: updateAutotoolsGnuConfigScriptsPhase sage_attention-torch-ext> Executing setupCUDAToolkitCompilers -sage_attention-torch-ext> Running phase: patchPhase +sage_attention-torch-ext> Running phase: updateAutotoolsGnuConfigScriptsPhase sage_attention-torch-ext> fixing cmake files... sage_attention-torch-ext> Running phase: configurePhase -sage_attention-torch-ext> cmake flags: -GNinja -DCMAKE_FIND_USE_SYSTEM_PACKAGE_REGISTRY=OFF -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=OFF -DCMAKE_INSTALL_LOCALEDIR=/nix/store/wqplrwp0m14wqjzmh3x18a5pgc2kcs8k-sage_attention-torch-ext/share/locale -DCMAKE_INSTALL_LIBEXECDIR=/nix/store/wqplrwp0m14wqjzmh3x18a5pgc2kcs8k-sage_attention-torch-ext/libexec -DCMAKE_INSTALL_LIBDIR=/nix/store/wqplrwp0m14wqjzmh3x18a5pgc2kcs8k-sage_attention-torch-ext/lib -DCMAKE_INSTALL_DOCDIR=/nix/store/wqplrwp0m14wqjzmh3x18a5pgc2kcs8k-sage_attention-torch-ext/share/doc/sage_attention -DCMAKE_INSTALL_INFODIR=/nix/store/wqplrwp0m14wqjzmh3x18a5pgc2kcs8k-sage_attention-torch-ext/share/info -DCMAKE_INSTALL_MANDIR=/nix/store/wqplrwp0m14wqjzmh3x18a5pgc2kcs8k-sage_attention-torch-ext/share/man -DCMAKE_INSTALL_INCLUDEDIR=/nix/store/wqplrwp0m14wqjzmh3x18a5pgc2kcs8k-sage_attention-torch-ext/include -DCMAKE_INSTALL_SBINDIR=/nix/store/wqplrwp0m14wqjzmh3x18a5pgc2kcs8k-sage_attention-torch-ext/sbin -DCMAKE_INSTALL_BINDIR=/nix/store/wqplrwp0m14wqjzmh3x18a5pgc2kcs8k-sage_attention-torch-ext/bin -DCMAKE_INSTALL_NAME_DIR=/nix/store/wqplrwp0m14wqjzmh3x18a5pgc2kcs8k-sage_attention-torch-ext/lib -DCMAKE_POLICY_DEFAULT_CMP0025=NEW -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_STRIP=/nix/store/rgfv9lch0b6ksjzlzsx0mljsb0ypqr8x-gcc-wrapper-13.4.0/bin/strip 
-DCMAKE_RANLIB=/nix/store/rgfv9lch0b6ksjzlzsx0mljsb0ypqr8x-gcc-wrapper-13.4.0/bin/ranlib -DCMAKE_AR=/nix/store/rgfv9lch0b6ksjzlzsx0mljsb0ypqr8x-gcc-wrapper-13.4.0/bin/ar -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_INSTALL_PREFIX=/nix/store/wqplrwp0m14wqjzmh3x18a5pgc2kcs8k-sage_attention-torch-ext -DPython_EXECUTABLE:STRING=/nix/store/j6r6hpjs8p5m4s3i8cqqavg62fd5z48g-python3-3.13.6-env/bin/python -DCMAKE_CUDA_HOST_COMPILER:STRING=/nix/store/rgfv9lch0b6ksjzlzsx0mljsb0ypqr8x-gcc-wrapper-13.4.0/bin/g++ -DNVCC_THREADS=3 -DCUDAToolkit_INCLUDE_DIR=/nix/store/7iw4ipbdy17yzvqjhxpw03i17kq7f7rj-cuda_nvcc-12.6.85/include\;/nix/store/5f6h6xs5c74iqcjda3y73i290mfwfs9x-cuda_nvml_dev-12.6.77-dev/include\;/nix/store/r26q9f2lhsvimxha44g1xcck3adrdqwg-cuda_nvrtc-12.6.85-dev/include\;/nix/store/nj1a061pvzpq9dr65yj3jpjqcx6pr4fq-cuda_nvtx-12.6.77-dev/include\;/nix/store/9ik1skjb698l6vkx4m4wvx2nrr4sx0na-libcufft-11.3.0.4-dev/include\;/nix/store/vl1dficb0blxzqg6xqzfi5p119jvl2vi-libcusolver-11.7.1.2-dev/include\;/nix/store/n7x9kkzi2jdfj6f6yjwywfhyfmn957zp-cuda_cupti-12.6.80-dev/include\;/nix/store/sskxmb670akk0avrahrl4r6hp7925zh8-cuda_cudart-12.6.77-dev/include\;/nix/store/8a9vz66yzsar01lpgipmzq8skyk3ymkp-cuda_cccl-12.6.77-dev/include\;/nix/store/xd2xrldv3lbg1bk93nr0yccy6j0vhh2k-cudnn-9.11.0.98-dev/include\;/nix/store/0w4g3rxgkw9r0lv737rslqdk7wldmi0n-libcurand-10.3.7.77-dev/include\;/nix/store/m0s4p867fk6wk8ba7ym9yff4mayqjhlw-libcusparse-12.5.4.2-dev/include\;/nix/store/blh9iyvjkmwd871mfjvfhnp7njwgnc6b-cuda_profiler_api-12.6.77-dev/include\;/nix/store/fy71fffqbwg3xgvygn66kd4igj65gblv-libcublas-12.6.4.1-dev/include\;/nix/store/4pwy3k2s52ppzbs3k6d58kda8jhmiim4-libcufile-1.11.1.6-dev/include 
-DCUDAToolkit_ROOT=/nix/store/7iw4ipbdy17yzvqjhxpw03i17kq7f7rj-cuda_nvcc-12.6.85\;/nix/store/1qgrl2sgdj5m7llm2vs9690gd9998psq-cudnn-9.11.0.98\;/nix/store/d2z15dzsgfm4r2yyl16n3wc0sw8z6fia-cuda_cupti-12.6.80-lib\;/nix/store/86ngm5djfbl6a0i43j282680chqz1vr8-libcusparse-12.5.4.2-lib\;/nix/store/bmph9rbyqnyjs02zriwq78kg16h12wi6-libcublas-12.6.4.1-lib\;/nix/store/wny8xmyma0ziffas96ansxgmjfqpw393-cuda_nvrtc-12.6.85-lib\;/nix/store/j40ndiqjiqbiqrbfmgmkzz6w8757cgvk-cuda_nvml_dev-12.6.77-lib\;/nix/store/3ii532blh586xxavim32i21kr84wlcdc-cuda_profiler_api-12.6.77\;/nix/store/j32l8jnzckhdy2lzxgyd59y7p39y6b1d-libcusolver-11.7.1.2-static\;/nix/store/5iv2zpbf4k00ch4c5zfi5b8dlj90y3d3-cuda_cccl-12.6.77\;/nix/store/a8yi28jqv5185bbv10jpjja3x98i86hm-cuda_cudart-12.6.77-stubs\;/nix/store/ya85qn68jv6mlq6gh6phh5hwk3dkynag-cuda_cudart-12.6.77-static\;/nix/store/m65ribrsnk3gbabcx9ah6phgiil19j01-libcufile-1.11.1.6\;/nix/store/5f6h6xs5c74iqcjda3y73i290mfwfs9x-cuda_nvml_dev-12.6.77-dev\;/nix/store/r26q9f2lhsvimxha44g1xcck3adrdqwg-cuda_nvrtc-12.6.85-dev\;/nix/store/nj1a061pvzpq9dr65yj3jpjqcx6pr4fq-cuda_nvtx-12.6.77-dev\;/nix/store/bcvj4g3f3n6cpb6czcb5k8zdmyd94fwi-cuda_nvtx-12.6.77-lib\;/nix/store/9ik1skjb698l6vkx4m4wvx2nrr4sx0na-libcufft-11.3.0.4-dev\;/nix/store/k5rbpivsz3ilsxg91pgigp6la8ln3cv9-cuda_cupti-12.6.80\;/nix/store/vl1dficb0blxzqg6xqzfi5p119jvl2vi-libcusolver-11.7.1.2-dev\;/nix/store/f87x0n0gi2d7rxh1ja92za2ixcw60q2p-cuda_nvtx-12.6.77\;/nix/store/n7x9kkzi2jdfj6f6yjwywfhyfmn957zp-cuda_cupti-12.6.80-dev\;/nix/store/m0fwdgh4nmrjd0q9v4m2ly63qbcq2hi2-cuda_cudart-12.6.77\;/nix/store/qfaxx4b8l1alrrl0gbyb23k3j850c0v5-libcurand-10.3.7.77-static\;/nix/store/w1npzy8mfl28w7cib5idkg6nvlbzhpzq-libcufile-1.11.1.6-lib\;/nix/store/8abbm2gd77dv0l3acw0s18wln36aa0l5-cuda_cudart-12.6.77-lib\;/nix/store/ykb9bv2lqkf1wzy73q96cb04pybx9xa2-cuda_nvcc-12.6.85-static\;/nix/store/nw9ws2qvhgdb33qgfx4iqj517814qq8y-libcufft-11.3.0.4\;/nix/store/sskxmb670akk0avrahrl4r6hp7925zh8-cuda_cudart-12.6.77-dev\;/nix/store/mfc3ah
6lwfd8dfbs77b0z9i75c471b0n-libcufft-11.3.0.4-static\;/nix/store/zk3cg1ws6cskrzyhdr5d68f8zrkfk77d-cuda_nvrtc-12.6.85-static\;/nix/store/pcrirrvn2ya5d3r1y18s2zj4pm2jladw-libcusolver-11.7.1.2\;/nix/store/qdn67x8jrwr418air16kwicya4d747pq-libcufft-11.3.0.4-lib\;/nix/store/dg8hyrzy7sh3wdhcr4ywsz05cvl6vfyc-libcusparse-12.5.4.2\;/nix/store/8a9vz66yzsar01lpgipmzq8skyk3ymkp-cuda_cccl-12.6.77-dev\;/nix/store/wmcrrdxd3db58nklyp7yf90kknfdx6b5-libcurand-10.3.7.77-lib\;/nix/store/xd2xrldv3lbg1bk93nr0yccy6j0vhh2k-cudnn-9.11.0.98-dev\;/nix/store/0w4g3rxgkw9r0lv737rslqdk7wldmi0n-libcurand-10.3.7.77-dev\;/nix/store/jr1397g6pshvil5n4lnvp7dm24dm71h8-libcublas-12.6.4.1-static\;/nix/store/wq0wv7df58h6bgggnz964sk8m1hbkxxp-cuda_cupti-12.6.80-sample\;/nix/store/m0s4p867fk6wk8ba7ym9yff4mayqjhlw-libcusparse-12.5.4.2-dev\;/nix/store/blh9iyvjkmwd871mfjvfhnp7njwgnc6b-cuda_profiler_api-12.6.77-dev\;/nix/store/ngwsphsxf906z7cgwg32d1w83p809ywl-cudnn-9.11.0.98-static\;/nix/store/07zlxn68jyf4s263xafnjid55grmi7a2-cuda_nvrtc-12.6.85\;/nix/store/zyh7hqq402zc7dhafhbh9vycyzcfq256-libcurand-10.3.7.77\;/nix/store/x7mww4k0zzzb7bnffv0b22jqbyf1mg3v-cuda_cupti-12.6.80-static\;/nix/store/xvlapjc6spss1kvbjlq97m6pk19hfrxz-cuda_nvml_dev-12.6.77\;/nix/store/7j4zf0r8flh7l4x5pm1mgqb2vcabmcdj-libcusolver-11.7.1.2-lib\;/nix/store/gs8gw8bgjccrjxlyzhxa7h85gkxgqwhn-libcufile-1.11.1.6-static\;/nix/store/p9dnsv7mv8mqm9aisrckq8lm3zs3l7dk-cudnn-9.11.0.98-lib\;/nix/store/fy71fffqbwg3xgvygn66kd4igj65gblv-libcublas-12.6.4.1-dev\;/nix/store/dpska4iiya4xa5zzzmqzx3ljws73bnds-cuda_nvml_dev-12.6.77-static\;/nix/store/gzykkbwmch7pxgfzf86fg0b928lz6b36-libcusparse-12.5.4.2-static\;/nix/store/nqn7lvw8gbwbymdhz4nak9wf9b5bbah9-libcublas-12.6.4.1\;/nix/store/4pwy3k2s52ppzbs3k6d58kda8jhmiim4-libcufile-1.11.1.6-dev -DPROTOC_EXE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DProtobuf_PROTOC_EXE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc 
-DProtobuf_PROTOC_EXECUTABLE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DPYBIND11_PYTHONLIBS_OVERWRITE=OFF -DPYTHON_EXECUTABLE=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/bin/python3.13 -DPYTHON_INCLUDE_DIR=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/include/python3.13 -DPYTHON_SITE_PACKAGES=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/lib/python3.13/site-packages -sage_attention-torch-ext> Running phase: updateAutotoolsGnuConfigScriptsPhase +sage_attention-torch-ext> Running phase: configurePhase sage_attention-torch-ext> Executing setupCUDAToolkitCompilers sage_attention-torch-ext> fixing cmake files... -sage_attention-torch-ext> Running phase: configurePhase -sage_attention-torch-ext> cmake flags: -GNinja -DCMAKE_FIND_USE_SYSTEM_PACKAGE_REGISTRY=OFF -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=OFF -DCMAKE_INSTALL_LOCALEDIR=/nix/store/794gpxvpsn0jkzi2zndpd6i6nhspwid2-sage_attention-torch-ext/share/locale -DCMAKE_INSTALL_LIBEXECDIR=/nix/store/794gpxvpsn0jkzi2zndpd6i6nhspwid2-sage_attention-torch-ext/libexec -DCMAKE_INSTALL_LIBDIR=/nix/store/794gpxvpsn0jkzi2zndpd6i6nhspwid2-sage_attention-torch-ext/lib -DCMAKE_INSTALL_DOCDIR=/nix/store/794gpxvpsn0jkzi2zndpd6i6nhspwid2-sage_attention-torch-ext/share/doc/sage_attention -DCMAKE_INSTALL_INFODIR=/nix/store/794gpxvpsn0jkzi2zndpd6i6nhspwid2-sage_attention-torch-ext/share/info -DCMAKE_INSTALL_MANDIR=/nix/store/794gpxvpsn0jkzi2zndpd6i6nhspwid2-sage_attention-torch-ext/share/man -DCMAKE_INSTALL_INCLUDEDIR=/nix/store/794gpxvpsn0jkzi2zndpd6i6nhspwid2-sage_attention-torch-ext/include -DCMAKE_INSTALL_SBINDIR=/nix/store/794gpxvpsn0jkzi2zndpd6i6nhspwid2-sage_attention-torch-ext/sbin -DCMAKE_INSTALL_BINDIR=/nix/store/794gpxvpsn0jkzi2zndpd6i6nhspwid2-sage_attention-torch-ext/bin -DCMAKE_INSTALL_NAME_DIR=/nix/store/794gpxvpsn0jkzi2zndpd6i6nhspwid2-sage_attention-torch-ext/lib 
-DCMAKE_POLICY_DEFAULT_CMP0025=NEW -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_STRIP=/nix/store/rgfv9lch0b6ksjzlzsx0mljsb0ypqr8x-gcc-wrapper-13.4.0/bin/strip -DCMAKE_RANLIB=/nix/store/rgfv9lch0b6ksjzlzsx0mljsb0ypqr8x-gcc-wrapper-13.4.0/bin/ranlib -DCMAKE_AR=/nix/store/rgfv9lch0b6ksjzlzsx0mljsb0ypqr8x-gcc-wrapper-13.4.0/bin/ar -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_INSTALL_PREFIX=/nix/store/794gpxvpsn0jkzi2zndpd6i6nhspwid2-sage_attention-torch-ext -DPython_EXECUTABLE:STRING=/nix/store/r3gwdvvsgl1csl12f4pkhz0jhsch7bdy-python3-3.13.6-env/bin/python -DCMAKE_CUDA_HOST_COMPILER:STRING=/nix/store/rgfv9lch0b6ksjzlzsx0mljsb0ypqr8x-gcc-wrapper-13.4.0/bin/g++ -DNVCC_THREADS=3 -DCUDAToolkit_INCLUDE_DIR=/nix/store/7iw4ipbdy17yzvqjhxpw03i17kq7f7rj-cuda_nvcc-12.6.85/include\;/nix/store/5f6h6xs5c74iqcjda3y73i290mfwfs9x-cuda_nvml_dev-12.6.77-dev/include\;/nix/store/r26q9f2lhsvimxha44g1xcck3adrdqwg-cuda_nvrtc-12.6.85-dev/include\;/nix/store/9ik1skjb698l6vkx4m4wvx2nrr4sx0na-libcufft-11.3.0.4-dev/include\;/nix/store/vl1dficb0blxzqg6xqzfi5p119jvl2vi-libcusolver-11.7.1.2-dev/include\;/nix/store/n7x9kkzi2jdfj6f6yjwywfhyfmn957zp-cuda_cupti-12.6.80-dev/include\;/nix/store/sskxmb670akk0avrahrl4r6hp7925zh8-cuda_cudart-12.6.77-dev/include\;/nix/store/8a9vz66yzsar01lpgipmzq8skyk3ymkp-cuda_cccl-12.6.77-dev/include\;/nix/store/xd2xrldv3lbg1bk93nr0yccy6j0vhh2k-cudnn-9.11.0.98-dev/include\;/nix/store/0w4g3rxgkw9r0lv737rslqdk7wldmi0n-libcurand-10.3.7.77-dev/include\;/nix/store/m0s4p867fk6wk8ba7ym9yff4mayqjhlw-libcusparse-12.5.4.2-dev/include\;/nix/store/blh9iyvjkmwd871mfjvfhnp7njwgnc6b-cuda_profiler_api-12.6.77-dev/include\;/nix/store/fy71fffqbwg3xgvygn66kd4igj65gblv-libcublas-12.6.4.1-dev/include\;/nix/store/4pwy3k2s52ppzbs3k6d58kda8jhmiim4-libcufile-1.11.1.6-dev/include 
-DCUDAToolkit_ROOT=/nix/store/7iw4ipbdy17yzvqjhxpw03i17kq7f7rj-cuda_nvcc-12.6.85\;/nix/store/1qgrl2sgdj5m7llm2vs9690gd9998psq-cudnn-9.11.0.98\;/nix/store/d2z15dzsgfm4r2yyl16n3wc0sw8z6fia-cuda_cupti-12.6.80-lib\;/nix/store/86ngm5djfbl6a0i43j282680chqz1vr8-libcusparse-12.5.4.2-lib\;/nix/store/bmph9rbyqnyjs02zriwq78kg16h12wi6-libcublas-12.6.4.1-lib\;/nix/store/wny8xmyma0ziffas96ansxgmjfqpw393-cuda_nvrtc-12.6.85-lib\;/nix/store/j40ndiqjiqbiqrbfmgmkzz6w8757cgvk-cuda_nvml_dev-12.6.77-lib\;/nix/store/3ii532blh586xxavim32i21kr84wlcdc-cuda_profiler_api-12.6.77\;/nix/store/j32l8jnzckhdy2lzxgyd59y7p39y6b1d-libcusolver-11.7.1.2-static\;/nix/store/5iv2zpbf4k00ch4c5zfi5b8dlj90y3d3-cuda_cccl-12.6.77\;/nix/store/a8yi28jqv5185bbv10jpjja3x98i86hm-cuda_cudart-12.6.77-stubs\;/nix/store/ya85qn68jv6mlq6gh6phh5hwk3dkynag-cuda_cudart-12.6.77-static\;/nix/store/m65ribrsnk3gbabcx9ah6phgiil19j01-libcufile-1.11.1.6\;/nix/store/5f6h6xs5c74iqcjda3y73i290mfwfs9x-cuda_nvml_dev-12.6.77-dev\;/nix/store/r26q9f2lhsvimxha44g1xcck3adrdqwg-cuda_nvrtc-12.6.85-dev\;/nix/store/9ik1skjb698l6vkx4m4wvx2nrr4sx0na-libcufft-11.3.0.4-dev\;/nix/store/k5rbpivsz3ilsxg91pgigp6la8ln3cv9-cuda_cupti-12.6.80\;/nix/store/vl1dficb0blxzqg6xqzfi5p119jvl2vi-libcusolver-11.7.1.2-dev\;/nix/store/n7x9kkzi2jdfj6f6yjwywfhyfmn957zp-cuda_cupti-12.6.80-dev\;/nix/store/m0fwdgh4nmrjd0q9v4m2ly63qbcq2hi2-cuda_cudart-12.6.77\;/nix/store/qfaxx4b8l1alrrl0gbyb23k3j850c0v5-libcurand-10.3.7.77-static\;/nix/store/w1npzy8mfl28w7cib5idkg6nvlbzhpzq-libcufile-1.11.1.6-lib\;/nix/store/8abbm2gd77dv0l3acw0s18wln36aa0l5-cuda_cudart-12.6.77-lib\;/nix/store/ykb9bv2lqkf1wzy73q96cb04pybx9xa2-cuda_nvcc-12.6.85-static\;/nix/store/nw9ws2qvhgdb33qgfx4iqj517814qq8y-libcufft-11.3.0.4\;/nix/store/sskxmb670akk0avrahrl4r6hp7925zh8-cuda_cudart-12.6.77-dev\;/nix/store/mfc3ah6lwfd8dfbs77b0z9i75c471b0n-libcufft-11.3.0.4-static\;/nix/store/zk3cg1ws6cskrzyhdr5d68f8zrkfk77d-cuda_nvrtc-12.6.85-static\;/nix/store/pcrirrvn2ya5d3r1y18s2zj4pm2jladw-libcusolver-11.7.1.2\;/nix/st
ore/qdn67x8jrwr418air16kwicya4d747pq-libcufft-11.3.0.4-lib\;/nix/store/dg8hyrzy7sh3wdhcr4ywsz05cvl6vfyc-libcusparse-12.5.4.2\;/nix/store/8a9vz66yzsar01lpgipmzq8skyk3ymkp-cuda_cccl-12.6.77-dev\;/nix/store/wmcrrdxd3db58nklyp7yf90kknfdx6b5-libcurand-10.3.7.77-lib\;/nix/store/xd2xrldv3lbg1bk93nr0yccy6j0vhh2k-cudnn-9.11.0.98-dev\;/nix/store/0w4g3rxgkw9r0lv737rslqdk7wldmi0n-libcurand-10.3.7.77-dev\;/nix/store/jr1397g6pshvil5n4lnvp7dm24dm71h8-libcublas-12.6.4.1-static\;/nix/store/wq0wv7df58h6bgggnz964sk8m1hbkxxp-cuda_cupti-12.6.80-sample\;/nix/store/m0s4p867fk6wk8ba7ym9yff4mayqjhlw-libcusparse-12.5.4.2-dev\;/nix/store/blh9iyvjkmwd871mfjvfhnp7njwgnc6b-cuda_profiler_api-12.6.77-dev\;/nix/store/ngwsphsxf906z7cgwg32d1w83p809ywl-cudnn-9.11.0.98-static\;/nix/store/07zlxn68jyf4s263xafnjid55grmi7a2-cuda_nvrtc-12.6.85\;/nix/store/zyh7hqq402zc7dhafhbh9vycyzcfq256-libcurand-10.3.7.77\;/nix/store/x7mww4k0zzzb7bnffv0b22jqbyf1mg3v-cuda_cupti-12.6.80-static\;/nix/store/xvlapjc6spss1kvbjlq97m6pk19hfrxz-cuda_nvml_dev-12.6.77\;/nix/store/7j4zf0r8flh7l4x5pm1mgqb2vcabmcdj-libcusolver-11.7.1.2-lib\;/nix/store/gs8gw8bgjccrjxlyzhxa7h85gkxgqwhn-libcufile-1.11.1.6-static\;/nix/store/p9dnsv7mv8mqm9aisrckq8lm3zs3l7dk-cudnn-9.11.0.98-lib\;/nix/store/fy71fffqbwg3xgvygn66kd4igj65gblv-libcublas-12.6.4.1-dev\;/nix/store/dpska4iiya4xa5zzzmqzx3ljws73bnds-cuda_nvml_dev-12.6.77-static\;/nix/store/gzykkbwmch7pxgfzf86fg0b928lz6b36-libcusparse-12.5.4.2-static\;/nix/store/nqn7lvw8gbwbymdhz4nak9wf9b5bbah9-libcublas-12.6.4.1\;/nix/store/4pwy3k2s52ppzbs3k6d58kda8jhmiim4-libcufile-1.11.1.6-dev -DPROTOC_EXE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DProtobuf_PROTOC_EXE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DProtobuf_PROTOC_EXECUTABLE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DPYBIND11_PYTHONLIBS_OVERWRITE=OFF -DPYTHON_EXECUTABLE=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/bin/python3.13 
-DPYTHON_INCLUDE_DIR=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/include/python3.13 -DPYTHON_SITE_PACKAGES=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/lib/python3.13/site-packages sage_attention-torch-ext> Executing setupCUDAToolkitCompilers -sage_attention-torch-ext> cmake flags: -GNinja -DCMAKE_FIND_USE_SYSTEM_PACKAGE_REGISTRY=OFF -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=OFF -DCMAKE_INSTALL_LOCALEDIR=/nix/store/vas14gsv17kdkap3r9szvjsclanmwq25-sage_attention-torch-ext/share/locale -DCMAKE_INSTALL_LIBEXECDIR=/nix/store/vas14gsv17kdkap3r9szvjsclanmwq25-sage_attention-torch-ext/libexec -DCMAKE_INSTALL_LIBDIR=/nix/store/vas14gsv17kdkap3r9szvjsclanmwq25-sage_attention-torch-ext/lib -DCMAKE_INSTALL_DOCDIR=/nix/store/vas14gsv17kdkap3r9szvjsclanmwq25-sage_attention-torch-ext/share/doc/sage_attention -DCMAKE_INSTALL_INFODIR=/nix/store/vas14gsv17kdkap3r9szvjsclanmwq25-sage_attention-torch-ext/share/info -DCMAKE_INSTALL_MANDIR=/nix/store/vas14gsv17kdkap3r9szvjsclanmwq25-sage_attention-torch-ext/share/man -DCMAKE_INSTALL_INCLUDEDIR=/nix/store/vas14gsv17kdkap3r9szvjsclanmwq25-sage_attention-torch-ext/include -DCMAKE_INSTALL_SBINDIR=/nix/store/vas14gsv17kdkap3r9szvjsclanmwq25-sage_attention-torch-ext/sbin -DCMAKE_INSTALL_BINDIR=/nix/store/vas14gsv17kdkap3r9szvjsclanmwq25-sage_attention-torch-ext/bin -DCMAKE_INSTALL_NAME_DIR=/nix/store/vas14gsv17kdkap3r9szvjsclanmwq25-sage_attention-torch-ext/lib -DCMAKE_POLICY_DEFAULT_CMP0025=NEW -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_STRIP=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/strip -DCMAKE_RANLIB=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/ranlib -DCMAKE_AR=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/ar -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_INSTALL_PREFIX=/nix/store/vas14gsv17kdkap3r9szvjsclanmwq25-sage_attention-torch-ext 
-DPython_EXECUTABLE:STRING=/nix/store/aikr517kmcd8r2nrrj70jq71d7352qiq-python3-3.13.6-env/bin/python -DCMAKE_CUDA_HOST_COMPILER:STRING=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/g++ -DNVCC_THREADS=3 -DCUDAToolkit_INCLUDE_DIR=/nix/store/kky5wd8qwb0hx3jb3j9qc1bkwznw3z83-libcusparse-12.5.10.65-dev/include\;/nix/store/dd8wl3nnsigw2gj5bwaiswla97jpw1jz-libcublas-12.9.1.4-dev/include\;/nix/store/zsmc0yjbjrfbamm9ycrlz5yzi5hrbag1-libcurand-10.3.10.19-dev/include\;/nix/store/ip4lb9ximc445dbdkdvia4whx83g00g3-libcusolver-11.7.5.82-dev/include\;/nix/store/81xppf0rrqfasvg7wy4z891ab473nb9v-libcufile-1.14.1.1-dev/include\;/nix/store/nkvyh0qxbfj2wbm3r800xd6x1fhs1s4x-cuda_cccl-12.9.27-dev/include\;/nix/store/ik96pdimvw3bjj8wdr6laxycnn5lpwby-libcufft-11.4.1.4-dev/include\;/nix/store/f9r19xpj8qayy3b74gx3gbjrq0z1aq3b-cuda_nvml_dev-12.9.79-dev/include\;/nix/store/0kycn0pb0x46h16afxw2bjrm1gjq1355-cuda_profiler_api-12.9.79-dev/include\;/nix/store/z2xfln4d3r92hjjihlq5w6hvh5qhpcb4-cudnn-9.11.0.98-dev/include\;/nix/store/x4w41r4jyapqwdghvi6xrpd0mnim4x08-cuda_cudart-12.9.79-dev/include\;/nix/store/8zrv6h6f2cfz34pwq012n4cx2zrv5m1s-cuda_nvcc-12.9.86/include\;/nix/store/f21f8hghg4fiwa2ix29h1zy854p7q4v6-cuda_nvrtc-12.9.86-dev/include\;/nix/store/ns0brisbkgrjyfi16rlyjjgcym4jk6qv-cuda_cupti-12.9.79-dev/include 
-DCUDAToolkit_ROOT=/nix/store/8zrv6h6f2cfz34pwq012n4cx2zrv5m1s-cuda_nvcc-12.9.86\;/nix/store/q2al0drhrl0yxk97xbsjl8d0h25kmsq9-libcurand-10.3.10.19-lib\;/nix/store/ax1ssn45048qbmyy19basgv6q64y5jy0-cuda_cupti-12.9.79\;/nix/store/m09542l6q83flp3asv2r4j3wcbjqksvg-libcufile-1.14.1.1-static\;/nix/store/b3wbcra9cziq8bwf3yhmj2nn1mf5bqy2-cuda_cudart-12.9.79-lib\;/nix/store/j5kp5fg9mn6hhslk18wbmskc7v96l353-cuda_cupti-12.9.79-static\;/nix/store/kky5wd8qwb0hx3jb3j9qc1bkwznw3z83-libcusparse-12.5.10.65-dev\;/nix/store/dd8wl3nnsigw2gj5bwaiswla97jpw1jz-libcublas-12.9.1.4-dev\;/nix/store/zsmc0yjbjrfbamm9ycrlz5yzi5hrbag1-libcurand-10.3.10.19-dev\;/nix/store/3s79bz4ldkhlks6jf9a2jd4r34y6018b-libcurand-10.3.10.19\;/nix/store/v48xzq66pzmygxqkws17n9nvpa7lad9d-cuda_nvml_dev-12.9.79\;/nix/store/6via2axi1n31n685jii6dwaiqca8b2rc-cuda_nvcc-12.9.86-static\;/nix/store/v0hx9fqdlmz9kvjd9sqr2zc141ny10yn-cuda_profiler_api-12.9.79\;/nix/store/ip4lb9ximc445dbdkdvia4whx83g00g3-libcusolver-11.7.5.82-dev\;/nix/store/8cig7k11qv5g8x0j8n2mbdfzwrnf7cg2-cuda_cudart-12.9.79-stubs\;/nix/store/xg8pj5m74n2h3v8kgxbvmbpcl90rzmlx-cudnn-9.11.0.98-static\;/nix/store/v4b7mkhyq1akczzkcyynj7y9c61l9dc7-cuda_cudart-12.9.79-static\;/nix/store/hw2swakbrvi4innrymcw8i2m98p73br0-cuda_cupti-12.9.79-sample\;/nix/store/s1i2kadnni2m4skpzzqzfzc3bpmrxi7p-libcusparse-12.5.10.65-lib\;/nix/store/81xppf0rrqfasvg7wy4z891ab473nb9v-libcufile-1.14.1.1-dev\;/nix/store/0a83zdhkh2i9d97r4zqdn8fi8vn4wfk3-libcublas-12.9.1.4-static\;/nix/store/nkvyh0qxbfj2wbm3r800xd6x1fhs1s4x-cuda_cccl-12.9.27-dev\;/nix/store/jnhjz87sm9nbnb72n54jj2l99szrzpg2-libcusparse-12.5.10.65\;/nix/store/ik96pdimvw3bjj8wdr6laxycnn5lpwby-libcufft-11.4.1.4-dev\;/nix/store/d1m6c5i6y6ncjygpdmv1b4pmd91hvjr2-cuda_cupti-12.9.79-lib\;/nix/store/49p6af3v11dcxvq9andr6l8csa2sr4j4-cuda_nvrtc-12.9.86-static\;/nix/store/bfygrgghga26l7br5d5j3h6hd1s21rkn-cudnn-9.11.0.98\;/nix/store/a6an9chi5dvjsybrfrxql0bn76xswzpa-libcufft-11.4.1.4\;/nix/store/f9r19xpj8qayy3b74gx3gbjrq0z1aq3b-cuda_nvml_dev-12
.9.79-dev\;/nix/store/7zy91byrxpnyzhjlwham2gqyir2x6f54-libcusolver-11.7.5.82-lib\;/nix/store/0kycn0pb0x46h16afxw2bjrm1gjq1355-cuda_profiler_api-12.9.79-dev\;/nix/store/cx0hyla7fkqqc5hh1gn4hkarjyjvbjhf-libcusparse-12.5.10.65-static\;/nix/store/3yi8kx62nklnyn77zn4z23hi03l9c7ff-libcusolver-11.7.5.82-static\;/nix/store/z2xfln4d3r92hjjihlq5w6hvh5qhpcb4-cudnn-9.11.0.98-dev\;/nix/store/86nq76ks8vlgjdsnh1hkskyfw7mm3plc-cuda_cccl-12.9.27\;/nix/store/01ywykdxfkvp64318anifgx7zaavz9ql-cuda_nvml_dev-12.9.79-lib\;/nix/store/qv2m9i0nby2p03xx37mkkm84dlqb9s84-cuda_cudart-12.9.79\;/nix/store/a09saq5rl5jxbgv9gqllx0080ypjk00x-libcufile-1.14.1.1-lib\;/nix/store/0l18n4dhavr0p4rk0nyqqjr8paacak13-libcufile-1.14.1.1\;/nix/store/r8ly0w88qv4gw3lhd784ha0ag221c23s-cuda_nvrtc-12.9.86-lib\;/nix/store/rngn6cls1blhilrw78xb3pjgwghibhzk-libcurand-10.3.10.19-static\;/nix/store/x4w41r4jyapqwdghvi6xrpd0mnim4x08-cuda_cudart-12.9.79-dev\;/nix/store/ikw7sqic4kknjkp50dr54khgs06q1hbv-cuda_nvml_dev-12.9.79-static\;/nix/store/bzdnjn29xj8a73wg16qrz0sswi9svp0x-libcublas-12.9.1.4\;/nix/store/62hqkwasnanq5i1j63z4clc0s4c61k1r-libcufft-11.4.1.4-static\;/nix/store/5sjldyn2vmm4ky24v1f9ggs0hps496q3-libcusolver-11.7.5.82\;/nix/store/9c924z3749bfm078bwq4ad12kjz46pjf-libcufft-11.4.1.4-lib\;/nix/store/f21f8hghg4fiwa2ix29h1zy854p7q4v6-cuda_nvrtc-12.9.86-dev\;/nix/store/c1kdvq8xqqkwzzazl99w20h4x9z0f9pc-libcublas-12.9.1.4-lib\;/nix/store/ns0brisbkgrjyfi16rlyjjgcym4jk6qv-cuda_cupti-12.9.79-dev\;/nix/store/h6kzw3gvlv4sa0apb4fflpjlirhj72ga-cudnn-9.11.0.98-lib\;/nix/store/f5gvpjis5y727lw6vzr2h1zkb3hm08k2-cuda_nvrtc-12.9.86 -DPROTOC_EXE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DProtobuf_PROTOC_EXE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DProtobuf_PROTOC_EXECUTABLE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DPYBIND11_PYTHONLIBS_OVERWRITE=OFF -DPYTHON_EXECUTABLE=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/bin/python3.13 
-DPYTHON_INCLUDE_DIR=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/include/python3.13 -DPYTHON_SITE_PACKAGES=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/lib/python3.13/site-packages sage_attention-torch-ext> fixing cmake files... -sage_attention-torch-ext> cmake flags: -GNinja -DCMAKE_FIND_USE_SYSTEM_PACKAGE_REGISTRY=OFF -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=OFF -DCMAKE_INSTALL_LOCALEDIR=/nix/store/6jmb39fj6a2hjg90bjmrxna5vkivwy8n-sage_attention-torch-ext/share/locale -DCMAKE_INSTALL_LIBEXECDIR=/nix/store/6jmb39fj6a2hjg90bjmrxna5vkivwy8n-sage_attention-torch-ext/libexec -DCMAKE_INSTALL_LIBDIR=/nix/store/6jmb39fj6a2hjg90bjmrxna5vkivwy8n-sage_attention-torch-ext/lib -DCMAKE_INSTALL_DOCDIR=/nix/store/6jmb39fj6a2hjg90bjmrxna5vkivwy8n-sage_attention-torch-ext/share/doc/sage_attention -DCMAKE_INSTALL_INFODIR=/nix/store/6jmb39fj6a2hjg90bjmrxna5vkivwy8n-sage_attention-torch-ext/share/info -DCMAKE_INSTALL_MANDIR=/nix/store/6jmb39fj6a2hjg90bjmrxna5vkivwy8n-sage_attention-torch-ext/share/man -DCMAKE_INSTALL_INCLUDEDIR=/nix/store/6jmb39fj6a2hjg90bjmrxna5vkivwy8n-sage_attention-torch-ext/include -DCMAKE_INSTALL_SBINDIR=/nix/store/6jmb39fj6a2hjg90bjmrxna5vkivwy8n-sage_attention-torch-ext/sbin -DCMAKE_INSTALL_BINDIR=/nix/store/6jmb39fj6a2hjg90bjmrxna5vkivwy8n-sage_attention-torch-ext/bin -DCMAKE_INSTALL_NAME_DIR=/nix/store/6jmb39fj6a2hjg90bjmrxna5vkivwy8n-sage_attention-torch-ext/lib -DCMAKE_POLICY_DEFAULT_CMP0025=NEW -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_STRIP=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/strip -DCMAKE_RANLIB=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/ranlib -DCMAKE_AR=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/ar -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_INSTALL_PREFIX=/nix/store/6jmb39fj6a2hjg90bjmrxna5vkivwy8n-sage_attention-torch-ext 
-DPython_EXECUTABLE:STRING=/nix/store/qal2apcjwlw2p2kk05dwqdgzh8ml687l-python3-3.13.6-env/bin/python -DCMAKE_CUDA_HOST_COMPILER:STRING=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/g++ -DNVCC_THREADS=3 -DCUDAToolkit_INCLUDE_DIR=/nix/store/2dc9bgppqvyd6bd5m4j9zphiyhhd39lv-libcurand-10.3.9.90-dev/include\;/nix/store/x6d389mfn7v413ia2had715g7rdgghgm-cuda_nvrtc-12.8.93-dev/include\;/nix/store/4sz65s9xk80q9jij0i4zbp9xd1pmr3ja-libcusparse-12.5.8.93-dev/include\;/nix/store/11bshw90q985bpd9ds649qmgg0x54q7x-cudnn-9.11.0.98-dev/include\;/nix/store/8dwjdyr7y3dkqlgswpn9swz884lx62gf-cuda_cccl-12.8.90-dev/include\;/nix/store/4cq7zkla3djm6g5gkpzzx4gfikda2k7z-cuda_profiler_api-12.8.90-dev/include\;/nix/store/90nghg4zsrw6gki8y8hw4id3p31bc8rk-libcusolver-11.7.3.90-dev/include\;/nix/store/vg32acb8vlqyhkhabbgvmralfw0kwhi3-cuda_cudart-12.8.90-dev/include\;/nix/store/vqg4r8izl1fy2smmw4dwv4x1adkj0rfb-libcufft-11.3.3.83-dev/include\;/nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93/include\;/nix/store/5pvax5f2dg278j43b4llkdxim9y0bjaf-cuda_nvml_dev-12.8.90-dev/include\;/nix/store/mps4gsnyk6s676zadvcykjxn08yghk5a-libcufile-1.13.1.3-dev/include\;/nix/store/gz9xyhflw755r8fcxkc816fp54sj0hl4-cuda_cupti-12.8.90-dev/include\;/nix/store/qa4d2v0lsm6giyr4b4421qsdygz0yrrh-libcublas-12.8.4.1-dev/include 
-DCUDAToolkit_ROOT=/nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93\;/nix/store/w96jlfiy431jnsww1x3ak3chhssa3i2s-libcusparse-12.5.8.93\;/nix/store/6zj6v3b9v8xdjs94iq1228slqwr757ij-libcublas-12.8.4.1\;/nix/store/q85pndpvaqdznfijmkn0mlfp8y3v08dl-cuda_cccl-12.8.90\;/nix/store/2dc9bgppqvyd6bd5m4j9zphiyhhd39lv-libcurand-10.3.9.90-dev\;/nix/store/cwy7010iwla9b2v1fx82sp66v12r913x-libcublas-12.8.4.1-lib\;/nix/store/x6d389mfn7v413ia2had715g7rdgghgm-cuda_nvrtc-12.8.93-dev\;/nix/store/22n25ss46s0hgspdp26qk025w9m393cd-libcublas-12.8.4.1-static\;/nix/store/sc5wnfvmk0j73xdppxj25kgk8s98lscs-cuda_nvrtc-12.8.93-lib\;/nix/store/54wqrrh6qbrwmv2wkz6b216ljrqbhcji-cudnn-9.11.0.98\;/nix/store/4sz65s9xk80q9jij0i4zbp9xd1pmr3ja-libcusparse-12.5.8.93-dev\;/nix/store/11bshw90q985bpd9ds649qmgg0x54q7x-cudnn-9.11.0.98-dev\;/nix/store/8dwjdyr7y3dkqlgswpn9swz884lx62gf-cuda_cccl-12.8.90-dev\;/nix/store/1v8m3gdw08hnbs7qa4jbkflm9lg1r5q6-libcurand-10.3.9.90\;/nix/store/jc58pv1cxhvpblrnzgaai60x04q6m0bp-cuda_nvml_dev-12.8.90-lib\;/nix/store/khwhv5d4kmzjpsm785iz3sva6i9sj9r5-libcufile-1.13.1.3-static\;/nix/store/xv6c2jcc3adyqks2xl28p4r0q1g4bc92-cuda_cupti-12.8.90\;/nix/store/a2h2yfjfx0si8smnqmghw7ccj0qbnv81-cuda_cupti-12.8.90-lib\;/nix/store/4cq7zkla3djm6g5gkpzzx4gfikda2k7z-cuda_profiler_api-12.8.90-dev\;/nix/store/xccbzbpcn8r506zdvhvbkqkilhlrh3c5-cuda_cudart-12.8.90-lib\;/nix/store/acbir62i1d7kvka4plmxsq8442z7r1l2-cuda_cudart-12.8.90-stubs\;/nix/store/ckkcbggf4x93zg3xn9xr00jgxs2x5p21-cuda_nvml_dev-12.8.90-static\;/nix/store/ml3bkm8bz1lnjmfd8lyxbjqpi1llasr2-libcusolver-11.7.3.90\;/nix/store/9zlrjnq7lisarny3llszk131vy816x2w-libcufile-1.13.1.3\;/nix/store/90nghg4zsrw6gki8y8hw4id3p31bc8rk-libcusolver-11.7.3.90-dev\;/nix/store/vg32acb8vlqyhkhabbgvmralfw0kwhi3-cuda_cudart-12.8.90-dev\;/nix/store/y27d2s3rcw8d17wcw23glhlj5rhs8d6y-cuda_cudart-12.8.90\;/nix/store/n96pib9yj31n031dmrrx43m61js1r5rn-cuda_nvcc-12.8.93-static\;/nix/store/pabakly3280dnghh3i89wklfm61raf7z-cuda_cupti-12.8.90-sample\;/nix/store/
l0jiwp1f0dhigd41qqf408c5qyabz2vd-cudnn-9.11.0.98-static\;/nix/store/95lzbxp68m127n6hyllbr3dh2mlj7y8m-libcufft-11.3.3.83\;/nix/store/lxsd5l6hnqcfgqc1nsn8mmmpx385m3k8-libcusparse-12.5.8.93-lib\;/nix/store/vqg4r8izl1fy2smmw4dwv4x1adkj0rfb-libcufft-11.3.3.83-dev\;/nix/store/4b9rdinnksj1856siw3qmwi9f10480ii-cuda_nvrtc-12.8.93-static\;/nix/store/qh7zggir1ikzh3kvkhi2mqzpyisl4153-libcurand-10.3.9.90-static\;/nix/store/n25l4gcpw8cry4rg2a4c9jw3f53i65zd-libcusolver-11.7.3.90-lib\;/nix/store/xh73kc8spwfvd6w6wc63pyq3zm6qlrja-cuda_nvml_dev-12.8.90\;/nix/store/bgiqy1z8588hgcdzyh9brhc015w3nii0-libcurand-10.3.9.90-lib\;/nix/store/5pvax5f2dg278j43b4llkdxim9y0bjaf-cuda_nvml_dev-12.8.90-dev\;/nix/store/7lf23alvk7yh64flf2mj6smx66sqyz9d-libcufile-1.13.1.3-lib\;/nix/store/lfqj2ni7r0ir3n840b8r1lh63mnqr0ar-libcusparse-12.5.8.93-static\;/nix/store/qmw5pq21avnfvsk657k0zr4nsgwxa4jm-cuda_cudart-12.8.90-static\;/nix/store/826d39r2b4gwafqsyhvzq2bmqv8ygzrd-cuda_profiler_api-12.8.90\;/nix/store/g52lygjflrsyr6wahpf0rvs3fpna3wq9-cudnn-9.11.0.98-lib\;/nix/store/gxw5c9f7q2f1pmy0g1zyblb8p2p891a4-libcufft-11.3.3.83-lib\;/nix/store/pbsi8w1in7q44z83ndqsaxyzfrr2frgh-cuda_nvrtc-12.8.93\;/nix/store/mps4gsnyk6s676zadvcykjxn08yghk5a-libcufile-1.13.1.3-dev\;/nix/store/mvfnbb1m20fkv2n0j69ky9s9afn8p7h1-libcufft-11.3.3.83-static\;/nix/store/8byjxgnvhcyav2283wcxp752d8280c36-libcusolver-11.7.3.90-static\;/nix/store/gz9xyhflw755r8fcxkc816fp54sj0hl4-cuda_cupti-12.8.90-dev\;/nix/store/jyd8jp3q1d408n8842rb8g6ziviwm7q1-cuda_cupti-12.8.90-static\;/nix/store/qa4d2v0lsm6giyr4b4421qsdygz0yrrh-libcublas-12.8.4.1-dev -DPROTOC_EXE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DProtobuf_PROTOC_EXE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DProtobuf_PROTOC_EXECUTABLE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DPYBIND11_PYTHONLIBS_OVERWRITE=OFF -DPYTHON_EXECUTABLE=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/bin/python3.13 
-DPYTHON_INCLUDE_DIR=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/include/python3.13 -DPYTHON_SITE_PACKAGES=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/lib/python3.13/site-packages
-sage_attention-torch-ext> cmake flags: -GNinja -DCMAKE_FIND_USE_SYSTEM_PACKAGE_REGISTRY=OFF -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=OFF -DCMAKE_INSTALL_LOCALEDIR=/nix/store/x0vcv18dr2mcj5ih9i3aq3nshydimpca-sage_attention-torch-ext/share/locale -DCMAKE_INSTALL_LIBEXECDIR=/nix/store/x0vcv18dr2mcj5ih9i3aq3nshydimpca-sage_attention-torch-ext/libexec -DCMAKE_INSTALL_LIBDIR=/nix/store/x0vcv18dr2mcj5ih9i3aq3nshydimpca-sage_attention-torch-ext/lib -DCMAKE_INSTALL_DOCDIR=/nix/store/x0vcv18dr2mcj5ih9i3aq3nshydimpca-sage_attention-torch-ext/share/doc/sage_attention -DCMAKE_INSTALL_INFODIR=/nix/store/x0vcv18dr2mcj5ih9i3aq3nshydimpca-sage_attention-torch-ext/share/info -DCMAKE_INSTALL_MANDIR=/nix/store/x0vcv18dr2mcj5ih9i3aq3nshydimpca-sage_attention-torch-ext/share/man -DCMAKE_INSTALL_INCLUDEDIR=/nix/store/x0vcv18dr2mcj5ih9i3aq3nshydimpca-sage_attention-torch-ext/include -DCMAKE_INSTALL_SBINDIR=/nix/store/x0vcv18dr2mcj5ih9i3aq3nshydimpca-sage_attention-torch-ext/sbin -DCMAKE_INSTALL_BINDIR=/nix/store/x0vcv18dr2mcj5ih9i3aq3nshydimpca-sage_attention-torch-ext/bin -DCMAKE_INSTALL_NAME_DIR=/nix/store/x0vcv18dr2mcj5ih9i3aq3nshydimpca-sage_attention-torch-ext/lib -DCMAKE_POLICY_DEFAULT_CMP0025=NEW -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_STRIP=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/strip -DCMAKE_RANLIB=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/ranlib -DCMAKE_AR=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/ar -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_INSTALL_PREFIX=/nix/store/x0vcv18dr2mcj5ih9i3aq3nshydimpca-sage_attention-torch-ext
-DPython_EXECUTABLE:STRING=/nix/store/wirj6dihrpcch7idfd7jy4l0hqfsgkk1-python3-3.13.6-env/bin/python -DCMAKE_CUDA_HOST_COMPILER:STRING=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/g++ -DNVCC_THREADS=3 -DCUDAToolkit_INCLUDE_DIR=/nix/store/2dc9bgppqvyd6bd5m4j9zphiyhhd39lv-libcurand-10.3.9.90-dev/include\;/nix/store/x6d389mfn7v413ia2had715g7rdgghgm-cuda_nvrtc-12.8.93-dev/include\;/nix/store/4sz65s9xk80q9jij0i4zbp9xd1pmr3ja-libcusparse-12.5.8.93-dev/include\;/nix/store/11bshw90q985bpd9ds649qmgg0x54q7x-cudnn-9.11.0.98-dev/include\;/nix/store/8dwjdyr7y3dkqlgswpn9swz884lx62gf-cuda_cccl-12.8.90-dev/include\;/nix/store/4cq7zkla3djm6g5gkpzzx4gfikda2k7z-cuda_profiler_api-12.8.90-dev/include\;/nix/store/90nghg4zsrw6gki8y8hw4id3p31bc8rk-libcusolver-11.7.3.90-dev/include\;/nix/store/vg32acb8vlqyhkhabbgvmralfw0kwhi3-cuda_cudart-12.8.90-dev/include\;/nix/store/vqg4r8izl1fy2smmw4dwv4x1adkj0rfb-libcufft-11.3.3.83-dev/include\;/nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93/include\;/nix/store/5pvax5f2dg278j43b4llkdxim9y0bjaf-cuda_nvml_dev-12.8.90-dev/include\;/nix/store/klis291y7cza60yzgkxzbid80bnyshmr-cuda_nvtx-12.8.90-dev/include\;/nix/store/mps4gsnyk6s676zadvcykjxn08yghk5a-libcufile-1.13.1.3-dev/include\;/nix/store/gz9xyhflw755r8fcxkc816fp54sj0hl4-cuda_cupti-12.8.90-dev/include\;/nix/store/qa4d2v0lsm6giyr4b4421qsdygz0yrrh-libcublas-12.8.4.1-dev/include 
-DCUDAToolkit_ROOT=/nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93\;/nix/store/w96jlfiy431jnsww1x3ak3chhssa3i2s-libcusparse-12.5.8.93\;/nix/store/6zj6v3b9v8xdjs94iq1228slqwr757ij-libcublas-12.8.4.1\;/nix/store/q85pndpvaqdznfijmkn0mlfp8y3v08dl-cuda_cccl-12.8.90\;/nix/store/2dc9bgppqvyd6bd5m4j9zphiyhhd39lv-libcurand-10.3.9.90-dev\;/nix/store/cwy7010iwla9b2v1fx82sp66v12r913x-libcublas-12.8.4.1-lib\;/nix/store/x6d389mfn7v413ia2had715g7rdgghgm-cuda_nvrtc-12.8.93-dev\;/nix/store/22n25ss46s0hgspdp26qk025w9m393cd-libcublas-12.8.4.1-static\;/nix/store/sc5wnfvmk0j73xdppxj25kgk8s98lscs-cuda_nvrtc-12.8.93-lib\;/nix/store/54wqrrh6qbrwmv2wkz6b216ljrqbhcji-cudnn-9.11.0.98\;/nix/store/4sz65s9xk80q9jij0i4zbp9xd1pmr3ja-libcusparse-12.5.8.93-dev\;/nix/store/11bshw90q985bpd9ds649qmgg0x54q7x-cudnn-9.11.0.98-dev\;/nix/store/8dwjdyr7y3dkqlgswpn9swz884lx62gf-cuda_cccl-12.8.90-dev\;/nix/store/1v8m3gdw08hnbs7qa4jbkflm9lg1r5q6-libcurand-10.3.9.90\;/nix/store/jc58pv1cxhvpblrnzgaai60x04q6m0bp-cuda_nvml_dev-12.8.90-lib\;/nix/store/khwhv5d4kmzjpsm785iz3sva6i9sj9r5-libcufile-1.13.1.3-static\;/nix/store/xv6c2jcc3adyqks2xl28p4r0q1g4bc92-cuda_cupti-12.8.90\;/nix/store/a2h2yfjfx0si8smnqmghw7ccj0qbnv81-cuda_cupti-12.8.90-lib\;/nix/store/4cq7zkla3djm6g5gkpzzx4gfikda2k7z-cuda_profiler_api-12.8.90-dev\;/nix/store/5f6dvklv5d0mvygrrf0vzp0smcn7kk01-cuda_nvtx-12.8.90\;/nix/store/xccbzbpcn8r506zdvhvbkqkilhlrh3c5-cuda_cudart-12.8.90-lib\;/nix/store/acbir62i1d7kvka4plmxsq8442z7r1l2-cuda_cudart-12.8.90-stubs\;/nix/store/ckkcbggf4x93zg3xn9xr00jgxs2x5p21-cuda_nvml_dev-12.8.90-static\;/nix/store/ml3bkm8bz1lnjmfd8lyxbjqpi1llasr2-libcusolver-11.7.3.90\;/nix/store/9zlrjnq7lisarny3llszk131vy816x2w-libcufile-1.13.1.3\;/nix/store/90nghg4zsrw6gki8y8hw4id3p31bc8rk-libcusolver-11.7.3.90-dev\;/nix/store/vg32acb8vlqyhkhabbgvmralfw0kwhi3-cuda_cudart-12.8.90-dev\;/nix/store/y27d2s3rcw8d17wcw23glhlj5rhs8d6y-cuda_cudart-12.8.90\;/nix/store/wa9pr3485k3mw8jhv7i9kfzjrqmdl5bb-cuda_nvtx-12.8.90-lib\;/nix/store/n96pib9yj31
n031dmrrx43m61js1r5rn-cuda_nvcc-12.8.93-static\;/nix/store/pabakly3280dnghh3i89wklfm61raf7z-cuda_cupti-12.8.90-sample\;/nix/store/l0jiwp1f0dhigd41qqf408c5qyabz2vd-cudnn-9.11.0.98-static\;/nix/store/95lzbxp68m127n6hyllbr3dh2mlj7y8m-libcufft-11.3.3.83\;/nix/store/lxsd5l6hnqcfgqc1nsn8mmmpx385m3k8-libcusparse-12.5.8.93-lib\;/nix/store/vqg4r8izl1fy2smmw4dwv4x1adkj0rfb-libcufft-11.3.3.83-dev\;/nix/store/4b9rdinnksj1856siw3qmwi9f10480ii-cuda_nvrtc-12.8.93-static\;/nix/store/qh7zggir1ikzh3kvkhi2mqzpyisl4153-libcurand-10.3.9.90-static\;/nix/store/n25l4gcpw8cry4rg2a4c9jw3f53i65zd-libcusolver-11.7.3.90-lib\;/nix/store/xh73kc8spwfvd6w6wc63pyq3zm6qlrja-cuda_nvml_dev-12.8.90\;/nix/store/bgiqy1z8588hgcdzyh9brhc015w3nii0-libcurand-10.3.9.90-lib\;/nix/store/5pvax5f2dg278j43b4llkdxim9y0bjaf-cuda_nvml_dev-12.8.90-dev\;/nix/store/7lf23alvk7yh64flf2mj6smx66sqyz9d-libcufile-1.13.1.3-lib\;/nix/store/klis291y7cza60yzgkxzbid80bnyshmr-cuda_nvtx-12.8.90-dev\;/nix/store/lfqj2ni7r0ir3n840b8r1lh63mnqr0ar-libcusparse-12.5.8.93-static\;/nix/store/qmw5pq21avnfvsk657k0zr4nsgwxa4jm-cuda_cudart-12.8.90-static\;/nix/store/826d39r2b4gwafqsyhvzq2bmqv8ygzrd-cuda_profiler_api-12.8.90\;/nix/store/g52lygjflrsyr6wahpf0rvs3fpna3wq9-cudnn-9.11.0.98-lib\;/nix/store/gxw5c9f7q2f1pmy0g1zyblb8p2p891a4-libcufft-11.3.3.83-lib\;/nix/store/pbsi8w1in7q44z83ndqsaxyzfrr2frgh-cuda_nvrtc-12.8.93\;/nix/store/mps4gsnyk6s676zadvcykjxn08yghk5a-libcufile-1.13.1.3-dev\;/nix/store/mvfnbb1m20fkv2n0j69ky9s9afn8p7h1-libcufft-11.3.3.83-static\;/nix/store/8byjxgnvhcyav2283wcxp752d8280c36-libcusolver-11.7.3.90-static\;/nix/store/gz9xyhflw755r8fcxkc816fp54sj0hl4-cuda_cupti-12.8.90-dev\;/nix/store/jyd8jp3q1d408n8842rb8g6ziviwm7q1-cuda_cupti-12.8.90-static\;/nix/store/qa4d2v0lsm6giyr4b4421qsdygz0yrrh-libcublas-12.8.4.1-dev -DPROTOC_EXE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DProtobuf_PROTOC_EXE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc 
-DProtobuf_PROTOC_EXECUTABLE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DPYBIND11_PYTHONLIBS_OVERWRITE=OFF -DPYTHON_EXECUTABLE=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/bin/python3.13 -DPYTHON_INCLUDE_DIR=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/include/python3.13 -DPYTHON_SITE_PACKAGES=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/lib/python3.13/site-packages
-sage_attention-torch-ext> -- The CXX compiler identification is GNU 13.4.0
-sage_attention-torch-ext> -- Detecting CXX compiler ABI info
-sage_attention-torch-ext> -- The CXX compiler identification is GNU 13.4.0
-sage_attention-torch-ext> -- Detecting CXX compiler ABI info
+sage_attention-torch-ext> cmake flags: -GNinja -DCMAKE_FIND_USE_SYSTEM_PACKAGE_REGISTRY=OFF -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=OFF -DCMAKE_INSTALL_LOCALEDIR=/nix/store/zrx3aflqjvr10nv91lgyfynpa623nsha-sage_attention-torch-ext/share/locale -DCMAKE_INSTALL_LIBEXECDIR=/nix/store/zrx3aflqjvr10nv91lgyfynpa623nsha-sage_attention-torch-ext/libexec -DCMAKE_INSTALL_LIBDIR=/nix/store/zrx3aflqjvr10nv91lgyfynpa623nsha-sage_attention-torch-ext/lib -DCMAKE_INSTALL_DOCDIR=/nix/store/zrx3aflqjvr10nv91lgyfynpa623nsha-sage_attention-torch-ext/share/doc/sage_attention -DCMAKE_INSTALL_INFODIR=/nix/store/zrx3aflqjvr10nv91lgyfynpa623nsha-sage_attention-torch-ext/share/info -DCMAKE_INSTALL_MANDIR=/nix/store/zrx3aflqjvr10nv91lgyfynpa623nsha-sage_attention-torch-ext/share/man -DCMAKE_INSTALL_INCLUDEDIR=/nix/store/zrx3aflqjvr10nv91lgyfynpa623nsha-sage_attention-torch-ext/include -DCMAKE_INSTALL_SBINDIR=/nix/store/zrx3aflqjvr10nv91lgyfynpa623nsha-sage_attention-torch-ext/sbin -DCMAKE_INSTALL_BINDIR=/nix/store/zrx3aflqjvr10nv91lgyfynpa623nsha-sage_attention-torch-ext/bin -DCMAKE_INSTALL_NAME_DIR=/nix/store/zrx3aflqjvr10nv91lgyfynpa623nsha-sage_attention-torch-ext/lib -DCMAKE_POLICY_DEFAULT_CMP0025=NEW
-DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_STRIP=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/strip -DCMAKE_RANLIB=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/ranlib -DCMAKE_AR=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/ar -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_INSTALL_PREFIX=/nix/store/zrx3aflqjvr10nv91lgyfynpa623nsha-sage_attention-torch-ext -DPython_EXECUTABLE:STRING=/nix/store/aikr517kmcd8r2nrrj70jq71d7352qiq-python3-3.13.6-env/bin/python -DCMAKE_CUDA_HOST_COMPILER:STRING=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/g++ -DNVCC_THREADS=3 -DCUDAToolkit_INCLUDE_DIR=/nix/store/kky5wd8qwb0hx3jb3j9qc1bkwznw3z83-libcusparse-12.5.10.65-dev/include\;/nix/store/dd8wl3nnsigw2gj5bwaiswla97jpw1jz-libcublas-12.9.1.4-dev/include\;/nix/store/zsmc0yjbjrfbamm9ycrlz5yzi5hrbag1-libcurand-10.3.10.19-dev/include\;/nix/store/ip4lb9ximc445dbdkdvia4whx83g00g3-libcusolver-11.7.5.82-dev/include\;/nix/store/81xppf0rrqfasvg7wy4z891ab473nb9v-libcufile-1.14.1.1-dev/include\;/nix/store/nkvyh0qxbfj2wbm3r800xd6x1fhs1s4x-cuda_cccl-12.9.27-dev/include\;/nix/store/ik96pdimvw3bjj8wdr6laxycnn5lpwby-libcufft-11.4.1.4-dev/include\;/nix/store/f9r19xpj8qayy3b74gx3gbjrq0z1aq3b-cuda_nvml_dev-12.9.79-dev/include\;/nix/store/0kycn0pb0x46h16afxw2bjrm1gjq1355-cuda_profiler_api-12.9.79-dev/include\;/nix/store/z2xfln4d3r92hjjihlq5w6hvh5qhpcb4-cudnn-9.11.0.98-dev/include\;/nix/store/x4w41r4jyapqwdghvi6xrpd0mnim4x08-cuda_cudart-12.9.79-dev/include\;/nix/store/8zrv6h6f2cfz34pwq012n4cx2zrv5m1s-cuda_nvcc-12.9.86/include\;/nix/store/f21f8hghg4fiwa2ix29h1zy854p7q4v6-cuda_nvrtc-12.9.86-dev/include\;/nix/store/ns0brisbkgrjyfi16rlyjjgcym4jk6qv-cuda_cupti-12.9.79-dev/include 
-DCUDAToolkit_ROOT=/nix/store/8zrv6h6f2cfz34pwq012n4cx2zrv5m1s-cuda_nvcc-12.9.86\;/nix/store/q2al0drhrl0yxk97xbsjl8d0h25kmsq9-libcurand-10.3.10.19-lib\;/nix/store/ax1ssn45048qbmyy19basgv6q64y5jy0-cuda_cupti-12.9.79\;/nix/store/m09542l6q83flp3asv2r4j3wcbjqksvg-libcufile-1.14.1.1-static\;/nix/store/b3wbcra9cziq8bwf3yhmj2nn1mf5bqy2-cuda_cudart-12.9.79-lib\;/nix/store/j5kp5fg9mn6hhslk18wbmskc7v96l353-cuda_cupti-12.9.79-static\;/nix/store/kky5wd8qwb0hx3jb3j9qc1bkwznw3z83-libcusparse-12.5.10.65-dev\;/nix/store/dd8wl3nnsigw2gj5bwaiswla97jpw1jz-libcublas-12.9.1.4-dev\;/nix/store/zsmc0yjbjrfbamm9ycrlz5yzi5hrbag1-libcurand-10.3.10.19-dev\;/nix/store/3s79bz4ldkhlks6jf9a2jd4r34y6018b-libcurand-10.3.10.19\;/nix/store/v48xzq66pzmygxqkws17n9nvpa7lad9d-cuda_nvml_dev-12.9.79\;/nix/store/6via2axi1n31n685jii6dwaiqca8b2rc-cuda_nvcc-12.9.86-static\;/nix/store/v0hx9fqdlmz9kvjd9sqr2zc141ny10yn-cuda_profiler_api-12.9.79\;/nix/store/ip4lb9ximc445dbdkdvia4whx83g00g3-libcusolver-11.7.5.82-dev\;/nix/store/8cig7k11qv5g8x0j8n2mbdfzwrnf7cg2-cuda_cudart-12.9.79-stubs\;/nix/store/xg8pj5m74n2h3v8kgxbvmbpcl90rzmlx-cudnn-9.11.0.98-static\;/nix/store/v4b7mkhyq1akczzkcyynj7y9c61l9dc7-cuda_cudart-12.9.79-static\;/nix/store/hw2swakbrvi4innrymcw8i2m98p73br0-cuda_cupti-12.9.79-sample\;/nix/store/s1i2kadnni2m4skpzzqzfzc3bpmrxi7p-libcusparse-12.5.10.65-lib\;/nix/store/81xppf0rrqfasvg7wy4z891ab473nb9v-libcufile-1.14.1.1-dev\;/nix/store/0a83zdhkh2i9d97r4zqdn8fi8vn4wfk3-libcublas-12.9.1.4-static\;/nix/store/nkvyh0qxbfj2wbm3r800xd6x1fhs1s4x-cuda_cccl-12.9.27-dev\;/nix/store/jnhjz87sm9nbnb72n54jj2l99szrzpg2-libcusparse-12.5.10.65\;/nix/store/ik96pdimvw3bjj8wdr6laxycnn5lpwby-libcufft-11.4.1.4-dev\;/nix/store/d1m6c5i6y6ncjygpdmv1b4pmd91hvjr2-cuda_cupti-12.9.79-lib\;/nix/store/49p6af3v11dcxvq9andr6l8csa2sr4j4-cuda_nvrtc-12.9.86-static\;/nix/store/bfygrgghga26l7br5d5j3h6hd1s21rkn-cudnn-9.11.0.98\;/nix/store/a6an9chi5dvjsybrfrxql0bn76xswzpa-libcufft-11.4.1.4\;/nix/store/f9r19xpj8qayy3b74gx3gbjrq0z1aq3b-cuda_nvml_dev-12
.9.79-dev\;/nix/store/7zy91byrxpnyzhjlwham2gqyir2x6f54-libcusolver-11.7.5.82-lib\;/nix/store/0kycn0pb0x46h16afxw2bjrm1gjq1355-cuda_profiler_api-12.9.79-dev\;/nix/store/cx0hyla7fkqqc5hh1gn4hkarjyjvbjhf-libcusparse-12.5.10.65-static\;/nix/store/3yi8kx62nklnyn77zn4z23hi03l9c7ff-libcusolver-11.7.5.82-static\;/nix/store/z2xfln4d3r92hjjihlq5w6hvh5qhpcb4-cudnn-9.11.0.98-dev\;/nix/store/86nq76ks8vlgjdsnh1hkskyfw7mm3plc-cuda_cccl-12.9.27\;/nix/store/01ywykdxfkvp64318anifgx7zaavz9ql-cuda_nvml_dev-12.9.79-lib\;/nix/store/qv2m9i0nby2p03xx37mkkm84dlqb9s84-cuda_cudart-12.9.79\;/nix/store/a09saq5rl5jxbgv9gqllx0080ypjk00x-libcufile-1.14.1.1-lib\;/nix/store/0l18n4dhavr0p4rk0nyqqjr8paacak13-libcufile-1.14.1.1\;/nix/store/r8ly0w88qv4gw3lhd784ha0ag221c23s-cuda_nvrtc-12.9.86-lib\;/nix/store/rngn6cls1blhilrw78xb3pjgwghibhzk-libcurand-10.3.10.19-static\;/nix/store/x4w41r4jyapqwdghvi6xrpd0mnim4x08-cuda_cudart-12.9.79-dev\;/nix/store/ikw7sqic4kknjkp50dr54khgs06q1hbv-cuda_nvml_dev-12.9.79-static\;/nix/store/bzdnjn29xj8a73wg16qrz0sswi9svp0x-libcublas-12.9.1.4\;/nix/store/62hqkwasnanq5i1j63z4clc0s4c61k1r-libcufft-11.4.1.4-static\;/nix/store/5sjldyn2vmm4ky24v1f9ggs0hps496q3-libcusolver-11.7.5.82\;/nix/store/9c924z3749bfm078bwq4ad12kjz46pjf-libcufft-11.4.1.4-lib\;/nix/store/f21f8hghg4fiwa2ix29h1zy854p7q4v6-cuda_nvrtc-12.9.86-dev\;/nix/store/c1kdvq8xqqkwzzazl99w20h4x9z0f9pc-libcublas-12.9.1.4-lib\;/nix/store/ns0brisbkgrjyfi16rlyjjgcym4jk6qv-cuda_cupti-12.9.79-dev\;/nix/store/h6kzw3gvlv4sa0apb4fflpjlirhj72ga-cudnn-9.11.0.98-lib\;/nix/store/f5gvpjis5y727lw6vzr2h1zkb3hm08k2-cuda_nvrtc-12.9.86 -DPROTOC_EXE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DProtobuf_PROTOC_EXE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DProtobuf_PROTOC_EXECUTABLE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DPYBIND11_PYTHONLIBS_OVERWRITE=OFF -DPYTHON_EXECUTABLE=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/bin/python3.13 
-DPYTHON_INCLUDE_DIR=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/include/python3.13 -DPYTHON_SITE_PACKAGES=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/lib/python3.13/site-packages
+sage_attention-torch-ext> cmake flags: -GNinja -DCMAKE_FIND_USE_SYSTEM_PACKAGE_REGISTRY=OFF -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=OFF -DCMAKE_INSTALL_LOCALEDIR=/nix/store/ki5ldbx0351svgxhqw7y30n8kbi51l55-sage_attention-torch-ext/share/locale -DCMAKE_INSTALL_LIBEXECDIR=/nix/store/ki5ldbx0351svgxhqw7y30n8kbi51l55-sage_attention-torch-ext/libexec -DCMAKE_INSTALL_LIBDIR=/nix/store/ki5ldbx0351svgxhqw7y30n8kbi51l55-sage_attention-torch-ext/lib -DCMAKE_INSTALL_DOCDIR=/nix/store/ki5ldbx0351svgxhqw7y30n8kbi51l55-sage_attention-torch-ext/share/doc/sage_attention -DCMAKE_INSTALL_INFODIR=/nix/store/ki5ldbx0351svgxhqw7y30n8kbi51l55-sage_attention-torch-ext/share/info -DCMAKE_INSTALL_MANDIR=/nix/store/ki5ldbx0351svgxhqw7y30n8kbi51l55-sage_attention-torch-ext/share/man -DCMAKE_INSTALL_INCLUDEDIR=/nix/store/ki5ldbx0351svgxhqw7y30n8kbi51l55-sage_attention-torch-ext/include -DCMAKE_INSTALL_SBINDIR=/nix/store/ki5ldbx0351svgxhqw7y30n8kbi51l55-sage_attention-torch-ext/sbin -DCMAKE_INSTALL_BINDIR=/nix/store/ki5ldbx0351svgxhqw7y30n8kbi51l55-sage_attention-torch-ext/bin -DCMAKE_INSTALL_NAME_DIR=/nix/store/ki5ldbx0351svgxhqw7y30n8kbi51l55-sage_attention-torch-ext/lib -DCMAKE_POLICY_DEFAULT_CMP0025=NEW -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_STRIP=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/strip -DCMAKE_RANLIB=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/ranlib -DCMAKE_AR=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/ar -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_INSTALL_PREFIX=/nix/store/ki5ldbx0351svgxhqw7y30n8kbi51l55-sage_attention-torch-ext
-DPython_EXECUTABLE:STRING=/nix/store/qal2apcjwlw2p2kk05dwqdgzh8ml687l-python3-3.13.6-env/bin/python -DCMAKE_CUDA_HOST_COMPILER:STRING=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/g++ -DNVCC_THREADS=3 -DCUDAToolkit_INCLUDE_DIR=/nix/store/2dc9bgppqvyd6bd5m4j9zphiyhhd39lv-libcurand-10.3.9.90-dev/include\;/nix/store/x6d389mfn7v413ia2had715g7rdgghgm-cuda_nvrtc-12.8.93-dev/include\;/nix/store/4sz65s9xk80q9jij0i4zbp9xd1pmr3ja-libcusparse-12.5.8.93-dev/include\;/nix/store/11bshw90q985bpd9ds649qmgg0x54q7x-cudnn-9.11.0.98-dev/include\;/nix/store/8dwjdyr7y3dkqlgswpn9swz884lx62gf-cuda_cccl-12.8.90-dev/include\;/nix/store/4cq7zkla3djm6g5gkpzzx4gfikda2k7z-cuda_profiler_api-12.8.90-dev/include\;/nix/store/90nghg4zsrw6gki8y8hw4id3p31bc8rk-libcusolver-11.7.3.90-dev/include\;/nix/store/vg32acb8vlqyhkhabbgvmralfw0kwhi3-cuda_cudart-12.8.90-dev/include\;/nix/store/vqg4r8izl1fy2smmw4dwv4x1adkj0rfb-libcufft-11.3.3.83-dev/include\;/nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93/include\;/nix/store/5pvax5f2dg278j43b4llkdxim9y0bjaf-cuda_nvml_dev-12.8.90-dev/include\;/nix/store/mps4gsnyk6s676zadvcykjxn08yghk5a-libcufile-1.13.1.3-dev/include\;/nix/store/gz9xyhflw755r8fcxkc816fp54sj0hl4-cuda_cupti-12.8.90-dev/include\;/nix/store/qa4d2v0lsm6giyr4b4421qsdygz0yrrh-libcublas-12.8.4.1-dev/include 
-DCUDAToolkit_ROOT=/nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93\;/nix/store/w96jlfiy431jnsww1x3ak3chhssa3i2s-libcusparse-12.5.8.93\;/nix/store/6zj6v3b9v8xdjs94iq1228slqwr757ij-libcublas-12.8.4.1\;/nix/store/q85pndpvaqdznfijmkn0mlfp8y3v08dl-cuda_cccl-12.8.90\;/nix/store/2dc9bgppqvyd6bd5m4j9zphiyhhd39lv-libcurand-10.3.9.90-dev\;/nix/store/cwy7010iwla9b2v1fx82sp66v12r913x-libcublas-12.8.4.1-lib\;/nix/store/x6d389mfn7v413ia2had715g7rdgghgm-cuda_nvrtc-12.8.93-dev\;/nix/store/22n25ss46s0hgspdp26qk025w9m393cd-libcublas-12.8.4.1-static\;/nix/store/sc5wnfvmk0j73xdppxj25kgk8s98lscs-cuda_nvrtc-12.8.93-lib\;/nix/store/54wqrrh6qbrwmv2wkz6b216ljrqbhcji-cudnn-9.11.0.98\;/nix/store/4sz65s9xk80q9jij0i4zbp9xd1pmr3ja-libcusparse-12.5.8.93-dev\;/nix/store/11bshw90q985bpd9ds649qmgg0x54q7x-cudnn-9.11.0.98-dev\;/nix/store/8dwjdyr7y3dkqlgswpn9swz884lx62gf-cuda_cccl-12.8.90-dev\;/nix/store/1v8m3gdw08hnbs7qa4jbkflm9lg1r5q6-libcurand-10.3.9.90\;/nix/store/jc58pv1cxhvpblrnzgaai60x04q6m0bp-cuda_nvml_dev-12.8.90-lib\;/nix/store/khwhv5d4kmzjpsm785iz3sva6i9sj9r5-libcufile-1.13.1.3-static\;/nix/store/xv6c2jcc3adyqks2xl28p4r0q1g4bc92-cuda_cupti-12.8.90\;/nix/store/a2h2yfjfx0si8smnqmghw7ccj0qbnv81-cuda_cupti-12.8.90-lib\;/nix/store/4cq7zkla3djm6g5gkpzzx4gfikda2k7z-cuda_profiler_api-12.8.90-dev\;/nix/store/xccbzbpcn8r506zdvhvbkqkilhlrh3c5-cuda_cudart-12.8.90-lib\;/nix/store/acbir62i1d7kvka4plmxsq8442z7r1l2-cuda_cudart-12.8.90-stubs\;/nix/store/ckkcbggf4x93zg3xn9xr00jgxs2x5p21-cuda_nvml_dev-12.8.90-static\;/nix/store/ml3bkm8bz1lnjmfd8lyxbjqpi1llasr2-libcusolver-11.7.3.90\;/nix/store/9zlrjnq7lisarny3llszk131vy816x2w-libcufile-1.13.1.3\;/nix/store/90nghg4zsrw6gki8y8hw4id3p31bc8rk-libcusolver-11.7.3.90-dev\;/nix/store/vg32acb8vlqyhkhabbgvmralfw0kwhi3-cuda_cudart-12.8.90-dev\;/nix/store/y27d2s3rcw8d17wcw23glhlj5rhs8d6y-cuda_cudart-12.8.90\;/nix/store/n96pib9yj31n031dmrrx43m61js1r5rn-cuda_nvcc-12.8.93-static\;/nix/store/pabakly3280dnghh3i89wklfm61raf7z-cuda_cupti-12.8.90-sample\;/nix/store/
l0jiwp1f0dhigd41qqf408c5qyabz2vd-cudnn-9.11.0.98-static\;/nix/store/95lzbxp68m127n6hyllbr3dh2mlj7y8m-libcufft-11.3.3.83\;/nix/store/lxsd5l6hnqcfgqc1nsn8mmmpx385m3k8-libcusparse-12.5.8.93-lib\;/nix/store/vqg4r8izl1fy2smmw4dwv4x1adkj0rfb-libcufft-11.3.3.83-dev\;/nix/store/4b9rdinnksj1856siw3qmwi9f10480ii-cuda_nvrtc-12.8.93-static\;/nix/store/qh7zggir1ikzh3kvkhi2mqzpyisl4153-libcurand-10.3.9.90-static\;/nix/store/n25l4gcpw8cry4rg2a4c9jw3f53i65zd-libcusolver-11.7.3.90-lib\;/nix/store/xh73kc8spwfvd6w6wc63pyq3zm6qlrja-cuda_nvml_dev-12.8.90\;/nix/store/bgiqy1z8588hgcdzyh9brhc015w3nii0-libcurand-10.3.9.90-lib\;/nix/store/5pvax5f2dg278j43b4llkdxim9y0bjaf-cuda_nvml_dev-12.8.90-dev\;/nix/store/7lf23alvk7yh64flf2mj6smx66sqyz9d-libcufile-1.13.1.3-lib\;/nix/store/lfqj2ni7r0ir3n840b8r1lh63mnqr0ar-libcusparse-12.5.8.93-static\;/nix/store/qmw5pq21avnfvsk657k0zr4nsgwxa4jm-cuda_cudart-12.8.90-static\;/nix/store/826d39r2b4gwafqsyhvzq2bmqv8ygzrd-cuda_profiler_api-12.8.90\;/nix/store/g52lygjflrsyr6wahpf0rvs3fpna3wq9-cudnn-9.11.0.98-lib\;/nix/store/gxw5c9f7q2f1pmy0g1zyblb8p2p891a4-libcufft-11.3.3.83-lib\;/nix/store/pbsi8w1in7q44z83ndqsaxyzfrr2frgh-cuda_nvrtc-12.8.93\;/nix/store/mps4gsnyk6s676zadvcykjxn08yghk5a-libcufile-1.13.1.3-dev\;/nix/store/mvfnbb1m20fkv2n0j69ky9s9afn8p7h1-libcufft-11.3.3.83-static\;/nix/store/8byjxgnvhcyav2283wcxp752d8280c36-libcusolver-11.7.3.90-static\;/nix/store/gz9xyhflw755r8fcxkc816fp54sj0hl4-cuda_cupti-12.8.90-dev\;/nix/store/jyd8jp3q1d408n8842rb8g6ziviwm7q1-cuda_cupti-12.8.90-static\;/nix/store/qa4d2v0lsm6giyr4b4421qsdygz0yrrh-libcublas-12.8.4.1-dev -DPROTOC_EXE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DProtobuf_PROTOC_EXE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DProtobuf_PROTOC_EXECUTABLE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DPYBIND11_PYTHONLIBS_OVERWRITE=OFF -DPYTHON_EXECUTABLE=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/bin/python3.13 
-DPYTHON_INCLUDE_DIR=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/include/python3.13 -DPYTHON_SITE_PACKAGES=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/lib/python3.13/site-packages
+sage_attention-torch-ext> cmake flags: -GNinja -DCMAKE_FIND_USE_SYSTEM_PACKAGE_REGISTRY=OFF -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=OFF -DCMAKE_INSTALL_LOCALEDIR=/nix/store/0pmiqd0nyndanj9rwlfnry7dzb3ad6cs-sage_attention-torch-ext/share/locale -DCMAKE_INSTALL_LIBEXECDIR=/nix/store/0pmiqd0nyndanj9rwlfnry7dzb3ad6cs-sage_attention-torch-ext/libexec -DCMAKE_INSTALL_LIBDIR=/nix/store/0pmiqd0nyndanj9rwlfnry7dzb3ad6cs-sage_attention-torch-ext/lib -DCMAKE_INSTALL_DOCDIR=/nix/store/0pmiqd0nyndanj9rwlfnry7dzb3ad6cs-sage_attention-torch-ext/share/doc/sage_attention -DCMAKE_INSTALL_INFODIR=/nix/store/0pmiqd0nyndanj9rwlfnry7dzb3ad6cs-sage_attention-torch-ext/share/info -DCMAKE_INSTALL_MANDIR=/nix/store/0pmiqd0nyndanj9rwlfnry7dzb3ad6cs-sage_attention-torch-ext/share/man -DCMAKE_INSTALL_INCLUDEDIR=/nix/store/0pmiqd0nyndanj9rwlfnry7dzb3ad6cs-sage_attention-torch-ext/include -DCMAKE_INSTALL_SBINDIR=/nix/store/0pmiqd0nyndanj9rwlfnry7dzb3ad6cs-sage_attention-torch-ext/sbin -DCMAKE_INSTALL_BINDIR=/nix/store/0pmiqd0nyndanj9rwlfnry7dzb3ad6cs-sage_attention-torch-ext/bin -DCMAKE_INSTALL_NAME_DIR=/nix/store/0pmiqd0nyndanj9rwlfnry7dzb3ad6cs-sage_attention-torch-ext/lib -DCMAKE_POLICY_DEFAULT_CMP0025=NEW -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_STRIP=/nix/store/rgfv9lch0b6ksjzlzsx0mljsb0ypqr8x-gcc-wrapper-13.4.0/bin/strip -DCMAKE_RANLIB=/nix/store/rgfv9lch0b6ksjzlzsx0mljsb0ypqr8x-gcc-wrapper-13.4.0/bin/ranlib -DCMAKE_AR=/nix/store/rgfv9lch0b6ksjzlzsx0mljsb0ypqr8x-gcc-wrapper-13.4.0/bin/ar -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_INSTALL_PREFIX=/nix/store/0pmiqd0nyndanj9rwlfnry7dzb3ad6cs-sage_attention-torch-ext
-DPython_EXECUTABLE:STRING=/nix/store/r3gwdvvsgl1csl12f4pkhz0jhsch7bdy-python3-3.13.6-env/bin/python -DCMAKE_CUDA_HOST_COMPILER:STRING=/nix/store/rgfv9lch0b6ksjzlzsx0mljsb0ypqr8x-gcc-wrapper-13.4.0/bin/g++ -DNVCC_THREADS=3 -DCUDAToolkit_INCLUDE_DIR=/nix/store/7iw4ipbdy17yzvqjhxpw03i17kq7f7rj-cuda_nvcc-12.6.85/include\;/nix/store/5f6h6xs5c74iqcjda3y73i290mfwfs9x-cuda_nvml_dev-12.6.77-dev/include\;/nix/store/r26q9f2lhsvimxha44g1xcck3adrdqwg-cuda_nvrtc-12.6.85-dev/include\;/nix/store/9ik1skjb698l6vkx4m4wvx2nrr4sx0na-libcufft-11.3.0.4-dev/include\;/nix/store/vl1dficb0blxzqg6xqzfi5p119jvl2vi-libcusolver-11.7.1.2-dev/include\;/nix/store/n7x9kkzi2jdfj6f6yjwywfhyfmn957zp-cuda_cupti-12.6.80-dev/include\;/nix/store/sskxmb670akk0avrahrl4r6hp7925zh8-cuda_cudart-12.6.77-dev/include\;/nix/store/8a9vz66yzsar01lpgipmzq8skyk3ymkp-cuda_cccl-12.6.77-dev/include\;/nix/store/xd2xrldv3lbg1bk93nr0yccy6j0vhh2k-cudnn-9.11.0.98-dev/include\;/nix/store/0w4g3rxgkw9r0lv737rslqdk7wldmi0n-libcurand-10.3.7.77-dev/include\;/nix/store/m0s4p867fk6wk8ba7ym9yff4mayqjhlw-libcusparse-12.5.4.2-dev/include\;/nix/store/blh9iyvjkmwd871mfjvfhnp7njwgnc6b-cuda_profiler_api-12.6.77-dev/include\;/nix/store/fy71fffqbwg3xgvygn66kd4igj65gblv-libcublas-12.6.4.1-dev/include\;/nix/store/4pwy3k2s52ppzbs3k6d58kda8jhmiim4-libcufile-1.11.1.6-dev/include 
-DCUDAToolkit_ROOT=/nix/store/7iw4ipbdy17yzvqjhxpw03i17kq7f7rj-cuda_nvcc-12.6.85\;/nix/store/1qgrl2sgdj5m7llm2vs9690gd9998psq-cudnn-9.11.0.98\;/nix/store/d2z15dzsgfm4r2yyl16n3wc0sw8z6fia-cuda_cupti-12.6.80-lib\;/nix/store/86ngm5djfbl6a0i43j282680chqz1vr8-libcusparse-12.5.4.2-lib\;/nix/store/bmph9rbyqnyjs02zriwq78kg16h12wi6-libcublas-12.6.4.1-lib\;/nix/store/wny8xmyma0ziffas96ansxgmjfqpw393-cuda_nvrtc-12.6.85-lib\;/nix/store/j40ndiqjiqbiqrbfmgmkzz6w8757cgvk-cuda_nvml_dev-12.6.77-lib\;/nix/store/3ii532blh586xxavim32i21kr84wlcdc-cuda_profiler_api-12.6.77\;/nix/store/j32l8jnzckhdy2lzxgyd59y7p39y6b1d-libcusolver-11.7.1.2-static\;/nix/store/5iv2zpbf4k00ch4c5zfi5b8dlj90y3d3-cuda_cccl-12.6.77\;/nix/store/a8yi28jqv5185bbv10jpjja3x98i86hm-cuda_cudart-12.6.77-stubs\;/nix/store/ya85qn68jv6mlq6gh6phh5hwk3dkynag-cuda_cudart-12.6.77-static\;/nix/store/m65ribrsnk3gbabcx9ah6phgiil19j01-libcufile-1.11.1.6\;/nix/store/5f6h6xs5c74iqcjda3y73i290mfwfs9x-cuda_nvml_dev-12.6.77-dev\;/nix/store/r26q9f2lhsvimxha44g1xcck3adrdqwg-cuda_nvrtc-12.6.85-dev\;/nix/store/9ik1skjb698l6vkx4m4wvx2nrr4sx0na-libcufft-11.3.0.4-dev\;/nix/store/k5rbpivsz3ilsxg91pgigp6la8ln3cv9-cuda_cupti-12.6.80\;/nix/store/vl1dficb0blxzqg6xqzfi5p119jvl2vi-libcusolver-11.7.1.2-dev\;/nix/store/n7x9kkzi2jdfj6f6yjwywfhyfmn957zp-cuda_cupti-12.6.80-dev\;/nix/store/m0fwdgh4nmrjd0q9v4m2ly63qbcq2hi2-cuda_cudart-12.6.77\;/nix/store/qfaxx4b8l1alrrl0gbyb23k3j850c0v5-libcurand-10.3.7.77-static\;/nix/store/w1npzy8mfl28w7cib5idkg6nvlbzhpzq-libcufile-1.11.1.6-lib\;/nix/store/8abbm2gd77dv0l3acw0s18wln36aa0l5-cuda_cudart-12.6.77-lib\;/nix/store/ykb9bv2lqkf1wzy73q96cb04pybx9xa2-cuda_nvcc-12.6.85-static\;/nix/store/nw9ws2qvhgdb33qgfx4iqj517814qq8y-libcufft-11.3.0.4\;/nix/store/sskxmb670akk0avrahrl4r6hp7925zh8-cuda_cudart-12.6.77-dev\;/nix/store/mfc3ah6lwfd8dfbs77b0z9i75c471b0n-libcufft-11.3.0.4-static\;/nix/store/zk3cg1ws6cskrzyhdr5d68f8zrkfk77d-cuda_nvrtc-12.6.85-static\;/nix/store/pcrirrvn2ya5d3r1y18s2zj4pm2jladw-libcusolver-11.7.1.2\;/nix/st
ore/qdn67x8jrwr418air16kwicya4d747pq-libcufft-11.3.0.4-lib\;/nix/store/dg8hyrzy7sh3wdhcr4ywsz05cvl6vfyc-libcusparse-12.5.4.2\;/nix/store/8a9vz66yzsar01lpgipmzq8skyk3ymkp-cuda_cccl-12.6.77-dev\;/nix/store/wmcrrdxd3db58nklyp7yf90kknfdx6b5-libcurand-10.3.7.77-lib\;/nix/store/xd2xrldv3lbg1bk93nr0yccy6j0vhh2k-cudnn-9.11.0.98-dev\;/nix/store/0w4g3rxgkw9r0lv737rslqdk7wldmi0n-libcurand-10.3.7.77-dev\;/nix/store/jr1397g6pshvil5n4lnvp7dm24dm71h8-libcublas-12.6.4.1-static\;/nix/store/wq0wv7df58h6bgggnz964sk8m1hbkxxp-cuda_cupti-12.6.80-sample\;/nix/store/m0s4p867fk6wk8ba7ym9yff4mayqjhlw-libcusparse-12.5.4.2-dev\;/nix/store/blh9iyvjkmwd871mfjvfhnp7njwgnc6b-cuda_profiler_api-12.6.77-dev\;/nix/store/ngwsphsxf906z7cgwg32d1w83p809ywl-cudnn-9.11.0.98-static\;/nix/store/07zlxn68jyf4s263xafnjid55grmi7a2-cuda_nvrtc-12.6.85\;/nix/store/zyh7hqq402zc7dhafhbh9vycyzcfq256-libcurand-10.3.7.77\;/nix/store/x7mww4k0zzzb7bnffv0b22jqbyf1mg3v-cuda_cupti-12.6.80-static\;/nix/store/xvlapjc6spss1kvbjlq97m6pk19hfrxz-cuda_nvml_dev-12.6.77\;/nix/store/7j4zf0r8flh7l4x5pm1mgqb2vcabmcdj-libcusolver-11.7.1.2-lib\;/nix/store/gs8gw8bgjccrjxlyzhxa7h85gkxgqwhn-libcufile-1.11.1.6-static\;/nix/store/p9dnsv7mv8mqm9aisrckq8lm3zs3l7dk-cudnn-9.11.0.98-lib\;/nix/store/fy71fffqbwg3xgvygn66kd4igj65gblv-libcublas-12.6.4.1-dev\;/nix/store/dpska4iiya4xa5zzzmqzx3ljws73bnds-cuda_nvml_dev-12.6.77-static\;/nix/store/gzykkbwmch7pxgfzf86fg0b928lz6b36-libcusparse-12.5.4.2-static\;/nix/store/nqn7lvw8gbwbymdhz4nak9wf9b5bbah9-libcublas-12.6.4.1\;/nix/store/4pwy3k2s52ppzbs3k6d58kda8jhmiim4-libcufile-1.11.1.6-dev -DPROTOC_EXE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DProtobuf_PROTOC_EXE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DProtobuf_PROTOC_EXECUTABLE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DPYBIND11_PYTHONLIBS_OVERWRITE=OFF -DPYTHON_EXECUTABLE=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/bin/python3.13 
-DPYTHON_INCLUDE_DIR=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/include/python3.13 -DPYTHON_SITE_PACKAGES=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/lib/python3.13/site-packages
+sage_attention-torch-ext> cmake flags: -GNinja -DCMAKE_FIND_USE_SYSTEM_PACKAGE_REGISTRY=OFF -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=OFF -DCMAKE_INSTALL_LOCALEDIR=/nix/store/mkd1kn188s2i4xnh80z6397w35dcn0b9-sage_attention-torch-ext/share/locale -DCMAKE_INSTALL_LIBEXECDIR=/nix/store/mkd1kn188s2i4xnh80z6397w35dcn0b9-sage_attention-torch-ext/libexec -DCMAKE_INSTALL_LIBDIR=/nix/store/mkd1kn188s2i4xnh80z6397w35dcn0b9-sage_attention-torch-ext/lib -DCMAKE_INSTALL_DOCDIR=/nix/store/mkd1kn188s2i4xnh80z6397w35dcn0b9-sage_attention-torch-ext/share/doc/sage_attention -DCMAKE_INSTALL_INFODIR=/nix/store/mkd1kn188s2i4xnh80z6397w35dcn0b9-sage_attention-torch-ext/share/info -DCMAKE_INSTALL_MANDIR=/nix/store/mkd1kn188s2i4xnh80z6397w35dcn0b9-sage_attention-torch-ext/share/man -DCMAKE_INSTALL_INCLUDEDIR=/nix/store/mkd1kn188s2i4xnh80z6397w35dcn0b9-sage_attention-torch-ext/include -DCMAKE_INSTALL_SBINDIR=/nix/store/mkd1kn188s2i4xnh80z6397w35dcn0b9-sage_attention-torch-ext/sbin -DCMAKE_INSTALL_BINDIR=/nix/store/mkd1kn188s2i4xnh80z6397w35dcn0b9-sage_attention-torch-ext/bin -DCMAKE_INSTALL_NAME_DIR=/nix/store/mkd1kn188s2i4xnh80z6397w35dcn0b9-sage_attention-torch-ext/lib -DCMAKE_POLICY_DEFAULT_CMP0025=NEW -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_STRIP=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/strip -DCMAKE_RANLIB=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/ranlib -DCMAKE_AR=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/ar -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_INSTALL_PREFIX=/nix/store/mkd1kn188s2i4xnh80z6397w35dcn0b9-sage_attention-torch-ext
-DPython_EXECUTABLE:STRING=/nix/store/wirj6dihrpcch7idfd7jy4l0hqfsgkk1-python3-3.13.6-env/bin/python -DCMAKE_CUDA_HOST_COMPILER:STRING=/nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/g++ -DNVCC_THREADS=3 -DCUDAToolkit_INCLUDE_DIR=/nix/store/2dc9bgppqvyd6bd5m4j9zphiyhhd39lv-libcurand-10.3.9.90-dev/include\;/nix/store/x6d389mfn7v413ia2had715g7rdgghgm-cuda_nvrtc-12.8.93-dev/include\;/nix/store/4sz65s9xk80q9jij0i4zbp9xd1pmr3ja-libcusparse-12.5.8.93-dev/include\;/nix/store/11bshw90q985bpd9ds649qmgg0x54q7x-cudnn-9.11.0.98-dev/include\;/nix/store/8dwjdyr7y3dkqlgswpn9swz884lx62gf-cuda_cccl-12.8.90-dev/include\;/nix/store/4cq7zkla3djm6g5gkpzzx4gfikda2k7z-cuda_profiler_api-12.8.90-dev/include\;/nix/store/90nghg4zsrw6gki8y8hw4id3p31bc8rk-libcusolver-11.7.3.90-dev/include\;/nix/store/vg32acb8vlqyhkhabbgvmralfw0kwhi3-cuda_cudart-12.8.90-dev/include\;/nix/store/vqg4r8izl1fy2smmw4dwv4x1adkj0rfb-libcufft-11.3.3.83-dev/include\;/nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93/include\;/nix/store/5pvax5f2dg278j43b4llkdxim9y0bjaf-cuda_nvml_dev-12.8.90-dev/include\;/nix/store/klis291y7cza60yzgkxzbid80bnyshmr-cuda_nvtx-12.8.90-dev/include\;/nix/store/mps4gsnyk6s676zadvcykjxn08yghk5a-libcufile-1.13.1.3-dev/include\;/nix/store/gz9xyhflw755r8fcxkc816fp54sj0hl4-cuda_cupti-12.8.90-dev/include\;/nix/store/qa4d2v0lsm6giyr4b4421qsdygz0yrrh-libcublas-12.8.4.1-dev/include 
-DCUDAToolkit_ROOT=/nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93\;/nix/store/w96jlfiy431jnsww1x3ak3chhssa3i2s-libcusparse-12.5.8.93\;/nix/store/6zj6v3b9v8xdjs94iq1228slqwr757ij-libcublas-12.8.4.1\;/nix/store/q85pndpvaqdznfijmkn0mlfp8y3v08dl-cuda_cccl-12.8.90\;/nix/store/2dc9bgppqvyd6bd5m4j9zphiyhhd39lv-libcurand-10.3.9.90-dev\;/nix/store/cwy7010iwla9b2v1fx82sp66v12r913x-libcublas-12.8.4.1-lib\;/nix/store/x6d389mfn7v413ia2had715g7rdgghgm-cuda_nvrtc-12.8.93-dev\;/nix/store/22n25ss46s0hgspdp26qk025w9m393cd-libcublas-12.8.4.1-static\;/nix/store/sc5wnfvmk0j73xdppxj25kgk8s98lscs-cuda_nvrtc-12.8.93-lib\;/nix/store/54wqrrh6qbrwmv2wkz6b216ljrqbhcji-cudnn-9.11.0.98\;/nix/store/4sz65s9xk80q9jij0i4zbp9xd1pmr3ja-libcusparse-12.5.8.93-dev\;/nix/store/11bshw90q985bpd9ds649qmgg0x54q7x-cudnn-9.11.0.98-dev\;/nix/store/8dwjdyr7y3dkqlgswpn9swz884lx62gf-cuda_cccl-12.8.90-dev\;/nix/store/1v8m3gdw08hnbs7qa4jbkflm9lg1r5q6-libcurand-10.3.9.90\;/nix/store/jc58pv1cxhvpblrnzgaai60x04q6m0bp-cuda_nvml_dev-12.8.90-lib\;/nix/store/khwhv5d4kmzjpsm785iz3sva6i9sj9r5-libcufile-1.13.1.3-static\;/nix/store/xv6c2jcc3adyqks2xl28p4r0q1g4bc92-cuda_cupti-12.8.90\;/nix/store/a2h2yfjfx0si8smnqmghw7ccj0qbnv81-cuda_cupti-12.8.90-lib\;/nix/store/4cq7zkla3djm6g5gkpzzx4gfikda2k7z-cuda_profiler_api-12.8.90-dev\;/nix/store/5f6dvklv5d0mvygrrf0vzp0smcn7kk01-cuda_nvtx-12.8.90\;/nix/store/xccbzbpcn8r506zdvhvbkqkilhlrh3c5-cuda_cudart-12.8.90-lib\;/nix/store/acbir62i1d7kvka4plmxsq8442z7r1l2-cuda_cudart-12.8.90-stubs\;/nix/store/ckkcbggf4x93zg3xn9xr00jgxs2x5p21-cuda_nvml_dev-12.8.90-static\;/nix/store/ml3bkm8bz1lnjmfd8lyxbjqpi1llasr2-libcusolver-11.7.3.90\;/nix/store/9zlrjnq7lisarny3llszk131vy816x2w-libcufile-1.13.1.3\;/nix/store/90nghg4zsrw6gki8y8hw4id3p31bc8rk-libcusolver-11.7.3.90-dev\;/nix/store/vg32acb8vlqyhkhabbgvmralfw0kwhi3-cuda_cudart-12.8.90-dev\;/nix/store/y27d2s3rcw8d17wcw23glhlj5rhs8d6y-cuda_cudart-12.8.90\;/nix/store/wa9pr3485k3mw8jhv7i9kfzjrqmdl5bb-cuda_nvtx-12.8.90-lib\;/nix/store/n96pib9yj31
n031dmrrx43m61js1r5rn-cuda_nvcc-12.8.93-static\;/nix/store/pabakly3280dnghh3i89wklfm61raf7z-cuda_cupti-12.8.90-sample\;/nix/store/l0jiwp1f0dhigd41qqf408c5qyabz2vd-cudnn-9.11.0.98-static\;/nix/store/95lzbxp68m127n6hyllbr3dh2mlj7y8m-libcufft-11.3.3.83\;/nix/store/lxsd5l6hnqcfgqc1nsn8mmmpx385m3k8-libcusparse-12.5.8.93-lib\;/nix/store/vqg4r8izl1fy2smmw4dwv4x1adkj0rfb-libcufft-11.3.3.83-dev\;/nix/store/4b9rdinnksj1856siw3qmwi9f10480ii-cuda_nvrtc-12.8.93-static\;/nix/store/qh7zggir1ikzh3kvkhi2mqzpyisl4153-libcurand-10.3.9.90-static\;/nix/store/n25l4gcpw8cry4rg2a4c9jw3f53i65zd-libcusolver-11.7.3.90-lib\;/nix/store/xh73kc8spwfvd6w6wc63pyq3zm6qlrja-cuda_nvml_dev-12.8.90\;/nix/store/bgiqy1z8588hgcdzyh9brhc015w3nii0-libcurand-10.3.9.90-lib\;/nix/store/5pvax5f2dg278j43b4llkdxim9y0bjaf-cuda_nvml_dev-12.8.90-dev\;/nix/store/7lf23alvk7yh64flf2mj6smx66sqyz9d-libcufile-1.13.1.3-lib\;/nix/store/klis291y7cza60yzgkxzbid80bnyshmr-cuda_nvtx-12.8.90-dev\;/nix/store/lfqj2ni7r0ir3n840b8r1lh63mnqr0ar-libcusparse-12.5.8.93-static\;/nix/store/qmw5pq21avnfvsk657k0zr4nsgwxa4jm-cuda_cudart-12.8.90-static\;/nix/store/826d39r2b4gwafqsyhvzq2bmqv8ygzrd-cuda_profiler_api-12.8.90\;/nix/store/g52lygjflrsyr6wahpf0rvs3fpna3wq9-cudnn-9.11.0.98-lib\;/nix/store/gxw5c9f7q2f1pmy0g1zyblb8p2p891a4-libcufft-11.3.3.83-lib\;/nix/store/pbsi8w1in7q44z83ndqsaxyzfrr2frgh-cuda_nvrtc-12.8.93\;/nix/store/mps4gsnyk6s676zadvcykjxn08yghk5a-libcufile-1.13.1.3-dev\;/nix/store/mvfnbb1m20fkv2n0j69ky9s9afn8p7h1-libcufft-11.3.3.83-static\;/nix/store/8byjxgnvhcyav2283wcxp752d8280c36-libcusolver-11.7.3.90-static\;/nix/store/gz9xyhflw755r8fcxkc816fp54sj0hl4-cuda_cupti-12.8.90-dev\;/nix/store/jyd8jp3q1d408n8842rb8g6ziviwm7q1-cuda_cupti-12.8.90-static\;/nix/store/qa4d2v0lsm6giyr4b4421qsdygz0yrrh-libcublas-12.8.4.1-dev -DPROTOC_EXE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DProtobuf_PROTOC_EXE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc 
-DProtobuf_PROTOC_EXECUTABLE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DPYBIND11_PYTHONLIBS_OVERWRITE=OFF -DPYTHON_EXECUTABLE=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/bin/python3.13 -DPYTHON_INCLUDE_DIR=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/include/python3.13 -DPYTHON_SITE_PACKAGES=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/lib/python3.13/site-packages +sage_attention-torch-ext> cmake flags: -GNinja -DCMAKE_FIND_USE_SYSTEM_PACKAGE_REGISTRY=OFF -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=OFF -DCMAKE_INSTALL_LOCALEDIR=/nix/store/2bvjs99wvlawr8lk16ihaa9vsjigppcw-sage_attention-torch-ext/share/locale -DCMAKE_INSTALL_LIBEXECDIR=/nix/store/2bvjs99wvlawr8lk16ihaa9vsjigppcw-sage_attention-torch-ext/libexec -DCMAKE_INSTALL_LIBDIR=/nix/store/2bvjs99wvlawr8lk16ihaa9vsjigppcw-sage_attention-torch-ext/lib -DCMAKE_INSTALL_DOCDIR=/nix/store/2bvjs99wvlawr8lk16ihaa9vsjigppcw-sage_attention-torch-ext/share/doc/sage_attention -DCMAKE_INSTALL_INFODIR=/nix/store/2bvjs99wvlawr8lk16ihaa9vsjigppcw-sage_attention-torch-ext/share/info -DCMAKE_INSTALL_MANDIR=/nix/store/2bvjs99wvlawr8lk16ihaa9vsjigppcw-sage_attention-torch-ext/share/man -DCMAKE_INSTALL_INCLUDEDIR=/nix/store/2bvjs99wvlawr8lk16ihaa9vsjigppcw-sage_attention-torch-ext/include -DCMAKE_INSTALL_SBINDIR=/nix/store/2bvjs99wvlawr8lk16ihaa9vsjigppcw-sage_attention-torch-ext/sbin -DCMAKE_INSTALL_BINDIR=/nix/store/2bvjs99wvlawr8lk16ihaa9vsjigppcw-sage_attention-torch-ext/bin -DCMAKE_INSTALL_NAME_DIR=/nix/store/2bvjs99wvlawr8lk16ihaa9vsjigppcw-sage_attention-torch-ext/lib -DCMAKE_POLICY_DEFAULT_CMP0025=NEW -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_STRIP=/nix/store/rgfv9lch0b6ksjzlzsx0mljsb0ypqr8x-gcc-wrapper-13.4.0/bin/strip -DCMAKE_RANLIB=/nix/store/rgfv9lch0b6ksjzlzsx0mljsb0ypqr8x-gcc-wrapper-13.4.0/bin/ranlib 
-DCMAKE_AR=/nix/store/rgfv9lch0b6ksjzlzsx0mljsb0ypqr8x-gcc-wrapper-13.4.0/bin/ar -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_INSTALL_PREFIX=/nix/store/2bvjs99wvlawr8lk16ihaa9vsjigppcw-sage_attention-torch-ext -DPython_EXECUTABLE:STRING=/nix/store/j6r6hpjs8p5m4s3i8cqqavg62fd5z48g-python3-3.13.6-env/bin/python -DCMAKE_CUDA_HOST_COMPILER:STRING=/nix/store/rgfv9lch0b6ksjzlzsx0mljsb0ypqr8x-gcc-wrapper-13.4.0/bin/g++ -DNVCC_THREADS=3 -DCUDAToolkit_INCLUDE_DIR=/nix/store/7iw4ipbdy17yzvqjhxpw03i17kq7f7rj-cuda_nvcc-12.6.85/include\;/nix/store/5f6h6xs5c74iqcjda3y73i290mfwfs9x-cuda_nvml_dev-12.6.77-dev/include\;/nix/store/r26q9f2lhsvimxha44g1xcck3adrdqwg-cuda_nvrtc-12.6.85-dev/include\;/nix/store/nj1a061pvzpq9dr65yj3jpjqcx6pr4fq-cuda_nvtx-12.6.77-dev/include\;/nix/store/9ik1skjb698l6vkx4m4wvx2nrr4sx0na-libcufft-11.3.0.4-dev/include\;/nix/store/vl1dficb0blxzqg6xqzfi5p119jvl2vi-libcusolver-11.7.1.2-dev/include\;/nix/store/n7x9kkzi2jdfj6f6yjwywfhyfmn957zp-cuda_cupti-12.6.80-dev/include\;/nix/store/sskxmb670akk0avrahrl4r6hp7925zh8-cuda_cudart-12.6.77-dev/include\;/nix/store/8a9vz66yzsar01lpgipmzq8skyk3ymkp-cuda_cccl-12.6.77-dev/include\;/nix/store/xd2xrldv3lbg1bk93nr0yccy6j0vhh2k-cudnn-9.11.0.98-dev/include\;/nix/store/0w4g3rxgkw9r0lv737rslqdk7wldmi0n-libcurand-10.3.7.77-dev/include\;/nix/store/m0s4p867fk6wk8ba7ym9yff4mayqjhlw-libcusparse-12.5.4.2-dev/include\;/nix/store/blh9iyvjkmwd871mfjvfhnp7njwgnc6b-cuda_profiler_api-12.6.77-dev/include\;/nix/store/fy71fffqbwg3xgvygn66kd4igj65gblv-libcublas-12.6.4.1-dev/include\;/nix/store/4pwy3k2s52ppzbs3k6d58kda8jhmiim4-libcufile-1.11.1.6-dev/include 
-DCUDAToolkit_ROOT=/nix/store/7iw4ipbdy17yzvqjhxpw03i17kq7f7rj-cuda_nvcc-12.6.85\;/nix/store/1qgrl2sgdj5m7llm2vs9690gd9998psq-cudnn-9.11.0.98\;/nix/store/d2z15dzsgfm4r2yyl16n3wc0sw8z6fia-cuda_cupti-12.6.80-lib\;/nix/store/86ngm5djfbl6a0i43j282680chqz1vr8-libcusparse-12.5.4.2-lib\;/nix/store/bmph9rbyqnyjs02zriwq78kg16h12wi6-libcublas-12.6.4.1-lib\;/nix/store/wny8xmyma0ziffas96ansxgmjfqpw393-cuda_nvrtc-12.6.85-lib\;/nix/store/j40ndiqjiqbiqrbfmgmkzz6w8757cgvk-cuda_nvml_dev-12.6.77-lib\;/nix/store/3ii532blh586xxavim32i21kr84wlcdc-cuda_profiler_api-12.6.77\;/nix/store/j32l8jnzckhdy2lzxgyd59y7p39y6b1d-libcusolver-11.7.1.2-static\;/nix/store/5iv2zpbf4k00ch4c5zfi5b8dlj90y3d3-cuda_cccl-12.6.77\;/nix/store/a8yi28jqv5185bbv10jpjja3x98i86hm-cuda_cudart-12.6.77-stubs\;/nix/store/ya85qn68jv6mlq6gh6phh5hwk3dkynag-cuda_cudart-12.6.77-static\;/nix/store/m65ribrsnk3gbabcx9ah6phgiil19j01-libcufile-1.11.1.6\;/nix/store/5f6h6xs5c74iqcjda3y73i290mfwfs9x-cuda_nvml_dev-12.6.77-dev\;/nix/store/r26q9f2lhsvimxha44g1xcck3adrdqwg-cuda_nvrtc-12.6.85-dev\;/nix/store/nj1a061pvzpq9dr65yj3jpjqcx6pr4fq-cuda_nvtx-12.6.77-dev\;/nix/store/bcvj4g3f3n6cpb6czcb5k8zdmyd94fwi-cuda_nvtx-12.6.77-lib\;/nix/store/9ik1skjb698l6vkx4m4wvx2nrr4sx0na-libcufft-11.3.0.4-dev\;/nix/store/k5rbpivsz3ilsxg91pgigp6la8ln3cv9-cuda_cupti-12.6.80\;/nix/store/vl1dficb0blxzqg6xqzfi5p119jvl2vi-libcusolver-11.7.1.2-dev\;/nix/store/f87x0n0gi2d7rxh1ja92za2ixcw60q2p-cuda_nvtx-12.6.77\;/nix/store/n7x9kkzi2jdfj6f6yjwywfhyfmn957zp-cuda_cupti-12.6.80-dev\;/nix/store/m0fwdgh4nmrjd0q9v4m2ly63qbcq2hi2-cuda_cudart-12.6.77\;/nix/store/qfaxx4b8l1alrrl0gbyb23k3j850c0v5-libcurand-10.3.7.77-static\;/nix/store/w1npzy8mfl28w7cib5idkg6nvlbzhpzq-libcufile-1.11.1.6-lib\;/nix/store/8abbm2gd77dv0l3acw0s18wln36aa0l5-cuda_cudart-12.6.77-lib\;/nix/store/ykb9bv2lqkf1wzy73q96cb04pybx9xa2-cuda_nvcc-12.6.85-static\;/nix/store/nw9ws2qvhgdb33qgfx4iqj517814qq8y-libcufft-11.3.0.4\;/nix/store/sskxmb670akk0avrahrl4r6hp7925zh8-cuda_cudart-12.6.77-dev\;/nix/store/mfc3ah
6lwfd8dfbs77b0z9i75c471b0n-libcufft-11.3.0.4-static\;/nix/store/zk3cg1ws6cskrzyhdr5d68f8zrkfk77d-cuda_nvrtc-12.6.85-static\;/nix/store/pcrirrvn2ya5d3r1y18s2zj4pm2jladw-libcusolver-11.7.1.2\;/nix/store/qdn67x8jrwr418air16kwicya4d747pq-libcufft-11.3.0.4-lib\;/nix/store/dg8hyrzy7sh3wdhcr4ywsz05cvl6vfyc-libcusparse-12.5.4.2\;/nix/store/8a9vz66yzsar01lpgipmzq8skyk3ymkp-cuda_cccl-12.6.77-dev\;/nix/store/wmcrrdxd3db58nklyp7yf90kknfdx6b5-libcurand-10.3.7.77-lib\;/nix/store/xd2xrldv3lbg1bk93nr0yccy6j0vhh2k-cudnn-9.11.0.98-dev\;/nix/store/0w4g3rxgkw9r0lv737rslqdk7wldmi0n-libcurand-10.3.7.77-dev\;/nix/store/jr1397g6pshvil5n4lnvp7dm24dm71h8-libcublas-12.6.4.1-static\;/nix/store/wq0wv7df58h6bgggnz964sk8m1hbkxxp-cuda_cupti-12.6.80-sample\;/nix/store/m0s4p867fk6wk8ba7ym9yff4mayqjhlw-libcusparse-12.5.4.2-dev\;/nix/store/blh9iyvjkmwd871mfjvfhnp7njwgnc6b-cuda_profiler_api-12.6.77-dev\;/nix/store/ngwsphsxf906z7cgwg32d1w83p809ywl-cudnn-9.11.0.98-static\;/nix/store/07zlxn68jyf4s263xafnjid55grmi7a2-cuda_nvrtc-12.6.85\;/nix/store/zyh7hqq402zc7dhafhbh9vycyzcfq256-libcurand-10.3.7.77\;/nix/store/x7mww4k0zzzb7bnffv0b22jqbyf1mg3v-cuda_cupti-12.6.80-static\;/nix/store/xvlapjc6spss1kvbjlq97m6pk19hfrxz-cuda_nvml_dev-12.6.77\;/nix/store/7j4zf0r8flh7l4x5pm1mgqb2vcabmcdj-libcusolver-11.7.1.2-lib\;/nix/store/gs8gw8bgjccrjxlyzhxa7h85gkxgqwhn-libcufile-1.11.1.6-static\;/nix/store/p9dnsv7mv8mqm9aisrckq8lm3zs3l7dk-cudnn-9.11.0.98-lib\;/nix/store/fy71fffqbwg3xgvygn66kd4igj65gblv-libcublas-12.6.4.1-dev\;/nix/store/dpska4iiya4xa5zzzmqzx3ljws73bnds-cuda_nvml_dev-12.6.77-static\;/nix/store/gzykkbwmch7pxgfzf86fg0b928lz6b36-libcusparse-12.5.4.2-static\;/nix/store/nqn7lvw8gbwbymdhz4nak9wf9b5bbah9-libcublas-12.6.4.1\;/nix/store/4pwy3k2s52ppzbs3k6d58kda8jhmiim4-libcufile-1.11.1.6-dev -DPROTOC_EXE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DProtobuf_PROTOC_EXE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc 
-DProtobuf_PROTOC_EXECUTABLE=/nix/store/g82m0ia59azh4a1bcrk0r15qck6hp8da-protobuf-31.1/bin/protoc -DPYBIND11_PYTHONLIBS_OVERWRITE=OFF -DPYTHON_EXECUTABLE=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/bin/python3.13 -DPYTHON_INCLUDE_DIR=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/include/python3.13 -DPYTHON_SITE_PACKAGES=/nix/store/iyff8129iampdw13nlfqalzhxy8y1hi9-python3-3.13.6/lib/python3.13/site-packages sage_attention-torch-ext> -- The CXX compiler identification is GNU 14.3.0 sage_attention-torch-ext> -- The CXX compiler identification is GNU 14.3.0 +sage_attention-torch-ext> -- The CXX compiler identification is GNU 13.4.0 +sage_attention-torch-ext> -- Detecting CXX compiler ABI info sage_attention-torch-ext> -- Detecting CXX compiler ABI info sage_attention-torch-ext> -- Detecting CXX compiler ABI info sage_attention-torch-ext> -- The CXX compiler identification is GNU 14.3.0 +sage_attention-torch-ext> -- The CXX compiler identification is GNU 13.4.0 +sage_attention-torch-ext> -- Detecting CXX compiler ABI info sage_attention-torch-ext> -- Detecting CXX compiler ABI info sage_attention-torch-ext> -- Detecting CXX compiler ABI info - done sage_attention-torch-ext> -- Detecting CXX compiler ABI info - done sage_attention-torch-ext> -- Detecting CXX compiler ABI info - done sage_attention-torch-ext> -- Detecting CXX compiler ABI info - done -sage_attention-torch-ext> -- Check for working CXX compiler: /nix/store/rgfv9lch0b6ksjzlzsx0mljsb0ypqr8x-gcc-wrapper-13.4.0/bin/g++ - skipped +sage_attention-torch-ext> -- Detecting CXX compiler ABI info - done +sage_attention-torch-ext> -- Check for working CXX compiler: /nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/g++ - skipped sage_attention-torch-ext> -- Detecting CXX compile features sage_attention-torch-ext> -- Detecting CXX compile features - done sage_attention-torch-ext> -- Check for working CXX compiler: 
/nix/store/rgfv9lch0b6ksjzlzsx0mljsb0ypqr8x-gcc-wrapper-13.4.0/bin/g++ - skipped -sage_attention-torch-ext> -- Detecting CXX compile features -sage_attention-torch-ext> -- Detecting CXX compile features - done -sage_attention-torch-ext> -- Detecting CXX compiler ABI info - done sage_attention-torch-ext> -- Check for working CXX compiler: /nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/g++ - skipped sage_attention-torch-ext> -- Detecting CXX compile features +sage_attention-torch-ext> -- Detecting CXX compile features sage_attention-torch-ext> -- Detecting CXX compile features - done -sage_attention-torch-ext> -- Check for working CXX compiler: /nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/g++ - skipped +sage_attention-torch-ext> -- Detecting CXX compile features - done +sage_attention-torch-ext> -- Check for working CXX compiler: /nix/store/rgfv9lch0b6ksjzlzsx0mljsb0ypqr8x-gcc-wrapper-13.4.0/bin/g++ - skipped sage_attention-torch-ext> -- Detecting CXX compile features sage_attention-torch-ext> -- Detecting CXX compile features - done -sage_attention-torch-ext> -- FetchContent base directory: /build/source/build/_deps sage_attention-torch-ext> -- Check for working CXX compiler: /nix/store/d8likaw8xxdmh2qmmasbm88h74q6a2gr-gcc-wrapper-14.3.0/bin/g++ - skipped sage_attention-torch-ext> -- Detecting CXX compile features sage_attention-torch-ext> -- Detecting CXX compile features - done @@ -112,24 +111,25 @@ sage_attention-torch-ext> -- FetchContent base directory: /build/source/build/_d sage_attention-torch-ext> -- FetchContent base directory: /build/source/build/_deps sage_attention-torch-ext> -- FetchContent base directory: /build/source/build/_deps sage_attention-torch-ext> -- FetchContent base directory: /build/source/build/_deps -sage_attention-torch-ext> -- Found Python: /nix/store/j6r6hpjs8p5m4s3i8cqqavg62fd5z48g-python3-3.13.6-env/bin/python (found version "3.13.6") found components: Development Development.SABIModule 
Interpreter Development.Module Development.Embed +sage_attention-torch-ext> -- FetchContent base directory: /build/source/build/_deps sage_attention-torch-ext> -- Found Python: /nix/store/r3gwdvvsgl1csl12f4pkhz0jhsch7bdy-python3-3.13.6-env/bin/python (found version "3.13.6") found components: Development Development.SABIModule Interpreter Development.Module Development.Embed sage_attention-torch-ext> -- Found Python: /nix/store/aikr517kmcd8r2nrrj70jq71d7352qiq-python3-3.13.6-env/bin/python (found version "3.13.6") found components: Development Development.SABIModule Interpreter Development.Module Development.Embed sage_attention-torch-ext> -- Found Python: /nix/store/qal2apcjwlw2p2kk05dwqdgzh8ml687l-python3-3.13.6-env/bin/python (found version "3.13.6") found components: Development Development.SABIModule Interpreter Development.Module Development.Embed +sage_attention-torch-ext> -- Found Python: /nix/store/j6r6hpjs8p5m4s3i8cqqavg62fd5z48g-python3-3.13.6-env/bin/python (found version "3.13.6") found components: Development Development.SABIModule Interpreter Development.Module Development.Embed sage_attention-torch-ext> -- Found Python: /nix/store/wirj6dihrpcch7idfd7jy4l0hqfsgkk1-python3-3.13.6-env/bin/python (found version "3.13.6") found components: Development Development.SABIModule Interpreter Development.Module Development.Embed +sage_attention-torch-ext> -- Found CUDA: /nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93 (found version "12.8") sage_attention-torch-ext> -- Found CUDA: /nix/store/7iw4ipbdy17yzvqjhxpw03i17kq7f7rj-cuda_nvcc-12.6.85 (found version "12.6") sage_attention-torch-ext> -- Found CUDA: /nix/store/8zrv6h6f2cfz34pwq012n4cx2zrv5m1s-cuda_nvcc-12.9.86 (found version "12.9") sage_attention-torch-ext> -- Found CUDA: /nix/store/7iw4ipbdy17yzvqjhxpw03i17kq7f7rj-cuda_nvcc-12.6.85 (found version "12.6") sage_attention-torch-ext> -- Found CUDA: /nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93 (found version "12.8") 
-sage_attention-torch-ext> -- Found CUDA: /nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93 (found version "12.8") sage_attention-torch-ext> -- The CUDA compiler identification is NVIDIA 12.6.85 with host compiler GNU 13.4.0 sage_attention-torch-ext> -- Detecting CUDA compiler ABI info -sage_attention-torch-ext> -- The CUDA compiler identification is NVIDIA 12.9.86 with host compiler GNU 14.3.0 -sage_attention-torch-ext> -- Detecting CUDA compiler ABI info sage_attention-torch-ext> -- The CUDA compiler identification is NVIDIA 12.6.85 with host compiler GNU 13.4.0 sage_attention-torch-ext> -- Detecting CUDA compiler ABI info sage_attention-torch-ext> -- The CUDA compiler identification is NVIDIA 12.8.93 with host compiler GNU 14.3.0 sage_attention-torch-ext> -- Detecting CUDA compiler ABI info +sage_attention-torch-ext> -- The CUDA compiler identification is NVIDIA 12.9.86 with host compiler GNU 14.3.0 +sage_attention-torch-ext> -- Detecting CUDA compiler ABI info sage_attention-torch-ext> -- The CUDA compiler identification is NVIDIA 12.8.93 with host compiler GNU 14.3.0 sage_attention-torch-ext> -- Detecting CUDA compiler ABI info sage_attention-torch-ext> -- Detecting CUDA compiler ABI info - done @@ -138,26 +138,26 @@ sage_attention-torch-ext> -- Check for working CUDA compiler: /nix/store/7iw4ipb sage_attention-torch-ext> -- Detecting CUDA compile features sage_attention-torch-ext> -- Detecting CUDA compile features - done sage_attention-torch-ext> -- Detecting CUDA compiler ABI info - done -sage_attention-torch-ext> -- Found CUDAToolkit: /nix/store/7iw4ipbdy17yzvqjhxpw03i17kq7f7rj-cuda_nvcc-12.6.85/include;/nix/store/fy71fffqbwg3xgvygn66kd4igj65gblv-libcublas-12.6.4.1-dev/include (found version "12.6.85") -sage_attention-torch-ext> -- Performing Test CMAKE_HAVE_LIBC_PTHREAD -sage_attention-torch-ext> -- Check for working CUDA compiler: /nix/store/8zrv6h6f2cfz34pwq012n4cx2zrv5m1s-cuda_nvcc-12.9.86/bin/nvcc - skipped -sage_attention-torch-ext> -- 
Detecting CUDA compile features -sage_attention-torch-ext> -- Detecting CUDA compile features - done -sage_attention-torch-ext> -- Detecting CUDA compiler ABI info - done -sage_attention-torch-ext> -- Found CUDAToolkit: /nix/store/8zrv6h6f2cfz34pwq012n4cx2zrv5m1s-cuda_nvcc-12.9.86/include;/nix/store/dd8wl3nnsigw2gj5bwaiswla97jpw1jz-libcublas-12.9.1.4-dev/include (found version "12.9.86") -sage_attention-torch-ext> -- Performing Test CMAKE_HAVE_LIBC_PTHREAD sage_attention-torch-ext> -- Check for working CUDA compiler: /nix/store/7iw4ipbdy17yzvqjhxpw03i17kq7f7rj-cuda_nvcc-12.6.85/bin/nvcc - skipped sage_attention-torch-ext> -- Detecting CUDA compile features sage_attention-torch-ext> -- Detecting CUDA compile features - done +sage_attention-torch-ext> -- Found CUDAToolkit: /nix/store/7iw4ipbdy17yzvqjhxpw03i17kq7f7rj-cuda_nvcc-12.6.85/include;/nix/store/fy71fffqbwg3xgvygn66kd4igj65gblv-libcublas-12.6.4.1-dev/include (found version "12.6.85") +sage_attention-torch-ext> -- Performing Test CMAKE_HAVE_LIBC_PTHREAD sage_attention-torch-ext> -- Detecting CUDA compiler ABI info - done -sage_attention-torch-ext> -- Check for working CUDA compiler: /nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93/bin/nvcc - skipped sage_attention-torch-ext> -- Found CUDAToolkit: /nix/store/7iw4ipbdy17yzvqjhxpw03i17kq7f7rj-cuda_nvcc-12.6.85/include;/nix/store/fy71fffqbwg3xgvygn66kd4igj65gblv-libcublas-12.6.4.1-dev/include (found version "12.6.85") sage_attention-torch-ext> -- Performing Test CMAKE_HAVE_LIBC_PTHREAD +sage_attention-torch-ext> -- Check for working CUDA compiler: /nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93/bin/nvcc - skipped +sage_attention-torch-ext> -- Detecting CUDA compile features +sage_attention-torch-ext> -- Detecting CUDA compile features - done +sage_attention-torch-ext> -- Check for working CUDA compiler: /nix/store/8zrv6h6f2cfz34pwq012n4cx2zrv5m1s-cuda_nvcc-12.9.86/bin/nvcc - skipped sage_attention-torch-ext> -- Detecting CUDA compile 
features sage_attention-torch-ext> -- Detecting CUDA compile features - done +sage_attention-torch-ext> -- Detecting CUDA compiler ABI info - done +sage_attention-torch-ext> -- Found CUDAToolkit: /nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93/include;/nix/store/qa4d2v0lsm6giyr4b4421qsdygz0yrrh-libcublas-12.8.4.1-dev/include (found version "12.8.93") +sage_attention-torch-ext> -- Performing Test CMAKE_HAVE_LIBC_PTHREAD sage_attention-torch-ext> -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed sage_attention-torch-ext> -- Looking for pthread_create in pthreads -sage_attention-torch-ext> -- Found CUDAToolkit: /nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93/include;/nix/store/qa4d2v0lsm6giyr4b4421qsdygz0yrrh-libcublas-12.8.4.1-dev/include (found version "12.8.93") +sage_attention-torch-ext> -- Found CUDAToolkit: /nix/store/8zrv6h6f2cfz34pwq012n4cx2zrv5m1s-cuda_nvcc-12.9.86/include;/nix/store/dd8wl3nnsigw2gj5bwaiswla97jpw1jz-libcublas-12.9.1.4-dev/include (found version "12.9.86") sage_attention-torch-ext> -- Performing Test CMAKE_HAVE_LIBC_PTHREAD sage_attention-torch-ext> -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed sage_attention-torch-ext> -- Looking for pthread_create in pthreads @@ -168,22 +168,22 @@ sage_attention-torch-ext> -- Looking for pthread_create in pthreads - not found sage_attention-torch-ext> -- Looking for pthread_create in pthread sage_attention-torch-ext> -- Found CUDAToolkit: /nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93/include;/nix/store/qa4d2v0lsm6giyr4b4421qsdygz0yrrh-libcublas-12.8.4.1-dev/include (found version "12.8.93") sage_attention-torch-ext> -- Performing Test CMAKE_HAVE_LIBC_PTHREAD -sage_attention-torch-ext> -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed -sage_attention-torch-ext> -- Looking for pthread_create in pthreads sage_attention-torch-ext> -- Looking for pthread_create in pthreads - not found sage_attention-torch-ext> -- Looking for pthread_create in pthread 
sage_attention-torch-ext> -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed sage_attention-torch-ext> -- Looking for pthread_create in pthreads -sage_attention-torch-ext> -- Looking for pthread_create in pthread - found -sage_attention-torch-ext> -- Found Threads: TRUE -sage_attention-torch-ext> -- Looking for pthread_create in pthreads - not found -sage_attention-torch-ext> -- Looking for pthread_create in pthread sage_attention-torch-ext> -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed sage_attention-torch-ext> -- Looking for pthread_create in pthreads +sage_attention-torch-ext> -- Looking for pthread_create in pthread - found +sage_attention-torch-ext> -- Found Threads: TRUE sage_attention-torch-ext> -- Looking for pthread_create in pthreads - not found sage_attention-torch-ext> -- Looking for pthread_create in pthread sage_attention-torch-ext> -- Looking for pthread_create in pthread - found sage_attention-torch-ext> -- Found Threads: TRUE +sage_attention-torch-ext> -- Looking for pthread_create in pthreads - not found +sage_attention-torch-ext> -- Looking for pthread_create in pthread +sage_attention-torch-ext> -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed +sage_attention-torch-ext> -- Looking for pthread_create in pthreads sage_attention-torch-ext> -- Looking for pthread_create in pthread - found sage_attention-torch-ext> -- Found Threads: TRUE sage_attention-torch-ext> -- Looking for pthread_create in pthreads - not found @@ -195,12 +195,12 @@ sage_attention-torch-ext> -- Found Threads: TRUE sage_attention-torch-ext> -- PyTorch: CUDA detected: 12.6 sage_attention-torch-ext> -- PyTorch: CUDA nvcc is: /nix/store/7iw4ipbdy17yzvqjhxpw03i17kq7f7rj-cuda_nvcc-12.6.85/bin/nvcc sage_attention-torch-ext> -- PyTorch: CUDA toolkit directory: /nix/store/7iw4ipbdy17yzvqjhxpw03i17kq7f7rj-cuda_nvcc-12.6.85 -sage_attention-torch-ext> -- PyTorch: CUDA detected: 12.9 -sage_attention-torch-ext> -- PyTorch: CUDA nvcc is: 
/nix/store/8zrv6h6f2cfz34pwq012n4cx2zrv5m1s-cuda_nvcc-12.9.86/bin/nvcc -sage_attention-torch-ext> -- PyTorch: CUDA toolkit directory: /nix/store/8zrv6h6f2cfz34pwq012n4cx2zrv5m1s-cuda_nvcc-12.9.86 sage_attention-torch-ext> -- PyTorch: CUDA detected: 12.6 sage_attention-torch-ext> -- PyTorch: CUDA nvcc is: /nix/store/7iw4ipbdy17yzvqjhxpw03i17kq7f7rj-cuda_nvcc-12.6.85/bin/nvcc sage_attention-torch-ext> -- PyTorch: CUDA toolkit directory: /nix/store/7iw4ipbdy17yzvqjhxpw03i17kq7f7rj-cuda_nvcc-12.6.85 +sage_attention-torch-ext> -- PyTorch: CUDA detected: 12.9 +sage_attention-torch-ext> -- PyTorch: CUDA nvcc is: /nix/store/8zrv6h6f2cfz34pwq012n4cx2zrv5m1s-cuda_nvcc-12.9.86/bin/nvcc +sage_attention-torch-ext> -- PyTorch: CUDA toolkit directory: /nix/store/8zrv6h6f2cfz34pwq012n4cx2zrv5m1s-cuda_nvcc-12.9.86 sage_attention-torch-ext> -- PyTorch: CUDA detected: 12.8 sage_attention-torch-ext> -- PyTorch: CUDA nvcc is: /nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93/bin/nvcc sage_attention-torch-ext> -- PyTorch: CUDA toolkit directory: /nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93 @@ -219,6 +219,7 @@ sage_attention-torch-ext> -- USE_CUSPARSELT is set to 0. Compiling without cuSPA sage_attention-torch-ext> -- USE_CUDSS is set to 0. Compiling without cuDSS support sage_attention-torch-ext> -- USE_CUFILE is set to 0. Compiling without cuFile support sage_attention-torch-ext> -- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_89,code=sm_89;-gencode;arch=compute_90,code=sm_90 +sage_attention-torch-ext> -- PyTorch: Header version is: 12.6 sage_attention-torch-ext> CMake Warning at /nix/store/ld6fk094jhhsnbip1406vrky9lmyxbax-python3.13-torch-2.8.0/lib/python3.13/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message): sage_attention-torch-ext> static library kineto_LIBRARY-NOTFOUND not found. 
sage_attention-torch-ext> Call Stack (most recent call first): @@ -229,36 +230,6 @@ sage_attention-torch-ext> sage_attention-torch-ext> -- Found Torch: /nix/store/pg32mpjmckfs38anjzgyvk2ljfw12pb3-python3.13-torch-2.8.0-lib/lib/libtorch.so sage_attention-torch-ext> -- CUDA target architectures: 7.0;7.5;8.0;8.6;8.9;9.0 sage_attention-torch-ext> -- CUDA supported target architectures: 7.0;7.5;8.0;8.6;8.9;9.0 -sage_attention-torch-ext> -- PyTorch: Header version is: 12.9 -sage_attention-torch-ext> -- PyTorch: CUDA detected: 12.8 -sage_attention-torch-ext> -- PyTorch: CUDA nvcc is: /nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93/bin/nvcc -sage_attention-torch-ext> -- PyTorch: CUDA toolkit directory: /nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93 -sage_attention-torch-ext> -- Found Python: /nix/store/aikr517kmcd8r2nrrj70jq71d7352qiq-python3-3.13.6-env/bin/python (found version "3.13.6") found components: Interpreter -sage_attention-torch-ext> CMake Warning at /nix/store/483ma0klnbln74izv5jiyila52bfwqxh-python3.13-torch-2.8.0/lib/python3.13/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:146 (message): -sage_attention-torch-ext> Failed to compute shorthash for libnvrtc.so -sage_attention-torch-ext> Call Stack (most recent call first): -sage_attention-torch-ext> /nix/store/483ma0klnbln74izv5jiyila52bfwqxh-python3.13-torch-2.8.0/lib/python3.13/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:86 (include) -sage_attention-torch-ext> /nix/store/483ma0klnbln74izv5jiyila52bfwqxh-python3.13-torch-2.8.0/lib/python3.13/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package) -sage_attention-torch-ext> CMakeLists.txt:30 (find_package) -sage_attention-torch-ext> -sage_attention-torch-ext> -sage_attention-torch-ext> -- USE_CUDNN is set to 0. Compiling without cuDNN support -sage_attention-torch-ext> -- USE_CUSPARSELT is set to 0. Compiling without cuSPARSELt support -sage_attention-torch-ext> -- USE_CUDSS is set to 0. 
Compiling without cuDSS support -sage_attention-torch-ext> -- USE_CUFILE is set to 0. Compiling without cuFile support -sage_attention-torch-ext> -- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_89,code=sm_89;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_100,code=sm_100;-gencode;arch=compute_101,code=sm_101;-gencode;arch=compute_120,code=sm_120 -sage_attention-torch-ext> -- PyTorch: Header version is: 12.6 -sage_attention-torch-ext> CMake Warning at /nix/store/483ma0klnbln74izv5jiyila52bfwqxh-python3.13-torch-2.8.0/lib/python3.13/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message): -sage_attention-torch-ext> static library kineto_LIBRARY-NOTFOUND not found. -sage_attention-torch-ext> Call Stack (most recent call first): -sage_attention-torch-ext> /nix/store/483ma0klnbln74izv5jiyila52bfwqxh-python3.13-torch-2.8.0/lib/python3.13/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:125 (append_torchlib_if_found) -sage_attention-torch-ext> CMakeLists.txt:30 (find_package) -sage_attention-torch-ext> -sage_attention-torch-ext> -sage_attention-torch-ext> -- Found Torch: /nix/store/zccgvlbr93bhyia3sr9f2mddmkp2jyx7-python3.13-torch-2.8.0-lib/lib/libtorch.so -sage_attention-torch-ext> -- CUDA target architectures: 7.0;7.5;8.0;8.6;8.9;9.0;10.0;10.1;12.0 -sage_attention-torch-ext> -- CUDA supported target architectures: 7.0;7.5;8.0;8.6;8.9;9.0;10.0;10.1;12.0 -sage_attention-torch-ext> -- PyTorch: Header version is: 12.8 sage_attention-torch-ext> -- Found Python: /nix/store/j6r6hpjs8p5m4s3i8cqqavg62fd5z48g-python3-3.13.6-env/bin/python (found version "3.13.6") found components: Interpreter sage_attention-torch-ext> CMake Warning at /nix/store/dzz5brlw0xzs9hp3v8fvvwcvkmsr3ls9-python3.13-torch-2.7.1/lib/python3.13/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:146 (message): 
sage_attention-torch-ext> Failed to compute shorthash for libnvrtc.so @@ -305,7 +276,36 @@ sage_attention-torch-ext> CMakeLists.txt:30 (find_package) sage_attention-torch-ext> sage_attention-torch-ext> sage_attention-torch-ext> -- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_89,code=sm_89;-gencode;arch=compute_90,code=sm_90 +sage_attention-torch-ext> CMake Warning at /nix/store/dzz5brlw0xzs9hp3v8fvvwcvkmsr3ls9-python3.13-torch-2.7.1/lib/python3.13/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message): +sage_attention-torch-ext> static library kineto_LIBRARY-NOTFOUND not found. +sage_attention-torch-ext> Call Stack (most recent call first): +sage_attention-torch-ext> /nix/store/dzz5brlw0xzs9hp3v8fvvwcvkmsr3ls9-python3.13-torch-2.7.1/lib/python3.13/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:125 (append_torchlib_if_found) +sage_attention-torch-ext> CMakeLists.txt:30 (find_package) +sage_attention-torch-ext> +sage_attention-torch-ext> +sage_attention-torch-ext> -- Found Torch: /nix/store/8sicfhvzq84gnxiwybyjgp80pcynamzn-python3.13-torch-2.7.1-lib/lib/libtorch.so +sage_attention-torch-ext> -- CUDA target architectures: 7.0;7.5;8.0;8.6;8.9;9.0 +sage_attention-torch-ext> -- CUDA supported target architectures: 7.0;7.5;8.0;8.6;8.9;9.0 +sage_attention-torch-ext> -- PyTorch: CUDA detected: 12.8 +sage_attention-torch-ext> -- PyTorch: CUDA nvcc is: /nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93/bin/nvcc +sage_attention-torch-ext> -- PyTorch: CUDA toolkit directory: /nix/store/8kyv8ffbfvksnqmm1kaz0llysg7dpn9z-cuda_nvcc-12.8.93 +sage_attention-torch-ext> -- PyTorch: Header version is: 12.9 +sage_attention-torch-ext> -- PyTorch: Header version is: 12.8 +sage_attention-torch-ext> -- Found Python: /nix/store/aikr517kmcd8r2nrrj70jq71d7352qiq-python3-3.13.6-env/bin/python (found version "3.13.6") 
found components: Interpreter +sage_attention-torch-ext> CMake Warning at /nix/store/483ma0klnbln74izv5jiyila52bfwqxh-python3.13-torch-2.8.0/lib/python3.13/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:146 (message): +sage_attention-torch-ext> Failed to compute shorthash for libnvrtc.so +sage_attention-torch-ext> Call Stack (most recent call first): +sage_attention-torch-ext> /nix/store/483ma0klnbln74izv5jiyila52bfwqxh-python3.13-torch-2.8.0/lib/python3.13/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:86 (include) +sage_attention-torch-ext> /nix/store/483ma0klnbln74izv5jiyila52bfwqxh-python3.13-torch-2.8.0/lib/python3.13/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package) +sage_attention-torch-ext> CMakeLists.txt:30 (find_package) +sage_attention-torch-ext> +sage_attention-torch-ext> +sage_attention-torch-ext> -- USE_CUDNN is set to 0. Compiling without cuDNN support +sage_attention-torch-ext> -- USE_CUSPARSELT is set to 0. Compiling without cuSPARSELt support +sage_attention-torch-ext> -- USE_CUDSS is set to 0. Compiling without cuDSS support +sage_attention-torch-ext> -- USE_CUFILE is set to 0. 
Compiling without cuFile support sage_attention-torch-ext> -- Found Python: /nix/store/qal2apcjwlw2p2kk05dwqdgzh8ml687l-python3-3.13.6-env/bin/python (found version "3.13.6") found components: Interpreter +sage_attention-torch-ext> -- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_89,code=sm_89;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_100,code=sm_100;-gencode;arch=compute_101,code=sm_101;-gencode;arch=compute_120,code=sm_120 sage_attention-torch-ext> CMake Warning at /nix/store/6drs80sxjhskdki55g5k1dw0jzbd258w-python3.13-torch-2.8.0/lib/python3.13/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:146 (message): sage_attention-torch-ext> Failed to compute shorthash for libnvrtc.so sage_attention-torch-ext> Call Stack (most recent call first): @@ -319,16 +319,13 @@ sage_attention-torch-ext> -- USE_CUSPARSELT is set to 0. Compiling without cuSPA sage_attention-torch-ext> -- USE_CUDSS is set to 0. Compiling without cuDSS support sage_attention-torch-ext> -- USE_CUFILE is set to 0. 
Compiling without cuFile support sage_attention-torch-ext> -- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_89,code=sm_89;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_100,code=sm_100;-gencode;arch=compute_101,code=sm_101;-gencode;arch=compute_120,code=sm_120 -sage_attention-torch-ext> CMake Warning at /nix/store/dzz5brlw0xzs9hp3v8fvvwcvkmsr3ls9-python3.13-torch-2.7.1/lib/python3.13/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message): +sage_attention-torch-ext> CMake Warning at /nix/store/483ma0klnbln74izv5jiyila52bfwqxh-python3.13-torch-2.8.0/lib/python3.13/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message): sage_attention-torch-ext> static library kineto_LIBRARY-NOTFOUND not found. sage_attention-torch-ext> Call Stack (most recent call first): -sage_attention-torch-ext> /nix/store/dzz5brlw0xzs9hp3v8fvvwcvkmsr3ls9-python3.13-torch-2.7.1/lib/python3.13/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:125 (append_torchlib_if_found) +sage_attention-torch-ext> /nix/store/483ma0klnbln74izv5jiyila52bfwqxh-python3.13-torch-2.8.0/lib/python3.13/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:125 (append_torchlib_if_found) sage_attention-torch-ext> CMakeLists.txt:30 (find_package) sage_attention-torch-ext> sage_attention-torch-ext> -sage_attention-torch-ext> -- Found Torch: /nix/store/8sicfhvzq84gnxiwybyjgp80pcynamzn-python3.13-torch-2.7.1-lib/lib/libtorch.so -sage_attention-torch-ext> -- CUDA target architectures: 7.0;7.5;8.0;8.6;8.9;9.0 -sage_attention-torch-ext> -- CUDA supported target architectures: 7.0;7.5;8.0;8.6;8.9;9.0 sage_attention-torch-ext> CMake Warning at /nix/store/6drs80sxjhskdki55g5k1dw0jzbd258w-python3.13-torch-2.8.0/lib/python3.13/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message): sage_attention-torch-ext> static library 
kineto_LIBRARY-NOTFOUND not found. sage_attention-torch-ext> Call Stack (most recent call first): @@ -336,6 +333,9 @@ sage_attention-torch-ext> /nix/store/6drs80sxjhskdki55g5k1dw0jzbd258w-python3. sage_attention-torch-ext> CMakeLists.txt:30 (find_package) sage_attention-torch-ext> sage_attention-torch-ext> +sage_attention-torch-ext> -- Found Torch: /nix/store/zccgvlbr93bhyia3sr9f2mddmkp2jyx7-python3.13-torch-2.8.0-lib/lib/libtorch.so +sage_attention-torch-ext> -- CUDA target architectures: 7.0;7.5;8.0;8.6;8.9;9.0;10.0;10.1;12.0 +sage_attention-torch-ext> -- CUDA supported target architectures: 7.0;7.5;8.0;8.6;8.9;9.0;10.0;10.1;12.0 sage_attention-torch-ext> -- Found Torch: /nix/store/mrq1wi2biib2p1mks17g8g5sc4fd492r-python3.13-torch-2.8.0-lib/lib/libtorch.so sage_attention-torch-ext> -- CUDA target architectures: 7.0;7.5;8.0;8.6;8.9;9.0;10.0;10.1;12.0 sage_attention-torch-ext> -- CUDA supported target architectures: 7.0;7.5;8.0;8.6;8.9;9.0;10.0;10.1;12.0 @@ -396,11 +396,11 @@ sage_attention-torch-ext> sage_attention-torch-ext> -- Found Torch: /nix/store/35sj4in2ddx47klyg96qmkpd4vh8py94-python3.13-torch-2.7.1-lib/lib/libtorch.so sage_attention-torch-ext> -- CUDA target architectures: 7.0;7.5;8.0;8.6;8.9;9.0;10.0;10.1;12.0 sage_attention-torch-ext> -- CUDA supported target architectures: 7.0;7.5;8.0;8.6;8.9;9.0;10.0;10.1;12.0 -sage_attention-torch-ext> -- Capabilities for kernel _fused: 9.0a;8.0;8.9 sage_attention-torch-ext> -- Capabilities for kernel _qattn: 9.0a;8.0;8.9 sage_attention-torch-ext> -- Capabilities for kernel _qattn_sm80: 8.0 -sage_attention-torch-ext> -- Capabilities for kernel _qattn_sm89: 8.9 sage_attention-torch-ext> -- Capabilities for kernel _qattn_sm90: 9.0a +sage_attention-torch-ext> -- Capabilities for kernel _qattn_sm89: 8.9 +sage_attention-torch-ext> -- Capabilities for kernel _fused: 9.0a;8.0;8.9 sage_attention-torch-ext> -- Configuring done (9.3s) sage_attention-torch-ext> -- Generating done (0.0s) sage_attention-torch-ext> CMake Warning: 
@@ -433,12 +433,12 @@ sage_attention-torch-ext> cmake: enabled parallel building sage_attention-torch-ext> cmake: enabled parallel installing sage_attention-torch-ext> Running phase: buildPhase sage_attention-torch-ext> build flags: -j21 +sage_attention-torch-ext> -- Capabilities for kernel _qattn: 9.0a;8.0;8.9 sage_attention-torch-ext> -- Capabilities for kernel _qattn_sm89: 8.9 sage_attention-torch-ext> -- Capabilities for kernel _qattn_sm90: 9.0a sage_attention-torch-ext> -- Capabilities for kernel _qattn_sm80: 8.0 sage_attention-torch-ext> -- Capabilities for kernel _fused: 9.0a;8.0;8.9 -sage_attention-torch-ext> -- Capabilities for kernel _qattn: 9.0a;8.0;8.9 -sage_attention-torch-ext> -- Configuring done (9.5s) +sage_attention-torch-ext> -- Configuring done (9.4s) sage_attention-torch-ext> -- Generating done (0.0s) sage_attention-torch-ext> CMake Warning: sage_attention-torch-ext> Manually-specified variables were not used by the project: @@ -471,11 +471,11 @@ sage_attention-torch-ext> cmake: enabled parallel installing sage_attention-torch-ext> Running phase: buildPhase sage_attention-torch-ext> build flags: -j21 sage_attention-torch-ext> -- Capabilities for kernel _fused: 9.0a;8.0;8.9 -sage_attention-torch-ext> -- Capabilities for kernel _qattn_sm89: 8.9 sage_attention-torch-ext> -- Capabilities for kernel _qattn: 9.0a;8.0;8.9 +sage_attention-torch-ext> -- Capabilities for kernel _qattn_sm89: 8.9 sage_attention-torch-ext> -- Capabilities for kernel _qattn_sm80: 8.0 sage_attention-torch-ext> -- Capabilities for kernel _qattn_sm90: 9.0a -sage_attention-torch-ext> -- Configuring done (9.6s) +sage_attention-torch-ext> -- Configuring done (9.4s) sage_attention-torch-ext> -- Generating done (0.0s) sage_attention-torch-ext> CMake Warning: sage_attention-torch-ext> Manually-specified variables were not used by the project: @@ -507,12 +507,12 @@ sage_attention-torch-ext> cmake: enabled parallel building sage_attention-torch-ext> cmake: enabled parallel installing 
sage_attention-torch-ext> Running phase: buildPhase sage_attention-torch-ext> build flags: -j21 -sage_attention-torch-ext> -- Capabilities for kernel _qattn_sm90: 9.0a sage_attention-torch-ext> -- Capabilities for kernel _qattn: 9.0a;8.0;8.9 -sage_attention-torch-ext> -- Capabilities for kernel _fused: 9.0a;8.0;8.9 -sage_attention-torch-ext> -- Capabilities for kernel _qattn_sm89: 8.9 +sage_attention-torch-ext> -- Capabilities for kernel _qattn_sm90: 9.0a sage_attention-torch-ext> -- Capabilities for kernel _qattn_sm80: 8.0 -sage_attention-torch-ext> -- Configuring done (9.7s) +sage_attention-torch-ext> -- Capabilities for kernel _qattn_sm89: 8.9 +sage_attention-torch-ext> -- Capabilities for kernel _fused: 9.0a;8.0;8.9 +sage_attention-torch-ext> -- Configuring done (9.5s) sage_attention-torch-ext> -- Generating done (0.0s) sage_attention-torch-ext> CMake Warning: sage_attention-torch-ext> Manually-specified variables were not used by the project: @@ -545,11 +545,11 @@ sage_attention-torch-ext> cmake: enabled parallel installing sage_attention-torch-ext> Running phase: buildPhase sage_attention-torch-ext> build flags: -j21 sage_attention-torch-ext> -- Capabilities for kernel _qattn_sm89: 8.9 +sage_attention-torch-ext> -- Capabilities for kernel _qattn_sm80: 8.0 sage_attention-torch-ext> -- Capabilities for kernel _fused: 9.0a;8.0;8.9 sage_attention-torch-ext> -- Capabilities for kernel _qattn: 9.0a;8.0;8.9 -sage_attention-torch-ext> -- Capabilities for kernel _qattn_sm80: 8.0 sage_attention-torch-ext> -- Capabilities for kernel _qattn_sm90: 9.0a -sage_attention-torch-ext> -- Configuring done (9.8s) +sage_attention-torch-ext> -- Configuring done (9.9s) sage_attention-torch-ext> -- Generating done (0.0s) sage_attention-torch-ext> CMake Warning: sage_attention-torch-ext> Manually-specified variables were not used by the project: @@ -581,3652 +581,2496 @@ sage_attention-torch-ext> cmake: enabled parallel building sage_attention-torch-ext> cmake: enabled parallel 
installing sage_attention-torch-ext> Running phase: buildPhase sage_attention-torch-ext> build flags: -j21 -sage_attention-torch-ext> [1/12] Building CXX object CMakeFiles/_sage_attention_44b112f_dirty.dir/torch-ext/torch_binding.cpp.o -sage_attention-torch-ext> [1/12] Building CXX object CMakeFiles/_sage_attention_44b112f_dirty.dir/torch-ext/torch_binding.cpp.o -sage_attention-torch-ext> [1/12] Building CXX object CMakeFiles/_sage_attention_44b112f_dirty.dir/torch-ext/torch_binding.cpp.o -sage_attention-torch-ext> [1/12] Building CXX object CMakeFiles/_sage_attention_44b112f_dirty.dir/torch-ext/torch_binding.cpp.o -sage_attention-torch-ext> [1/12] Building CXX object CMakeFiles/_sage_attention_44b112f_dirty.dir/torch-ext/torch_binding.cpp.o -sage_attention-torch-ext> [2/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_attn.cu.o +sage_attention-torch-ext> [1/12] Building CXX object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/torch-ext/torch_binding.cpp.o +sage_attention-torch-ext> [1/12] Building CXX object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/torch-ext/torch_binding.cpp.o +sage_attention-torch-ext> [1/12] Building CXX object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/torch-ext/torch_binding.cpp.o +sage_attention-torch-ext> [1/12] Building CXX object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/torch-ext/torch_binding.cpp.o +sage_attention-torch-ext> [1/12] Building CXX object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/torch-ext/torch_binding.cpp.o +sage_attention-torch-ext> [2/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_attn.cu.o sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used -sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4] +sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4] 
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 8 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 441.985 ms +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 8 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 435.474 ms +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> 
ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 542.186 ms +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 
537.633 ms +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 556.850 ms +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 
bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 558.532 ms +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads +sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 525.842 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads +sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads 
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 527.640 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 453.136 ms +sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 
barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 455.551 ms +sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 560.086 ms +sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 566.771 ms +sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 451.886 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : 
Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 452.820 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads +sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 563.157 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads
+sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 552.506 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 223.644 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 223.313 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 228.182 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 228.007 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 221.210 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 221.780 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 228.300 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 227.566 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 241 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 245.498 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 241 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 244.960 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 249.012 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 249.548 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 247 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 244.845 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 247 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 244.634 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 251.408 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 250.002 ms
-sage_attention-torch-ext> [2/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu.o
+sage_attention-torch-ext> [2/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_attn_inst_buf.cu.o
sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
-sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(627): warning #177-D: variable "padded_kv_len" was declared but never referenced
-sage_attention-torch-ext> int qo_len, kv_len, padded_kv_len, num_qo_heads, num_kv_heads;
-sage_attention-torch-ext> ^
-sage_attention-torch-ext>
-sage_attention-torch-ext> Remark: The warnings can be suppressed with "-diag-suppress "
-sage_attention-torch-ext>
-sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(803): warning #177-D: variable "padded_kv_len" was declared but never referenced
-sage_attention-torch-ext> int qo_len, kv_len, padded_kv_len, num_qo_heads, num_kv_heads;
-sage_attention-torch-ext> ^
-sage_attention-torch-ext>
-sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(170): warning #177-D: variable "sO" was declared but never referenced
-sage_attention-torch-ext> half *sO = (half*)smem_;
-sage_attention-torch-ext> ^
-sage_attention-torch-ext>
-sage_attention-torch-ext> ptxas info    : 10 bytes gmem
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info    : 11 bytes gmem, 88 bytes cmem[4]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 589.615 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 587.369 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 76 bytes spill stores, 44 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 593.451 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 76 bytes spill stores, 44 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 593.487 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 585.325 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 584.058 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 586.599 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 585.711 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 487.314 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 482.640 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 617.863 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 621.067 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 497.092 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 490.878 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 618.598 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 620.754 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 168 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 315.800 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 168 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 313.400 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 321.900 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 323.623 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 315.497 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 314.046 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 321.355 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 323.571 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 338.518 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> 
ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 335.776 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' 
for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 342.095 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 344.389 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes 
stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 336.715 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes 
stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 338.056 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 343.889 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 341.851 ms +sage_attention-torch-ext> [2/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_attn_inst_buf.cu.o +sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used +sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' 
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 68 bytes spill stores, 52 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 68 bytes spill stores, 52 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 8 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 
bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 8 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 84 bytes spill stores, 56 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 84 bytes spill stores, 56 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 20 bytes spill stores, 16 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 20 bytes spill stores, 16 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] 
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 36 bytes spill stores, 32 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes 
cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 36 bytes spill stores, 32 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' 
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info
: Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> [3/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_attn.cu.o +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> [3/12] Building CUDA object 
CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f16_attn_inst_buf.cu.o sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 8 bytes stack frame, 8 bytes spill stores, 16 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 8 bytes stack frame, 8 bytes spill stores, 16 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 16 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 16 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] 
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] 
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas 
info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf 
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> [3/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f16_fuse_v_scale_attn_inst_buf.cu.o -sage_attention-torch-ext> nvcc warning : incompatible 
redefinition for option 'threads', the last value of this option was used
-sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 56 bytes stack frame, 72 bytes spill stores, 56 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 586.615 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 56 bytes stack frame, 72 bytes spill stores, 56 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 588.685 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 592.250 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 593.767 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 574.520 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 573.749 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 586.361 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 582.063 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 168 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 522.972 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 168 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 516.314 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 629.345 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 626.246 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 502.294 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 503.470 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 619.844 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 616.534 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 251 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 238.306 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 251 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 242.621 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 250.968 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 249.797 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 240.832 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 242.605 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 250.794 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 245.869 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 262.329 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 262.374 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 270.911 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 270.832 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 263.277 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 263.914 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 272.102 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 269.136 ms
-sage_attention-torch-ext> [4/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_attn.cu.o
-sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
-sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling 
entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf 
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 241 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 241 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf 
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 247 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 247 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 
'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> [5/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f16_attn_inst_buf.cu.o -sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used -sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 264 bytes stack frame, 268 bytes spill stores, 280 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 264 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 264 bytes stack frame, 268 bytes spill stores, 280 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 264 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 288 bytes stack frame, 260 bytes spill stores, 268 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 288 bytes cumulative stack size, 488 bytes cmem[0], 8 
bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 288 bytes stack frame, 260 bytes spill stores, 268 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 288 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 256 bytes stack frame, 264 bytes spill stores, 264 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 256 bytes stack frame, 264 bytes spill stores, 264 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 272 bytes stack frame, 256 bytes spill stores, 260 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 272 bytes stack frame, 256 bytes spill stores, 260 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes 
cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 272 bytes stack frame, 296 bytes spill stores, 300 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 272 bytes stack frame, 296 bytes spill stores, 300 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 272 bytes stack frame, 232 bytes spill stores, 248 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 272 bytes stack frame, 232 bytes spill stores, 248 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 272 bytes stack frame, 292 bytes spill stores, 292 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 
bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 272 bytes stack frame, 292 bytes spill stores, 292 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 256 bytes stack frame, 212 bytes spill stores, 224 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 256 bytes stack frame, 212 bytes spill stores, 224 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 228 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 228 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> [2/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu.o
-sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
-sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(627): warning #177-D: variable "padded_kv_len" was declared but never referenced
-sage_attention-torch-ext> int qo_len, kv_len, padded_kv_len, num_qo_heads, num_kv_heads;
-sage_attention-torch-ext> ^
-sage_attention-torch-ext>
-sage_attention-torch-ext> Remark: The warnings can be suppressed with "-diag-suppress "
-sage_attention-torch-ext>
-sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(803): warning #177-D: variable "padded_kv_len" was declared but never referenced
-sage_attention-torch-ext> int qo_len, kv_len, padded_kv_len, num_qo_heads, num_kv_heads;
-sage_attention-torch-ext> ^
-sage_attention-torch-ext>
-sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(170): warning #177-D: variable "sO" was declared but never referenced
-sage_attention-torch-ext> half *sO = (half*)smem_;
-sage_attention-torch-ext> ^
-sage_attention-torch-ext>
-sage_attention-torch-ext> ptxas info : 10 bytes gmem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for
_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 264 bytes stack frame, 268 bytes spill stores, 280 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 264 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 264 bytes stack frame, 268 bytes spill stores, 280 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 264 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 288 bytes stack frame, 260 bytes spill stores, 268 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 288 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 288 bytes stack frame, 260 bytes spill stores, 268 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 288 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes 
cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 256 bytes stack frame, 264 bytes spill stores, 264 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 256 bytes stack frame, 264 bytes spill stores, 264 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 272 bytes stack frame, 256 bytes spill stores, 260 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 272 bytes stack frame, 256 bytes spill stores, 260 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 272 bytes stack frame, 296 bytes spill stores, 300 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 
bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 272 bytes stack frame, 296 bytes spill stores, 300 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 272 bytes stack frame, 232 bytes spill stores, 248 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 272 bytes stack frame, 232 bytes spill stores, 248 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 272 bytes stack frame, 292 bytes spill stores, 292 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 272 bytes stack frame, 292 bytes spill stores, 292 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes 
cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 256 bytes stack frame, 212 bytes spill stores, 224 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 256 bytes stack frame, 212 bytes spill stores, 224 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry 
function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 228 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 228 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info 
: Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes 
cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info 
: Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf 
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> [3/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_attn_inst_buf.cu.o +sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 488 bytes cmem[0], 
8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> [3/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f16_fuse_v_scale_attn_inst_buf.cu.o sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4] -sage_attention-torch-ext> ptxas info : Compiling 
entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 24 bytes spill stores, 16 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 24 bytes spill stores, 16 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 88 bytes spill stores, 48 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 88 bytes spill stores, 48 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 24 bytes stack frame, 36 bytes spill stores, 20 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 24 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes 
cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 24 bytes stack frame, 36 bytes spill stores, 20 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 24 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] 
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes 
cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 264 bytes stack frame, 276 bytes spill stores, 280 bytes spill loads 
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 264 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 264 bytes stack frame, 276 bytes spill stores, 280 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 264 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 288 bytes stack frame, 268 bytes spill stores, 268 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 288 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' 
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 288 bytes stack frame, 268 bytes spill stores, 268 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 288 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 256 bytes stack frame, 272 bytes spill stores, 276 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 256 bytes stack frame, 272 bytes spill stores, 276 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 
barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 272 bytes stack frame, 264 bytes spill stores, 264 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 272 bytes stack frame, 264 bytes spill stores, 264 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 224 bytes stack frame, 252 bytes spill stores, 248 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 224 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 224 bytes stack frame, 252 bytes spill stores, 248 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 224 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 256 bytes stack frame, 216 bytes spill stores, 232 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 
bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext>     256 bytes stack frame, 216 bytes spill stores, 232 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext>     216 bytes stack frame, 244 bytes spill stores, 236 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 216 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext>     216 bytes stack frame, 244 bytes spill stores, 236 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 216 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext>     256 bytes stack frame, 212 bytes spill stores, 224 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext>     256 bytes stack frame, 212 bytes spill stores, 224 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 231 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 231 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 238 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 238 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> [4/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/fused/fused.cu.o
+sage_attention-torch-ext> ptxas info    : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> [2/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/fused/fused.cu.o
 sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
 sage_attention-torch-ext> ptxas info    : 11 bytes gmem
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info    : Used 32 registers, used 1 barriers, 392 bytes smem
-sage_attention-torch-ext> ptxas info    : Compile time = 52.664 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 33.163 ms
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info    : Used 32 registers, used 1 barriers, 392 bytes smem
-sage_attention-torch-ext> ptxas info    : Compile time = 47.716 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 29.798 ms
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info    : Used 32 registers, used 1 barriers, 260 bytes smem
-sage_attention-torch-ext> ptxas info    : Compile time = 41.273 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 25.454 ms
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info    : Used 32 registers, used 1 barriers, 260 bytes smem
-sage_attention-torch-ext> ptxas info    : Compile time = 38.776 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 24.426 ms
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info    : Used 21 registers, used 1 barriers, 32768 bytes smem
-sage_attention-torch-ext> ptxas info    : Compile time = 7.406 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 4.628 ms
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info    : Used 19 registers, used 1 barriers, 16384 bytes smem
-sage_attention-torch-ext> ptxas info    : Compile time = 7.111 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 4.532 ms
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info    : Used 21 registers, used 1 barriers, 32768 bytes smem
-sage_attention-torch-ext> ptxas info    : Compile time = 7.007 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 4.466 ms
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info    : Used 19 registers, used 1 barriers, 16384 bytes smem
-sage_attention-torch-ext> ptxas info    : Compile time = 7.035 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 4.505 ms
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info    : Used 16 registers, used 0 barriers
-sage_attention-torch-ext> ptxas info    : Compile time = 6.928 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 4.399 ms
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info    : Used 16 registers, used 0 barriers
-sage_attention-torch-ext> ptxas info    : Compile time = 6.485 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 4.402 ms
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info    : Used 20 registers, used 0 barriers
-sage_attention-torch-ext> ptxas info    : Compile time = 5.804 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 3.699 ms
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info    : Used 20 registers, used 0 barriers
-sage_attention-torch-ext> ptxas info    : Compile time = 5.653 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 3.664 ms
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info    : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info    : Compile time = 11.750 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 7.805 ms
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info    : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info    : Compile time = 12.113 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 7.824 ms
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info    : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info    : Compile time = 11.936 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 7.822 ms
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info    : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info    : Compile time = 12.215 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 7.834 ms
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info    : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info    : Compile time = 11.871 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 7.798 ms
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info    : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info    : Compile time = 11.819 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 7.737 ms
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info    : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info    : Compile time = 11.988 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 7.737 ms
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info    : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info    : Compile time = 11.837 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 7.741 ms
 sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
 sage_attention-torch-ext> ptxas info    : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0
bytes spill loads sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 20.280 ms +sage_attention-torch-ext> ptxas info : Compile time = 12.341 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 29 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 14.116 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.939 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 13.736 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.936 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 29 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info 
: Compile time = 13.448 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.904 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 18.929 ms +sage_attention-torch-ext> ptxas info : Compile time = 12.040 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 29 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 13.647 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.790 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 13.319 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.670 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 29 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 13.264 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.754 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 17.227 ms +sage_attention-torch-ext> ptxas info : Compile time = 11.051 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 12.880 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.485 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties 
for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 12.378 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.890 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 12.197 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.861 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 16.636 ms +sage_attention-torch-ext> ptxas info : Compile time = 10.674 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 11.659 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.790 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 11.735 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.746 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 11.882 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.777 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 17.822 ms 
+sage_attention-torch-ext> ptxas info : Compile time = 11.445 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 12.403 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.181 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 12.426 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.240 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 12.251 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.114 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 18.305 ms +sage_attention-torch-ext> ptxas info : Compile time = 11.251 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 12.513 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.052 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 12.229 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.064 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for 
_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 12.345 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.040 ms sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4] sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 22.466 ms +sage_attention-torch-ext> ptxas info : Compile time = 22.917 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 20.754 ms +sage_attention-torch-ext> ptxas info : Compile time = 21.303 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill 
stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 16.622 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 16.662 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 3.590 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 20 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 3.369 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 
'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 3.286 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 20 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 3.288 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 3.194 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0] -sage_attention-torch-ext> 
ptxas info : Compile time = 3.170 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers, 412 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 2.785 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers, 412 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 2.792 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.000 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack 
frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.105 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.117 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.127 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.123 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.698 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.295 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.133 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes 
spill loads -sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 10.679 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 8.073 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 8.101 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 8.066 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 10.673 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 8.110 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 8.077 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 8.057 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 9.517 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.116 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.066 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.118 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 9.597 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.028 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.165 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.086 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 9.971 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.258 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.246 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.285 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 9.975 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.326 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.259 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.217 ms
-sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 39 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 29.770 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 39 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 27.921 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 22.255 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 16.938 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 22.250 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 17.139 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 3.477 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 3.465 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 3.404 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Used 20 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 3.333 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 3.402 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 3.264 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 3.398 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Used 20 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 3.236 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 3.334 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Compile time = 3.177 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 3.262 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Compile time = 3.136 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 18 registers, used 0 barriers, 412 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 2.805 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers, 412 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 2.730 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 18 registers, used 0 barriers, 412 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 2.876 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers, 412 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 2.698 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.098 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 7.144 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.180 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Compile time = 7.094 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.230 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 7.185 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.119 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Compile time = 7.197 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.241 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 7.164 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.773 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Compile time = 7.704 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.295 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 7.299 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.169 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Compile time = 7.152 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 12.442 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Compile time = 10.838 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 8.177 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 8.200 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 8.164 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Compile time = 8.188 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 8.109 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 8.118 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 10.826 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 8.233 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 8.210 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 8.194 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 11.782 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 8.427 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.885 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 8.938 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 11.257 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 8.465 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 8.747 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.217 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 10.086 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.344 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.385 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.296 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 10.133 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.429 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.385 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.303 ms -sage_attention-torch-ext> [2/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_attn.cu.o -sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used -sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 8 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 631.269 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 8 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 678.510 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 889.763 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 842.414 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 848.507 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads -sage_attention-torch-ext> 
ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 860.566 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 852.066 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 857.813 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 798.712 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 814.182 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 766.806 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 563.761 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 460.277 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 462.993 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 565.805 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 557.043 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 226.177 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 223.149 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 228.610 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, 
used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 229.459 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 224.016 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 223.754 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 231.030 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 231.032 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 241 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 248.149 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 241 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 248.080 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes 
cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 253.757 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 254.788 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 10.921 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj 
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 247 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 257.808 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 8.223 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 247 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 248.484 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 256.288 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 255.649 ms
-sage_attention-torch-ext> [4/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_attn.cu.o
-sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
-sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 8 bytes stack frame, 8 bytes spill stores, 16 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 8 bytes stack frame, 8 bytes spill stores, 16 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 16 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 16 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 8.230 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 8.249 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 9.688 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 7.245 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 7.220 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 7.298 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 9.794 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 7.254 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> [3/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_attn.cu.o
-sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
-sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 12 bytes spill stores, 20 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 798.894 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 12 bytes spill stores, 20 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 793.628 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 20 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 737.636 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 20 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 579.885 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 582.862 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 569.953 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 564.402 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 562.262 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 487.942 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 539.907 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 56 bytes spill stores, 48 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 618.986 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 56 bytes spill stores, 48 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 629.778 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 496.290 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 494.479 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 609.734 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 599.355 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.338 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 239.366 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.208 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 10.283 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 241.087 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 255.448 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 248.353 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.479 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 239.158 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.457 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 
bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 242.570 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 247.500 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 248.915 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.433 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 265.784 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 10.211 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 265.127 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 269.930 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> 
ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 284.631 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.480 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 319.602 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.393 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 259.218 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 268.477 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 266.415 ms -sage_attention-torch-ext> [4/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_fuse_v_mean_attn.cu.o -sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used +sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.393 ms sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 44 bytes spill loads -sage_attention-torch-ext> 
ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 991.616 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 44 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 805.593 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 587.577 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 582.580 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 861.558 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 573.405 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 570.518 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 572.015 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 501.662 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 502.550 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 48 bytes spill stores, 52 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 621.856 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 48 bytes spill stores, 52 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 617.546 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 506.660 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 500.017 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 617.184 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 608.361 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> 
ptxas info : Used 245 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 244.249 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 39 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 30.117 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 245 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 250.320 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes 
spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 294.735 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 256.622 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 39 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 28.478 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function 
properties for _Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 250.522 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 22.777 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 248.401 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 254.182 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 259.771 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0] 
+sage_attention-torch-ext> ptxas info : Compile time = 22.734 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 275.067 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 3.503 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 3.364 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 3.330 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 3.313 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 3.250 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 16 
registers, used 0 barriers, 412 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 3.214 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 18 registers, used 0 barriers, 412 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 2.777 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 18 registers, used 0 barriers, 412 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 2.821 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.237 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.341 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.329 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.243 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] 
+sage_attention-torch-ext> ptxas info : Compile time = 7.286 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.833 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.308 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.235 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 10.949 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 8.196 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 272.300 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 276.028 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 278.538 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] 
+sage_attention-torch-ext> ptxas info : Compile time = 8.187 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 272.368 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 8.105 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 270.965 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 278.318 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 276.783 ms -sage_attention-torch-ext> [5/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_attn.cu.o -sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used -sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4] 
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 10.920 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 8.290 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 8.341 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 8.325 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 9.831 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 7.338 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 7.308 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 7.339 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 9.887 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 241 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 7.336 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 241 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 7.359 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 247 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 7.343 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 247 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> [5/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_attn_inst_buf.cu.o
-sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
-sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 950.424 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 938.339 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 8 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 942.385 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 8 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 951.773 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 84 bytes spill stores, 56 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 941.110 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 84 bytes spill stores, 56 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 948.921 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 953.087 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 937.847 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 796.084 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 790.276 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 1005.200 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 1014.114 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 36 bytes spill stores, 32 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 792.773 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 36 bytes spill stores, 32 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 810.731 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 1016.541 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 1015.399 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 10.267 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.465 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info 
: Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 512.495 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.501 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 519.312 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function 
properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 459.927 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 403.606 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 512.221 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 498.943 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 451.517 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 448.368 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.414 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] 
-sage_attention-torch-ext> ptxas info : Compile time = 558.783 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 10.287 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 556.302 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 552.468 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 548.704 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.495 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 549.656 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.463 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 548.578 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 554.915 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 554.169 ms -sage_attention-torch-ext> [5/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/fused/fused.cu.o +sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 7.366 ms +sage_attention-torch-ext> [4/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/fused/fused.cu.o sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used -sage_attention-torch-ext> ptxas info : 11 bytes gmem +sage_attention-torch-ext> ptxas info : 
10 bytes gmem sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 392 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 33.055 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 392 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 29.879 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 25.412 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 
barriers, 260 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 24.254 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 4.621 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 4.517 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 4.419 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 4.453 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers
-sage_attention-torch-ext> ptxas info : Compile time = 4.394 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers
-sage_attention-torch-ext> ptxas info : Compile time = 4.362 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers
-sage_attention-torch-ext> ptxas info : Compile time = 3.675 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers
-sage_attention-torch-ext> ptxas info : Compile time = 3.641 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 7.795 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 7.833 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 7.808 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 7.779 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 7.779 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 7.693 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 7.811 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 7.727 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 12.304 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 29 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 8.983 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 8.975 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 29 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 8.919 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 12.034 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 29 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 8.806 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 8.660 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 29 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 8.769 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 11.120 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 8.379 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 7.839 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 7.843 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 10.634 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 7.796 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 7.711 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 7.701 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 11.606 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 8.199 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 8.214 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 8.120 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 11.220 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 8.031 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 8.065 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 20 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 20 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers, 412 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers, 412 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1
barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes 
smem -sage_attention-torch-ext> ptxas info : Compile time = 8.001 ms -sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 32.115 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 39 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 30.290 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 39 
registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 23.510 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 24.156 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 
bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 4.940 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 20 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 4.758 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 4.801 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_80' 
+sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 20 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 4.800 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 4.554 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 4.284 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers, 412 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 3.852 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 18 registers, used 0 barriers, 412 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers, 412 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 3.773 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 18 registers, used 0 barriers, 412 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' 
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 9.896 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 9.918 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 10.213 ms 
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 9.558 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 9.731 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> 
ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 10.301 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 9.800 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 9.677 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 15.202 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 11.401 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj 
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 11.767 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 11.408 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 15.133 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' 
+sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 11.391 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 11.290 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for 
_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 11.324 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 13.323 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> 
ptxas info : Compile time = 9.568 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 9.806 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 9.917 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for 
_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 13.234 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 9.500 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 10.014 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 9.819 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 13.673 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for 
_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 10.071 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 9.839 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 9.729 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 13.484 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 9.911 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes 
spill loads -sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 9.806 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 9.552 ms +sage_attention-torch-ext> [3/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f16_attn_inst_buf.cu.o +sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties 
for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 88 bytes stack frame, 156 bytes spill stores, 148 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 88 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 575.531 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 88 bytes stack frame, 156 bytes spill stores, 148 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 88 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 573.134 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 96 bytes spill stores, 92 bytes spill loads 
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 477.685 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 96 bytes spill stores, 92 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 475.948 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 32 bytes spill stores, 24 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 549.043 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 32 bytes spill stores, 24 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 547.290 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 20 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 554.362 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 20 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 552.915 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 84 bytes spill stores, 76 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 480.589 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 84 bytes spill stores, 76 bytes spill loads 
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 485.373 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 609.560 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 961.641 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 164 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 839.593 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 164 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 833.012 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 112 bytes stack frame, 136 bytes spill stores, 128 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 1012.245 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 112 bytes stack frame, 136 bytes spill stores, 128 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 1012.331 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 39 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 30.458 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 389.256 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 387.388 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 
bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 393.773 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 397.407 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 387.125 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 384.227 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 392.006 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 395.280 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 39 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 28.607 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 429.918 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 22.856 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 410.430 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info 
: Compile time = 417.292 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 419.824 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 22.959 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj
+sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 405.015 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 3.522 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj
+sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 404.581 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 402.388 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 387.610 ms
+sage_attention-torch-ext> [5/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu.o
+sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
+sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(627): warning #177-D: variable "padded_kv_len" was declared but never referenced
+sage_attention-torch-ext> int qo_len, kv_len, padded_kv_len, num_qo_heads, num_kv_heads;
+sage_attention-torch-ext> ^
+sage_attention-torch-ext>
+sage_attention-torch-ext> Remark: The warnings can be suppressed with "-diag-suppress "
+sage_attention-torch-ext>
+sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(803): warning #177-D: variable "padded_kv_len" was declared but never referenced
+sage_attention-torch-ext> int qo_len, kv_len, padded_kv_len, num_qo_heads, num_kv_heads;
+sage_attention-torch-ext> ^
+sage_attention-torch-ext>
+sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(170): warning #177-D: variable "sO" was declared but never referenced
+sage_attention-torch-ext> half *sO = (half*)smem_;
+sage_attention-torch-ext> ^
+sage_attention-torch-ext>
+sage_attention-torch-ext> ptxas info : 10 bytes gmem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 3.464 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 3.487 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 3.405 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 3.377 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 3.360 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 18 registers, used 0 barriers, 412 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 2.899 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 18 registers, used 0 barriers, 412 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 2.911 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.329 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.396 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.795 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.739 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.421 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.937 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.493 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.361 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 11.107 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 8.381 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 8.306 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 9.143 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for
_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 11.281 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : 
Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 8.424 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' 
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 8.406 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 8.362 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack 
frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 9.822 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem 
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.583 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.380 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 
'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.357 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 9.895 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time 
= 7.353 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.400 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes 
stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.508 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem 
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 10.780 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> [4/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_attn.cu.o +sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used +sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 12 bytes spill stores, 20 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 472.368 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 12 bytes spill stores, 20 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 468.433 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 20 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 576.750 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 20 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 571.920 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 560.549 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 564.460 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 557.690 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 560.404 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 479.494 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 485.787 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 56 bytes spill stores, 48 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 606.847 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 56 bytes spill stores, 48 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 604.332 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 483.856 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 482.782 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 593.808 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 586.016 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.433 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 236.466 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.559 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 235.744 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 241.740 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 242.647 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.439 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 236.979 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 10.336 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 235.385 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 243.210 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 247.594 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.564 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 261.556 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.563 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 260.161 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 268.085 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 265.489 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.546 ms
-sage_attention-torch-ext> [6/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu.o
+sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 259.951 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 258.783 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 267.131 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 263.264 ms
+sage_attention-torch-ext> [3/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu.o
 sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
 sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(627): warning #177-D: variable "padded_kv_len" was declared but never referenced
 sage_attention-torch-ext> int qo_len, kv_len, padded_kv_len, num_qo_heads, num_kv_heads;
@@ -4247,2500 +3091,3873 @@ sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_
 sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 254.401 ms
+sage_attention-torch-ext> ptxas info : Compile time = 279.184 ms
 sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
 sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 258.197 ms
+sage_attention-torch-ext> ptxas info : Compile time = 271.092 ms
 sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
 sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 292.728 ms
+sage_attention-torch-ext> ptxas info : Compile time = 280.604 ms
 sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
 sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 288.496 ms
+sage_attention-torch-ext> ptxas info : Compile time = 283.624 ms
 sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
 sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 259.609 ms
+sage_attention-torch-ext> ptxas info : Compile time = 270.122 ms
 sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
 sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 257.610 ms
+sage_attention-torch-ext> ptxas info : Compile time = 269.930 ms
 sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
 sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 310.313 ms
+sage_attention-torch-ext> ptxas info : Compile time = 281.979 ms
 sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
 sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 261.558 ms
+sage_attention-torch-ext> ptxas info : Compile time = 278.125 ms
 sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
 sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 305.399 ms
+sage_attention-torch-ext> ptxas info : Compile time = 287.994 ms
 sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
 sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 290.890 ms
+sage_attention-torch-ext> ptxas info : Compile time = 285.258 ms
 sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
 sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 321.909 ms
+sage_attention-torch-ext> ptxas info : Compile time = 295.081 ms
 sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
 sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 268.549 ms
+sage_attention-torch-ext> ptxas info : Compile time = 291.477 ms
 sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
 sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 305.193 ms
+sage_attention-torch-ext> ptxas info : Compile time = 281.872 ms
 sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
 sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 323.257 ms
+sage_attention-torch-ext> ptxas info : Compile time = 284.176 ms
 sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
 sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 306.249 ms
+sage_attention-torch-ext> ptxas info : Compile time = 295.775 ms
 sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
 sage_attention-torch-ext> ptxas info : Function properties for
_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 328.274 ms +sage_attention-torch-ext> ptxas info : Compile time = 291.576 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 198.513 ms +sage_attention-torch-ext> ptxas info : Compile time = 175.895 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 195.701 ms +sage_attention-torch-ext> 
ptxas info : Compile time = 179.338 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 203.394 ms +sage_attention-torch-ext> ptxas info : Compile time = 181.282 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 206.831 ms +sage_attention-torch-ext> ptxas info : Compile time = 186.177 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 188.040 ms +sage_attention-torch-ext> ptxas info : Compile time = 176.846 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 177.604 ms +sage_attention-torch-ext> ptxas info : Compile time = 178.792 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 134.328 ms 
+sage_attention-torch-ext> ptxas info : Compile time = 181.990 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 110.982 ms +sage_attention-torch-ext> ptxas info : Compile time = 184.368 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 112.759 ms +sage_attention-torch-ext> ptxas info : Compile time = 186.550 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 112.951 ms +sage_attention-torch-ext> ptxas info : Compile time = 185.837 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 117.419 ms +sage_attention-torch-ext> ptxas info : Compile time = 197.424 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 119.115 
ms +sage_attention-torch-ext> ptxas info : Compile time = 193.863 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 113.767 ms +sage_attention-torch-ext> ptxas info : Compile time = 189.459 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 113.382 ms +sage_attention-torch-ext> ptxas info : Compile time = 187.229 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 117.939 ms +sage_attention-torch-ext> ptxas info : Compile time = 192.644 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 118.428 ms +sage_attention-torch-ext> ptxas info : Compile time = 195.045 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 
161.385 ms +sage_attention-torch-ext> ptxas info : Compile time = 255.230 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 162.059 ms +sage_attention-torch-ext> ptxas info : Compile time = 255.710 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 163.835 ms +sage_attention-torch-ext> ptxas info : Compile time = 262.885 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 165.521 ms +sage_attention-torch-ext> ptxas info : Compile time = 263.839 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 159.573 ms +sage_attention-torch-ext> ptxas info : Compile time = 256.653 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 159.492 ms +sage_attention-torch-ext> 
ptxas info : Compile time = 253.572 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 163.714 ms +sage_attention-torch-ext> ptxas info : Compile time = 263.145 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 164.815 ms +sage_attention-torch-ext> ptxas info : Compile time = 262.722 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 171.328 ms +sage_attention-torch-ext> ptxas info : Compile time = 266.776 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 166.435 ms +sage_attention-torch-ext> ptxas info : Compile time = 266.834 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 172.348 ms 
+sage_attention-torch-ext> ptxas info : Compile time = 274.130 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 172.724 ms +sage_attention-torch-ext> ptxas info : Compile time = 272.791 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 168.493 ms +sage_attention-torch-ext> ptxas info : Compile time = 271.980 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 166.818 ms +sage_attention-torch-ext> ptxas info : Compile time = 265.778 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 173.286 ms +sage_attention-torch-ext> ptxas info : Compile time = 275.642 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 
173.009 ms
+sage_attention-torch-ext> ptxas info : Compile time = 278.223 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 105.165 ms
+sage_attention-torch-ext> ptxas info : Compile time = 172.490 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 107.053 ms
+sage_attention-torch-ext> ptxas info : Compile time = 176.868 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 112.071 ms
+sage_attention-torch-ext> ptxas info : Compile time = 179.363 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 110.342 ms
+sage_attention-torch-ext> ptxas info : Compile time = 182.724 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 106.804 ms
+sage_attention-torch-ext> ptxas info : Compile time = 172.738 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 106.230 ms
+sage_attention-torch-ext> ptxas info : Compile time = 176.483 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 111.573 ms
+sage_attention-torch-ext> ptxas info : Compile time = 179.536 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 110.108 ms
+sage_attention-torch-ext> ptxas info : Compile time = 181.599 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 113.093 ms
+sage_attention-torch-ext> ptxas info : Compile time = 184.437 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 113.517 ms
+sage_attention-torch-ext> ptxas info : Compile time = 187.298 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 117.870 ms
+sage_attention-torch-ext> ptxas info : Compile time = 193.122 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 118.980 ms
+sage_attention-torch-ext> ptxas info : Compile time = 195.503 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 114.139 ms
+sage_attention-torch-ext> ptxas info : Compile time = 187.034 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 113.256 ms
+sage_attention-torch-ext> ptxas info : Compile time = 184.630 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 116.641 ms
+sage_attention-torch-ext> ptxas info : Compile time = 191.288 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 115.081 ms
-sage_attention-torch-ext> [6/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_fuse_v_mean_attn.cu.o
+sage_attention-torch-ext> ptxas info : Compile time = 189.145 ms
+sage_attention-torch-ext> [4/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu.o
sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
-sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 20 bytes spill stores, 20 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 20 bytes spill stores, 20 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(627): warning #177-D: variable "padded_kv_len" was declared but never referenced
+sage_attention-torch-ext> int qo_len, kv_len, padded_kv_len, num_qo_heads, num_kv_heads;
+sage_attention-torch-ext> ^
+sage_attention-torch-ext>
+sage_attention-torch-ext> Remark: The warnings can be suppressed with "-diag-suppress "
+sage_attention-torch-ext>
+sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(803): warning #177-D: variable "padded_kv_len" was declared but never referenced
+sage_attention-torch-ext> int qo_len, kv_len, padded_kv_len, num_qo_heads, num_kv_heads;
+sage_attention-torch-ext> ^
+sage_attention-torch-ext>
+sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(170): warning #177-D: variable "sO" was declared but never referenced
+sage_attention-torch-ext> half *sO = (half*)smem_;
+sage_attention-torch-ext> ^
+sage_attention-torch-ext>
+sage_attention-torch-ext> ptxas info : 10 bytes gmem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 245 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 245 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> [7/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_attn_inst_buf.cu.o -sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used -sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 964.900 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 958.181 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 76 bytes spill stores, 44 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 958.371 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 76 bytes spill stores, 44 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 958.394 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 939.503 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 701.023 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 573.680 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 801.568 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 813.002 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 808.011 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 1016.111 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 977.034 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 754.749 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 731.276 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 923.620 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 940.844 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 459.245 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 457.206 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 451.179 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 450.787 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 447.327 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 431.471 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 506.880 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 527.163 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 553.846 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 551.678 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 568.979 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 495.100 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 483.901 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 477.401 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 496.705 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 515.510 ms
-sage_attention-torch-ext> [6/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_attn_inst_buf.cu.o
-sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
-sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 966.091 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 961.337 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 76 bytes spill stores, 44 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 972.006 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 76 bytes spill stores, 44 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 974.650 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 957.506 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 960.363 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> [5/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_fuse_v_mean_attn.cu.o +sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used +sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function 
properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 
8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 971.507 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 970.596 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 807.611 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 802.233 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 1033.905 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 1024.056 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 20 bytes spill stores, 20 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 20 bytes spill stores, 20 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 810.552 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 796.703 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] 
-sage_attention-torch-ext> ptxas info : Compile time = 1004.351 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 1012.638 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 510.266 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 245 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties 
for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 512.618 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 245 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 521.439 ms 
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 524.050 ms
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 512.244 ms
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 508.562 ms
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 529.261 ms
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 519.651 ms
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 547.796 ms
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 543.208 ms
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 560.784 ms
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 556.368 ms
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 546.293 ms
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 550.874 ms
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 558.939 ms
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 549.771 ms
-sage_attention-torch-ext> [7/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f16_fuse_v_scale_attn_inst_buf.cu.o
+sage_attention-torch-ext> ptxas info    : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> [4/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f16_fuse_v_scale_attn_inst_buf.cu.o
sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
-sage_attention-torch-ext> ptxas info    : 10 bytes gmem, 80 bytes cmem[4]
+sage_attention-torch-ext> ptxas info    : 11 bytes gmem, 88 bytes cmem[4]
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 264 bytes stack frame, 276 bytes spill stores, 280 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 264 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> 56 bytes stack frame, 72 bytes spill stores, 56 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 866.249 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 264 bytes stack frame, 276 bytes spill stores, 280 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 264 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> 56 bytes stack frame, 72 bytes spill stores, 56 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 862.299 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 288 bytes stack frame, 268 bytes spill stores, 268 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 288 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> 80 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 873.782 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 288 bytes stack frame, 268 bytes spill stores, 268 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 288 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> 80 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 875.831 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 256 bytes stack frame, 272 bytes spill stores, 276 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 852.270 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 256 bytes stack frame, 272 bytes spill stores, 276 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 851.476 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 272 bytes stack frame, 264 bytes spill stores, 264 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 28 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 880.964 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 272 bytes stack frame, 264 bytes spill stores, 264 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 28 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 863.392 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 224 bytes stack frame, 252 bytes spill stores, 248 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 224 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 168 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 772.722 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 224 bytes stack frame, 252 bytes spill stores, 248 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 224 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 168 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 777.240 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 256 bytes stack frame, 216 bytes spill stores, 232 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 926.046 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info    : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 256 bytes stack frame, 216 bytes spill stores, 232 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 931.904 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 216 bytes stack frame, 244 bytes spill stores, 236 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 216 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> 64 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 749.924 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 216 bytes stack frame, 244 bytes spill 
stores, 236 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 216 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> 64 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 758.203 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 256 bytes stack frame, 212 bytes spill stores, 224 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 946.039 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 256 bytes stack frame, 212 bytes spill stores, 224 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 1016.690 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Used 251 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 405.534 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties 
for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Used 251 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 412.191 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 403.205 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 387.725 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 238 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 387.513 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 238 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 435.032 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 397.812 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 438.839 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 443.567 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 445.955 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 461.657 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 467.032 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 483.735 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 488.336 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 497.941 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> [8/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_attn_inst_buf.cu.o +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 474.535 ms +sage_attention-torch-ext> [6/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_attn_inst_buf.cu.o sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used -sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4] +sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4] sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads +sage_attention-torch-ext> 48 bytes stack frame, 68 bytes spill stores, 52 bytes spill loads 
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 551.911 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads +sage_attention-torch-ext> 48 bytes stack frame, 68 bytes spill stores, 52 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 541.946 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 8 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 553.795 ms 
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 8 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 561.058 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 48 bytes stack frame, 84 bytes spill stores, 56 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 548.291 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : 
Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 48 bytes stack frame, 84 bytes spill stores, 56 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 554.796 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads +sage_attention-torch-ext> 48 bytes stack frame, 20 bytes spill stores, 16 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 562.109 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf 
-sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads +sage_attention-torch-ext> 48 bytes stack frame, 20 bytes spill stores, 16 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 552.458 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 463.777 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 
bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 463.453 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 80 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 583.598 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 80 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 589.296 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 36 bytes spill stores, 32 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 465.519 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 36 bytes spill stores, 32 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 466.966 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 594.830 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 589.614 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 304.504 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 292.156 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 260.004 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 262.724 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 304.503 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 297.500 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 260.810 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 256.907 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 539.805 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 540.085 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 547.628 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 539.854 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 538.118 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 518.144 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 519.476 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 509.056 ms
-sage_attention-torch-ext> [6/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f16_fuse_v_scale_attn_inst_buf.cu.o
+sage_attention-torch-ext> [6/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f16_attn_inst_buf.cu.o
sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 264 bytes stack frame, 276 bytes spill stores, 280 bytes spill loads
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 264 bytes stack frame, 268 bytes spill stores, 280 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 264 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 264 bytes stack frame, 276 bytes spill stores, 280 bytes spill loads
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 264 bytes stack frame, 268 bytes spill stores, 280 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 264 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 288 bytes stack frame, 268 bytes spill stores, 268 bytes spill loads
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 288 bytes stack frame, 260 bytes spill stores, 268 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 288 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 288 bytes stack frame, 268 bytes spill stores, 268 bytes spill loads
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 288 bytes stack frame, 260 bytes spill stores, 268 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 288 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 256 bytes stack frame, 272 bytes spill stores, 276 bytes spill loads
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 256 bytes stack frame, 264 bytes spill stores, 264 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 256 bytes stack frame, 272 bytes spill stores, 276 bytes spill loads
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 256 bytes stack frame, 264 bytes spill stores, 264 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 272 bytes stack frame, 264 bytes spill stores, 264 bytes spill loads
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 272 bytes stack frame, 256 bytes spill stores, 260 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 272 bytes stack frame, 264 bytes spill stores, 264 bytes spill loads
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 272 bytes stack frame, 256 bytes spill stores, 260 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 224 bytes stack frame, 252 bytes spill stores, 248 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 224 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 224 bytes stack frame, 252 bytes spill stores, 248 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 224 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 256 bytes stack frame, 216 bytes spill stores, 232 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 256 bytes stack frame, 216 bytes spill stores, 232 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 216 bytes stack frame, 244 bytes spill stores, 236 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 216 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 216 bytes stack frame, 244 bytes spill stores, 236 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 216 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 272 bytes stack frame, 296 bytes spill stores, 300 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 272 bytes stack frame, 296 bytes spill stores, 300 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 272 bytes stack frame, 232 bytes spill stores, 248 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 272 bytes stack frame, 232 bytes spill stores, 248 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 272 bytes stack frame, 292 bytes spill stores, 292 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 272 bytes stack frame, 292 bytes spill stores, 292 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 256 bytes stack frame, 212 bytes spill stores, 224 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1
barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 256 bytes stack frame, 212 bytes spill stores, 224 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Compiling entry 
function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf 
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 228 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 228 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf 
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 
'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 488 
bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> [5/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_attn.cu.o +sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used +sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 8 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 733.104 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 8 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 731.085 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 918.992 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads +sage_attention-torch-ext> 
ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 918.653 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 915.256 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 912.595 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 909.402 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 909.714 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 772.525 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 775.748 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads 
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 967.721 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 961.232 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 775.724 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 780.124 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 950.656 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 954.345 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 375.330 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 381.202 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 390.741 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 388.969 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 238 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 375.466 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 238 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 384.207 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 383.898 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 388.396 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 241 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 422.907 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 241 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 419.306 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 424.313 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 429.593 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 247 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 419.164 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 247 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 420.692 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 425.393 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> [8/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_attn_inst_buf.cu.o
+sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 423.774 ms
+sage_attention-torch-ext> [7/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_fuse_v_mean_attn.cu.o
 sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
 sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 68 bytes spill stores, 52 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 68 bytes spill stores, 52 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 8 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 8 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 84 bytes spill stores, 56 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 84 bytes spill stores, 56 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 20 bytes spill stores, 16 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 20 bytes spill stores, 16 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 36 bytes spill stores, 32 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 36 bytes spill stores, 32 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 20 bytes spill stores, 20 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 20 bytes spill stores, 20 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] 
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] 
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Compiling 
entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 245 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas 
info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 245 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 
bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function 
properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 
255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> [7/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu.o
+sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> [5/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_attn_inst_buf.cu.o
sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
-sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(627): warning #177-D: variable "padded_kv_len" was declared but never referenced
-sage_attention-torch-ext> int qo_len, kv_len, padded_kv_len, num_qo_heads, num_kv_heads;
-sage_attention-torch-ext> ^
-sage_attention-torch-ext>
-sage_attention-torch-ext> Remark: The warnings can be suppressed with "-diag-suppress "
-sage_attention-torch-ext>
-sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(803): warning #177-D: variable "padded_kv_len" was declared but never referenced
-sage_attention-torch-ext> int qo_len, kv_len, padded_kv_len, num_qo_heads, num_kv_heads;
-sage_attention-torch-ext> ^
-sage_attention-torch-ext>
-sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(170): warning #177-D: variable "sO" was declared but never referenced
-sage_attention-torch-ext> half *sO = (half*)smem_;
-sage_attention-torch-ext> ^
-sage_attention-torch-ext>
-sage_attention-torch-ext> ptxas info : 11 bytes gmem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 250.137 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 245.084 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 267.574 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 296.208 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 260.850 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 249.064 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 270.699 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 286.529 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 272.624 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 307.901 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 280.865 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 303.936 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 258.298 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 294.248 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 323.737 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 295.934 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 198.053 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 192.406 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 198.137 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 196.772 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 191.623 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 191.442 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 179.258 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 160.342 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 109.085 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 109.712 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 115.319 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 115.636 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 110.774 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 110.079 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 113.757 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 113.820 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 993.379 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 983.666 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 76 bytes spill stores, 44 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 991.494 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 76 bytes spill stores, 44 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 995.815 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 984.680 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 977.430 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 996.712 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 997.032 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 817.076 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 810.474 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 1036.490 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 737.972 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 496.712 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 487.620 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads 
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 619.107 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 621.681 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 154.069 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 315.178 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 153.781 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes 
cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 157.918 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 158.807 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 314.475 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 321.573 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 322.633 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 153.605 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 353.122 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 152.656 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 157.839 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 158.870 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 310.499 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 321.386 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 321.051 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 160.435 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 337.406 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 160.455 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 165.937 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 165.331 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] 
+sage_attention-torch-ext> ptxas info : Compile time = 337.050 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 342.868 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 344.463 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 161.712 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 336.617 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 160.096 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 164.483 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 166.942 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 334.307 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 346.369 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 345.404 ms
+sage_attention-torch-ext> [6/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_fuse_v_mean_attn.cu.o
+sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
+sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 44 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 970.903 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 44 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 969.569 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 992.699 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 993.083 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 958.046 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 972.919 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 951.359 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 948.223 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 820.379 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 829.366 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 48 bytes spill stores, 52 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 1012.789 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 48 bytes spill stores, 52 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 1003.097 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 813.168 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 817.103 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 998.870 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 1007.322 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 102.455 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 245 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 390.714 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 102.423 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 105.992 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 106.119 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 245 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 394.228 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 413.533 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 408.229 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 102.173 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 387.351 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 102.529 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 107.337 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 106.419 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 379.782 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 380.296 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 381.719 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 108.601 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 417.380 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 109.895 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 120.216 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 114.583 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 426.731 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 456.481 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 460.196 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 110.804 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 412.414 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 108.320 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 111.462 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 110.934 ms
-sage_attention-torch-ext> [9/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f16_attn_inst_buf.cu.o
+sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 269.761 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 272.479 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0
bytes spill loads +sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 373.993 ms +sage_attention-torch-ext> [8/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_attn.cu.o sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 264 bytes stack frame, 268 bytes spill stores, 280 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 264 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 264 bytes stack frame, 268 bytes spill stores, 280 bytes 
spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 264 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 288 bytes stack frame, 260 bytes spill stores, 268 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 288 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 288 bytes stack frame, 260 bytes spill stores, 268 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 288 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 
'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 256 bytes stack frame, 264 bytes spill stores, 264 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 256 bytes stack frame, 264 bytes spill stores, 264 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 272 bytes stack frame, 256 bytes spill stores, 260 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, 
used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 272 bytes stack frame, 256 bytes spill stores, 260 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 272 bytes stack frame, 296 bytes spill stores, 300 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 272 bytes stack frame, 296 bytes spill stores, 300 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 272 bytes stack frame, 232 bytes spill stores, 248 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 272 bytes stack frame, 232 bytes spill stores, 248 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes 
cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 272 bytes stack frame, 292 bytes spill stores, 292 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 272 bytes stack frame, 292 bytes spill stores, 292 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 256 bytes stack frame, 212 bytes spill stores, 224 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 256 bytes stack frame, 212 bytes spill stores, 224 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : 
Used 230 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 228 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : 
Used 228 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] 
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes 
cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf 
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 241 registers, used 1 barriers, 488 bytes 
cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 241 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 
bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 247 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 247 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> 
ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> [7/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/fused/fused.cu.o +sage_attention-torch-ext> [9/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_attn.cu.o sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used -sage_attention-torch-ext> ptxas info : 10 bytes gmem -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 392 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 392 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj -sage_attention-torch-ext> 0 bytes 
stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : 
Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4] +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 8 bytes stack frame, 8 bytes spill stores, 16 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 8 bytes stack frame, 8 bytes spill stores, 16 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 16 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 16 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] 
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 29 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 29 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 29 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 29 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> [6/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu.o
+sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
+sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(627): warning #177-D: variable "padded_kv_len" was declared but never referenced
+sage_attention-torch-ext> int qo_len, kv_len, padded_kv_len, num_qo_heads, num_kv_heads;
+sage_attention-torch-ext> ^
+sage_attention-torch-ext>
+sage_attention-torch-ext> Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
+sage_attention-torch-ext>
+sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(803): warning #177-D: variable "padded_kv_len" was declared but never referenced
+sage_attention-torch-ext> int qo_len, kv_len, padded_kv_len, num_qo_heads, num_kv_heads;
+sage_attention-torch-ext> ^
+sage_attention-torch-ext>
+sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(170): warning #177-D: variable "sO" was declared but never referenced
+sage_attention-torch-ext> half *sO = (half*)smem_;
+sage_attention-torch-ext> ^
+sage_attention-torch-ext>
+sage_attention-torch-ext> ptxas info : 11 bytes gmem
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 287.845 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 288.839 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 295.844 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 294.602 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 285.738 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 288.852 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 291.369 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 291.160 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 298.561 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 295.403 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 304.758 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 303.732 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 291.363 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 294.677 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : 
Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 299.220 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 220.822 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 109.424 ms 
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 144.841 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 194.590 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 194.206 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 189.093 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' 
for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 188.492 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 196.098 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 195.004 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 197.927 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf 
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 196.450 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 200.944 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 
16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 214.660 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 199.900 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4] -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 197.293 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 205.149 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 211.099 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 271.682 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj +sage_attention-torch-ext> ptxas info 
: Used 167 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 269.586 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 281.733 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 277.306 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 269.288 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj +sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 274.055 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 276.263 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 276.553 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 286.596 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 20 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 280.377 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 291.432 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 294.755 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 281.571 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 20 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 280.015 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 294.676 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 289.791 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 183.860 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 182.850 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 194.201 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 188.360 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers, 412 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 183.835 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers, 412 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 182.459 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 194.673 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 188.311 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 190.778 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 192.092 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 206.914 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 205.892 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 192.388 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 192.225 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 202.459 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 200.399 ms
+sage_attention-torch-ext> [7/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_fuse_v_mean_attn.cu.o
+sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
+sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 44 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 989.903 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 44 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 975.144 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 994.561 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 1001.132 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 961.806 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 977.769 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 973.138 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 976.834 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 836.467 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 863.853 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 48 bytes spill stores, 52 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 1048.856 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 48 bytes spill stores, 52 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 1044.048 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 838.270 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 842.047 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 1035.935 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 769.188 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 245 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 240.795 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 245 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 244.068 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 248.699 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 250.391 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 243.355 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 249.877 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes 
cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 248.939 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 251.494 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas 
info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 268.645 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 265.193 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill 
loads +sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 268.712 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 270.373 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : 
Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 263.379 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 262.208 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 271.118 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 268.399 ms +sage_attention-torch-ext> [7/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f16_attn_inst_buf.cu.o +sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used +sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' 
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 88 bytes stack frame, 156 bytes spill stores, 148 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 88 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 983.256 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 88 bytes stack frame, 156 bytes spill stores, 148 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 88 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 988.467 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes 
stack frame, 96 bytes spill stores, 92 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 818.638 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 96 bytes spill stores, 92 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 807.287 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 32 bytes spill stores, 24 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 951.565 ms +sage_attention-torch-ext> ptxas info : 
Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 32 bytes spill stores, 24 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 953.206 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 20 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 957.961 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 20 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 960.994 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 84 bytes spill stores, 76 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 832.750 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 84 bytes spill stores, 76 bytes spill loads 
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 831.568 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 1073.839 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 1058.976 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 164 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 861.042 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 164 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 862.609 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 112 bytes stack frame, 136 bytes spill stores, 128 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 1053.474 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 112 bytes stack frame, 136 bytes spill stores, 128 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 1067.258 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 395.895 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 393.060 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 401.657 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 404.080 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 392.307 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 391.283 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 403.284 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 408.327 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 437.440 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 435.068 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 442.290 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 440.862 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 382.225 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 253.894 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 259.214 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 257.814 ms
+sage_attention-torch-ext> [8/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f16_fuse_v_scale_attn_inst_buf.cu.o
+sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
+sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 56 bytes stack frame, 72 bytes spill stores, 56 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 864.240 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 56 bytes stack frame, 72 bytes spill stores, 56 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 888.745 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 994.582 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 942.025 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 938.553 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 911.255 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 28 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 984.241 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 28 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 1004.147 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 168 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 939.994 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 168 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 900.768 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 948.356 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 633.607 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 753.372 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 850.671 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 1081.473 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 1069.612 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 251 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 389.144 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 251 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 420.446 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 428.036 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 426.218 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 414.409 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 416.887 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 428.684 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 423.411 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas
info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 425.723 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 418.213 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill 
loads +sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 465.548 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 468.465 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : 
Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 457.939 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 461.659 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 475.348 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 455.041 ms +sage_attention-torch-ext> [7/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_attn_inst_buf.cu.o +sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used +sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 
'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 24 bytes spill stores, 16 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 24 bytes spill stores, 16 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 88 bytes spill stores, 48 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 
barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 88 bytes spill stores, 48 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 24 bytes stack frame, 36 bytes spill stores, 20 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 24 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 24 bytes stack frame, 36 bytes spill stores, 20 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 24 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] 
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes 
cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 39 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 39 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> [8/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_attn.cu.o +sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used +sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 8 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 740.709 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 8 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 732.739 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 939.937 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads +sage_attention-torch-ext> 
ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 928.398 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 924.225 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 924.967 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 932.908 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 920.612 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 778.830 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 485.143 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads 
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 577.442 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 575.445 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 464.965 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 466.262 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 777.273 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 954.850 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj +sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 392.003 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' 
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj +sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 395.791 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 403.954 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 402.820 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj +sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 392.565 ms 
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 372.358 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] 
+sage_attention-torch-ext> ptxas info : Compile time = 392.180 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 398.389 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 241 registers, used 1 
barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 427.351 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 241 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 426.514 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes 
cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 431.372 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 432.750 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 18 registers, used 0 barriers, 412 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 247 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 420.279 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 18 registers, used 0 barriers, 412 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 247 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 414.583 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 389.395 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 414.831 ms
+sage_attention-torch-ext> [8/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f16_fuse_v_scale_attn_inst_buf.cu.o
+sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
+sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 264 bytes stack frame, 276 bytes spill stores, 280 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 264 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 264 bytes stack frame, 276 bytes spill stores, 280 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 264 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 288 bytes stack frame, 268 bytes spill stores, 268 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 288 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 288 bytes stack frame, 268 bytes spill stores, 268 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 288 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 256 bytes stack frame, 272 bytes spill stores, 276 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 256 bytes stack frame, 272 bytes spill stores, 276 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 272 bytes stack frame, 264 bytes spill stores, 264 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 272 bytes stack frame, 264 bytes spill stores, 264 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 272 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 224 bytes stack frame, 252 bytes spill stores, 248 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 224 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 224 bytes stack frame, 252 bytes spill stores, 248 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 224 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 256 bytes stack frame, 216 bytes spill stores, 232 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 256 bytes stack frame, 216 bytes spill stores, 232 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 216 bytes stack frame, 244 bytes spill stores, 236 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 216 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 216 bytes stack frame, 244 bytes spill stores, 236 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 216 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 256 bytes stack frame, 212 bytes spill stores, 224 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 256 bytes stack frame, 212 bytes spill stores, 224 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 256 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 238 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 238 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 235 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 242 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> [9/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_attn.cu.o
+sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
+sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 12 bytes spill stores, 20 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 772.097 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 12 bytes spill stores, 20 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 764.307 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 20 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 963.047 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 20 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 974.212 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 953.195 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 950.212 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 956.490 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 944.377 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 800.398 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 822.177 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 56 bytes spill stores, 48 bytes spill loads 
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 1017.284 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 56 bytes spill stores, 48 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 1026.028 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 818.620 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 812.880 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 1011.282 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 987.653 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 401.829 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' 
for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 402.088 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 406.141 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 408.398 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 
400.601 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 399.658 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes 
cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 410.496 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 410.007 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 234 
registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 436.062 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 433.903 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
+sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 442.161 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 441.720 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function 
properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 429.612 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 430.207 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 437.492 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 437.610 ms +sage_attention-torch-ext> [9/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_attn.cu.o +sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used +sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' 
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 8 bytes stack frame, 8 bytes spill stores, 16 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 8 bytes stack frame, 8 bytes spill stores, 16 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 16 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes 
cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 16 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] 
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> [10/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_attn_inst_buf.cu.o
+sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
+sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 24 bytes spill stores, 16 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 24 bytes spill stores, 16 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 88 bytes spill stores, 48 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 88 bytes spill stores, 48 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 24 bytes stack frame, 36 bytes spill stores, 20 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 24 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 24 bytes stack frame, 36 bytes spill stores, 20 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 24 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] 
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas 
info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling 
entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> [8/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f16_attn_inst_buf.cu.o +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> [9/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_attn_inst_buf.cu.o sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4] -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 88 bytes stack frame, 156 bytes spill stores, 148 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 88 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 589.821 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 88 bytes stack frame, 156 bytes spill stores, 148 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 88 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 586.041 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 96 bytes spill stores, 92 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 491.060 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 96 bytes spill stores, 92 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 494.720 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 32 bytes spill stores, 24 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 773.047 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 32 bytes spill stores, 24 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 965.297 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 20 bytes spill loads +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' 
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 969.303 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 20 bytes spill loads +sage_attention-torch-ext> ptxas info : Compile time = 567.956 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 
488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 953.702 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 84 bytes spill stores, 76 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 799.895 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 84 bytes spill stores, 76 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 791.529 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 980.682 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 968.691 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 164 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 796.989 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 164 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 785.876 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 112 bytes stack frame, 136 bytes spill stores, 128 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 925.652 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 112 bytes stack frame, 136 bytes spill stores, 128 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 925.953 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Compile time = 555.349 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 8 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 566.194 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 8 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 568.174 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 84 bytes spill stores, 56 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 770.150 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 84 bytes spill stores, 56 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 936.199 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 976.778 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 964.694 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 793.000 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 801.383 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 1027.038 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 1015.406 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 36 bytes spill stores, 32 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 746.223 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 36 bytes spill stores, 32 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 809.856 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 1023.711 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 1014.847 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 387.605 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 506.221 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 400.010 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 529.578 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 415.420 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Compile time = 445.260 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 415.903 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 405.187 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 402.514 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Compile time = 444.036 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 514.511 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 514.736 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 413.632 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Compile time = 443.054 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 364.884 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Compile time = 411.922 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 396.414 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Compile time = 491.946 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 379.967 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Compile time = 499.711 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 395.708 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Compile time = 553.157 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 397.412 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Compile time = 553.380 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 403.622 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 542.357 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 419.375 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 541.216 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 421.370 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Compile time = 545.634 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 399.024 ms
-sage_attention-torch-ext> [10/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/fused/fused.cu.o
+sage_attention-torch-ext> ptxas info : Compile time = 539.954 ms
+sage_attention-torch-ext> [10/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/fused/fused.cu.o
 sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
 sage_attention-torch-ext> ptxas info : 10 bytes gmem
 sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a'
@@ -7273,3029 +7490,2521 @@ sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt
 sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> [9/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_attn.cu.o
+sage_attention-torch-ext> [10/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/fused/fused.cu.o
 sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
-sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 12 bytes spill stores, 20 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 788.068 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 12 bytes spill stores, 20 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 796.846 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 20 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 982.791 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 20 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 992.436 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 967.039 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads -sage_attention-torch-ext> 
ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 955.511 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 948.729 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 947.350 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 804.728 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 44 bytes spill stores, 44 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 688.548 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 56 bytes spill stores, 48 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 604.303 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 56 bytes spill stores, 48 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 591.514 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 478.180 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 474.254 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 585.374 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 575.116 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : 11 bytes gmem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 235.784 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 392 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 49.500 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 233.797 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 
237.642 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 238.282 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 392 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 50.131 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 
488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 265.575 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 42.722 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 229.730 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas 
info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 236.005 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 236.815 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 40.751 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 7.398 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 7.277 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 7.262 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 7.268 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers +sage_attention-torch-ext> ptxas info : Compile time = 7.002 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers +sage_attention-torch-ext> ptxas info : Compile time = 6.967 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers +sage_attention-torch-ext> ptxas info : Compile time = 5.811 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers +sage_attention-torch-ext> ptxas info : Compile time = 5.598 ms 
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 12.524 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 12.467 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 12.386 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj 
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 12.471 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 12.348 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 11.941 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 12.091 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 12.620 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 20.600 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 253.679 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 29 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 14.674 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 253.601 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 259.298 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 257.797 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 14.472 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 253.296 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 29 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 14.295 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 286.895 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 373.298 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 431.989 ms
-sage_attention-torch-ext> [9/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f16_attn_inst_buf.cu.o
-sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
-sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 88 bytes stack frame, 156 bytes spill stores, 148 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 88 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 610.185 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 88 bytes stack frame, 156 bytes spill stores, 148 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 88 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 602.678 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 96 bytes spill stores, 92 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 505.530 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 96 bytes spill stores, 92 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 514.490 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 32 bytes spill stores, 24 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 580.128 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 32 bytes spill stores, 24 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 578.117 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 20 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 578.034 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 20 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 579.730 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 84 bytes spill stores, 76 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 504.559 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 84 bytes spill stores, 76 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 508.363 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 639.247 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 638.366 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 164 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 794.307 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 164 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 870.622 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 112 bytes stack frame, 136 bytes spill stores, 128 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 1100.127 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 112 bytes stack frame, 136 bytes spill stores, 128 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 1076.237 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 19.425 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 392.309 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 29 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 14.192 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 410.184 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 419.790 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 423.421 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 13.751 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 409.855 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 29 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 14.187 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 405.905 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 415.085 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 417.644 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 17.681 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 420.401 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 12.777 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 410.354 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 458.834 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 460.341 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 12.765 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 454.346 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 12.743 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function
'_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 453.572 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 464.660 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 450.779 ms -sage_attention-torch-ext> [10/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f16_fuse_v_scale_attn_inst_buf.cu.o -sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used -sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 56 bytes stack frame, 72 bytes spill stores, 56 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 1026.913 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf 
-sage_attention-torch-ext> 56 bytes stack frame, 72 bytes spill stores, 56 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 1039.187 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 1035.649 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 1034.679 ms 
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 943.433 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 603.724 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' 
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 28 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 608.512 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 28 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 615.357 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 96 bytes 
stack frame, 176 bytes spill stores, 168 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 592.052 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 168 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 855.749 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 1111.245 ms -sage_attention-torch-ext> ptxas 
info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 1130.304 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 884.960 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 906.424 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 1103.580 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 1094.002 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 17.817 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 12.346 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 251 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : 
Compile time = 401.149 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 12.249 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 251 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 391.095 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 
bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 442.479 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 440.891 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 12.044 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj 
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 428.954 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 18.788 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 437.536 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 459.124 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 442.354 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem 
+sage_attention-torch-ext> ptxas info : Compile time = 12.942 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 462.155 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 13.131 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 471.281 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 489.585 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 469.265 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 12.947 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 451.902 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 17.883 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 457.616 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 473.635 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 469.484 ms
-sage_attention-torch-ext> [8/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_attn_inst_buf.cu.o
-sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
-sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 24 bytes spill stores, 16 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 24 bytes spill stores, 16 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 88 bytes spill stores, 48 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 88 bytes spill stores, 48 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 24 bytes stack frame, 36 bytes spill stores, 20 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 24 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 24 bytes stack frame, 36 bytes spill stores, 20 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 24 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 12.814 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 13.122 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 13.020 ms
+sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 37.329 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 34.626 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 26.932 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 27.830 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 5.291 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 20 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 5.140 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 5.046 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 20 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 5.110 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 5.232 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 4.433 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers, 412 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 3.859 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers, 412 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 3.886 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 11.258 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 11.055 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 11.095 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 11.789 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 10.907 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 11.512 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 11.032 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 10.792 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry
function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 16.606 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling 
entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 12.263 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : 
Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> [10/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_fuse_v_mean_attn.cu.o -sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used -sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 44 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 984.592 ms -sage_attention-torch-ext> 
ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 44 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 990.417 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 1003.978 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function 
properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 991.593 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 973.002 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 963.093 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 974.741 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 970.046 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 829.686 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 778.324 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 48 bytes spill stores, 52 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 814.682 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 48 bytes spill stores, 52 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 994.113 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 826.298 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 831.656 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 996.818 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 1000.920 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 14.055 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 245 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes 
cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 406.501 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 14.197 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 245 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 410.054 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 419.738 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 419.590 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 17.163 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 381.081 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 11.638 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 371.843 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 422.365 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 424.444 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 11.375 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 460.547 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 12.070 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 454.437 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 464.828 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 448.633 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 14.833 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 452.369 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 10.874 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 477.124 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 455.888 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 448.456 ms
-sage_attention-torch-ext> [9/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_attn_inst_buf.cu.o
-sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
-sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 68 bytes spill stores, 52 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 68 bytes spill stores, 52 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 8 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 8 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 84 bytes spill stores, 56 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 84 bytes spill stores, 56 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 20 bytes spill stores, 16 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 20 bytes spill stores, 16 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 36 bytes spill stores, 32 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 36 bytes spill stores, 32 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 10.909 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 10.978 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 15.043 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 10.951 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 12.474 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 10.900 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 16.332 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> [10/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_fuse_v_mean_attn.cu.o
-sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
-sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes
cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 20 bytes spill stores, 20 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 20 bytes spill stores, 20 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] 
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes 
cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 11.250 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 245 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function 
properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 11.182 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 245 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : 
Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 11.199 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function 
properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 15.566 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties 
for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 230 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 11.156 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 11.166 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 11.136 ms +sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 39 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 49.535 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 39 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 47.652 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 37.672 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 38.065 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 5.538 ms +sage_attention-torch-ext> 
ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 5.435 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 5.551 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 5.370 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes 
spill loads +sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 5.281 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 5.141 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 18 registers, used 0 barriers, 412 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 4.571 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 18 registers, used 0 barriers, 412 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 4.491 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 11.627 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 11.487 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas 
info : Used 237 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 232 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> [2/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu.o -sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used -sage_attention-torch-ext> 
/build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(627): warning #177-D: variable "padded_kv_len" was declared but never referenced -sage_attention-torch-ext> int qo_len, kv_len, padded_kv_len, num_qo_heads, num_kv_heads; -sage_attention-torch-ext> ^ -sage_attention-torch-ext> -sage_attention-torch-ext> Remark: The warnings can be suppressed with "-diag-suppress " -sage_attention-torch-ext> -sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(803): warning #177-D: variable "padded_kv_len" was declared but never referenced -sage_attention-torch-ext> int qo_len, kv_len, padded_kv_len, num_qo_heads, num_kv_heads; -sage_attention-torch-ext> ^ -sage_attention-torch-ext> -sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(170): warning #177-D: variable "sO" was declared but never referenced -sage_attention-torch-ext> half *sO = (half*)smem_; -sage_attention-torch-ext> ^ -sage_attention-torch-ext> -sage_attention-torch-ext> ptxas info : 28 bytes gmem -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 11.597 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj 
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 164.799 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 11.487 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 162.299 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 167.071 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 167.305 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 11.691 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 
registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 163.529 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 12.329 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 162.413 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info 
: Compile time = 167.845 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 166.829 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 11.763 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 170.956 ms -sage_attention-torch-ext> ptxas info : Compiling entry 
function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 11.627 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 171.643 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 171.202 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 169.275 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 17.892 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 162.832 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 13.243 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 163.102 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 167.108 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 167.067 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 13.226 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 102.579 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 13.182 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 102.599 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 106.308 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 106.308 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 17.771 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 101.930 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 13.294 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 101.802 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 105.487 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 106.022 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compile time = 13.345 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 107.688 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 13.287 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 108.189 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 112.518 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 112.481 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 15.862 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 108.660 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 11.622 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 108.254 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 112.380 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 111.712 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 11.658 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 151.116 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 11.690 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 150.731 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 155.635 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 156.071 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 15.902 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 150.923 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 11.653 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 151.125 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 155.642 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 155.914 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 11.698 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 158.104 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 11.623 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 159.048 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 163.044 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 163.951 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 16.319 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 159.584 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 11.940 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 160.201 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 160.848 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 160.423 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 12.290 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 100.711 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 12.074 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 100.134 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 102.477 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 102.970 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 16.693 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 99.158 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 12.039 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 100.406 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 103.529 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 102.484 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 12.142 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 105.449 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 12.028 ms
+sage_attention-torch-ext> [10/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_attn_inst_buf.cu.o
+sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
+sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 931.635 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 937.720 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 8 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 959.342 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 8 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 931.747 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 84 bytes spill stores, 56 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 912.236 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 84 bytes spill stores, 56 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 972.129 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 974.563 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 974.417 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 817.109 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 783.288 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 1016.236 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 958.440 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 36 bytes spill stores, 32 bytes spill loads 
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 725.558 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 36 bytes spill stores, 32 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 788.638 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 999.569 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 988.327 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 105.326 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 110.221 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 108.746 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 500.264 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 105.144 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 497.664 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 433.531 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 434.755 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 507.129 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 506.114 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 438.820 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 432.929 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 105.226 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> 
ptxas info : Compile time = 109.365 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 109.624 ms -sage_attention-torch-ext> [3/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_attn.cu.o -sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used -sage_attention-torch-ext> ptxas info : 28 bytes gmem, 224 bytes cmem[4] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 526.397 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 428.455 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Compile time = 525.905 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 550.307 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 523.220 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 429.635 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads +sage_attention-torch-ext> ptxas info : Compile time = 529.581 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 533.405 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf 
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 536.842 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 533.635 ms +sage_attention-torch-ext> [2/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_attn_inst_buf.cu.o +sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used +sage_attention-torch-ext> ptxas info : 28 bytes gmem, 224 bytes cmem[4] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 534.947 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 523.143 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 8 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 529.204 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 8 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 534.048 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 84 bytes spill stores, 56 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 530.867 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 84 bytes spill stores, 56 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 530.137 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 535.064 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 533.832 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 519.790 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
+sage_attention-torch-ext> ptxas info : Compile time = 456.808 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 516.317 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Compile time = 450.443 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 564.432 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 566.371 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 36 bytes spill stores, 32 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 452.925 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 36 bytes spill stores, 32 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 453.794 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 562.106 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 563.019 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 507.478 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Compile time = 281.333 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 514.600 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Compile time = 276.018 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 509.573 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 248.588 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 505.110 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 446.349 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 448.854 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 546.233 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 540.883 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 446.249 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 448.848 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 543.742 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 540.598 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 222.762 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 223.708 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 245.696 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 281.997 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 282.024 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 238 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 228.874 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 246.247 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 238 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 230.260 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 246.254 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 236 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 221.065 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 304.576 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 236 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 221.723 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 302.526 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 229 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 226.020 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 311.316 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 226.397 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 312.464 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 241 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes
cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 240.851 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 303.259 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 241 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 240.220 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 304.059 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 247.410 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 305.252 ms +sage_attention-torch-ext> 
ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 245.724 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 308.140 ms +sage_attention-torch-ext> [11/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/qk_int_sv_f16_cuda_sm80.cu.o +sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used +sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4] +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 243 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 239.628 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill 
loads -sage_attention-torch-ext> ptxas info : Used 243 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 239.573 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 245.137 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas 
info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 244.431 ms -sage_attention-torch-ext> [4/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_attn.cu.o -sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used -sage_attention-torch-ext> ptxas info : 28 bytes gmem, 224 bytes cmem[4] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 8 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 448.461 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 8 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 445.120 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 531.424 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 528.424 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 530.435 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill 
loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 528.854 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 530.439 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 532.033 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 462.295 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 464.516 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 569.408 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 568.447 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 462.865 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 463.457 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 560.896 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 561.562 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf 
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 234.511 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 232.415 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 
bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 238.152 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 240.639 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 241 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 233.681 ms 
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 241 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 231.057 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf 
+sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 235.930 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 233.246 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 243 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 247.219 ms 
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 227 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 243 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 248.739 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf 
+sage_attention-torch-ext> ptxas info : Used 217 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 252.486 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 254.085 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 246.876 ms 
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 205 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 247.286 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf 
+sage_attention-torch-ext> ptxas info : Used 205 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 252.920 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 202 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 253.212 ms -sage_attention-torch-ext> [5/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f16_attn_inst_buf.cu.o -sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used -sage_attention-torch-ext> ptxas info : 28 bytes gmem, 224 bytes cmem[4] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 88 bytes stack frame, 156 bytes spill stores, 148 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 88 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 556.352 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' 
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 88 bytes stack frame, 156 bytes spill stores, 148 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 88 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 553.921 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 96 bytes spill stores, 92 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 467.748 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack 
frame, 96 bytes spill stores, 92 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 468.780 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 32 bytes spill stores, 24 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 530.531 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 32 bytes spill stores, 24 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 533.354 ms -sage_attention-torch-ext> ptxas info : 
Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 20 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 536.300 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 20 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 537.186 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 84 bytes spill stores, 76 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 473.500 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 84 bytes spill stores, 76 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 479.248 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 593.492 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 591.683 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 164 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 496.066 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 164 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 493.646 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 112 bytes stack frame, 136 bytes spill stores, 128 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 592.271 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 112 bytes stack frame, 136 bytes spill stores, 128 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 592.168 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 202 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry 
function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes 
stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 198 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 198 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function 
properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 194 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 194 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 185 registers, used 1 barriers, 480 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 185 registers, used 1 barriers, 480 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 246 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill 
loads +sage_attention-torch-ext> ptxas info : Used 246 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 176 registers, used 1 barriers, 480 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 176 registers, used 1 barriers, 480 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 184 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 184 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 180 registers, used 1 barriers, 480 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 180 registers, used 1 barriers, 480 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 177 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill 
loads +sage_attention-torch-ext> ptxas info : Used 177 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 177 registers, used 1 barriers, 480 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 233.115 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 177 registers, used 1 barriers, 480 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 178 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 178 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 226.961 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes 
spill loads +sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 234.252 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 235.465 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 229.104 ms 
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 226.404 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf 
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 233.070 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 235.435 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 253.171 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 249.628 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 256.891 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 256.335 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 248.943 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 250.335 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 257.577 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 255.777 ms
-sage_attention-torch-ext> [6/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_attn_inst_buf.cu.o
-sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
-sage_attention-torch-ext> ptxas info : 28 bytes gmem, 224 bytes cmem[4]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 548.097 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 531.494 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 8 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 541.595 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 8 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 541.311 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 84 bytes spill stores, 56 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 543.318 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 84 bytes spill stores, 56 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 540.357 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 540.372 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 540.675 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 465.859 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 64 bytes spill stores, 48 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 457.031 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 573.905 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 573.667 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 36 bytes spill stores, 32 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 460.213 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 36 bytes spill stores, 32 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 460.097 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 568.269 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 568.915 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 284.879 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 283.571 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 254.400 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
 sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 250.976 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 287.723 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 285.951 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack
frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas 
info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info 
: Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 201 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 205 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 
bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 251.519 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 205 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 250.311 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 207 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 309.154 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 
ptxas info : Used 204 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 307.970 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 204 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 319.757 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 317.137 ms 
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 308.197 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 199 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 307.759 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 199 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 312.691 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 199 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 312.052 ms -sage_attention-torch-ext> [7/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f16_fuse_v_scale_attn_inst_buf.cu.o -sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used -sage_attention-torch-ext> ptxas info : 28 bytes gmem, 224 bytes cmem[4] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 56 bytes stack frame, 72 bytes spill stores, 56 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 568.453 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 56 bytes stack frame, 72 bytes spill stores, 56 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 568.150 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 574.691 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads 
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 577.777 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 555.916 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 554.288 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 564.822 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 28 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 565.896 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 168 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 519.489 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 168 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 509.876 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 608.774 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 610.988 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 496.281 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 497.232 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 598.825 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 597.024 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 199 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 251 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 236.767 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 198 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 251 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 240.665 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 198 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 246.634 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 246.912 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 40 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 32 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 32 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 32 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 32 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 60 bytes spill stores, 68 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 60 bytes spill stores, 68 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 112 bytes stack frame, 68 bytes spill stores, 76 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 112 bytes stack frame, 68 bytes spill stores, 76 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 56 bytes stack frame, 56 bytes spill stores, 60 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 56 bytes stack frame, 56 bytes spill stores, 60 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 96 bytes stack frame, 60 bytes spill stores, 64 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 96 bytes stack frame, 60 bytes spill stores, 64 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 236.834 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 240.000 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 245.898 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 244.002 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 260.691 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 260.304 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 264.469 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32
bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 264.533 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 259.612 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 260.095 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes 
cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 265.757 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 266.410 ms -sage_attention-torch-ext> [8/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_attn_inst_buf.cu.o +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> [3/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_attn.cu.o sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used sage_attention-torch-ext> ptxas info : 28 bytes gmem, 224 bytes cmem[4] -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> 
ptxas info : Compile time = 573.052 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 569.894 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 76 bytes spill stores, 44 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 575.749 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 
'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 76 bytes spill stores, 44 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 576.139 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 565.428 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 16 bytes 
stack frame, 28 bytes spill stores, 12 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 565.432 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 8 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 452.454 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 8 bytes stack frame, 4 bytes spill stores, 8 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 8 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 443.502 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 570.960 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads +sage_attention-torch-ext> ptxas info : Compile time = 532.956 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 567.129 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Compile time = 537.629 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 534.217 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 531.217 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 530.849 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 533.586 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads 
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 479.405 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Compile time = 460.908 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 481.992 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Compile time = 465.682 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 599.426 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Compile time = 
569.037 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 600.968 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 487.144 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' 
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 482.863 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 597.829 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 80 bytes stack 
frame, 44 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 599.207 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Compile time = 573.063 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 465.808 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 
'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 458.740 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 561.391 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes 
stack frame, 32 bytes spill stores, 32 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 563.473 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 304.359 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 234.014 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 303.989 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 232.164 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 310.315 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 237.732 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 
bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 310.421 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 239.603 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 302.690 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 241 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 235.259 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 300.923 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 241 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 230.945 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 310.178 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 237.116 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 312.938 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 234.590 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes 
cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 328.845 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 243 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 248.623 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 327.077 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 243 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 249.389 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 333.822 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 253.105 ms +sage_attention-torch-ext> 
ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 333.667 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 255.004 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 327.900 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 247.180 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 327.791 ms 
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 240 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 248.331 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 334.298 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 253.542 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 333.854 ms -sage_attention-torch-ext> [11/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/qk_int_sv_f16_cuda_sm80.cu.o +sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 252.644 ms +sage_attention-torch-ext> [4/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu.o sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used -sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4] -sage_attention-torch-ext> ptxas info : Compiling 
entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(627): warning #177-D: variable "padded_kv_len" was declared but never referenced +sage_attention-torch-ext> int qo_len, kv_len, padded_kv_len, num_qo_heads, num_kv_heads; +sage_attention-torch-ext> ^ +sage_attention-torch-ext> +sage_attention-torch-ext> Remark: The warnings can be suppressed with "-diag-suppress " +sage_attention-torch-ext> +sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(803): warning #177-D: variable "padded_kv_len" was declared but never referenced +sage_attention-torch-ext> int qo_len, kv_len, padded_kv_len, num_qo_heads, num_kv_heads; +sage_attention-torch-ext> ^ +sage_attention-torch-ext> +sage_attention-torch-ext> /build/source/sage_attention/qattn/qk_int_sv_f8_cuda_sm90.cu(170): warning #177-D: variable "sO" was declared but never referenced +sage_attention-torch-ext> half *sO = (half*)smem_; +sage_attention-torch-ext> ^ +sage_attention-torch-ext> +sage_attention-torch-ext> ptxas info : 28 bytes gmem +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 582.117 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 171.789 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 544.992 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 584.988 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 555.232 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 168.792 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 172.899 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 174.321 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 454.499 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 168.434 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes 
cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 439.407 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 572.154 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 542.048 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 168.177 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 173.675 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 
barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 172.640 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 585.747 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 176.953 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf 
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 559.584 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 597.075 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 568.451 ms -sage_attention-torch-ext> ptxas info : Compiling entry 
function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 177.873 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 176.568 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 168 registers, 
used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 176.119 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 588.464 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 169.429 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf 
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 558.150 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 596.007 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 567.032 ms -sage_attention-torch-ext> ptxas info : Compiling entry 
function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 169.082 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 174.087 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 168 registers, 
used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 174.500 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 227 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 243.884 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 106.503 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf 
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 217 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 236.015 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 265.009 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 256.923 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 106.435 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 109.738 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 
barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 109.539 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 205 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 258.981 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 106.161 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf 
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 205 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 250.514 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 202 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 265.371 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 202 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 256.704 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 105.674 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 109.413 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 109.181 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 266.321 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 111.543 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 255.617 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 284.004 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 275.143 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 112.125 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 116.869 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 117.121 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 198 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 275.742 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 112.579 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 198 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 267.119 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 194 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 284.730 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 194 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 276.077 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 112.528 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 116.400 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb1EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 116.183 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 185 registers, used 1 barriers, 480 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 197.321 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 156.785 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 185 registers, used 1 barriers, 480 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 197.197 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 156.825 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 246 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 206.997 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 161.773 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 246 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 213.504 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 161.445 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 176 registers, used 1 barriers, 480 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 217.085 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 156.797 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 176 registers, used 1 barriers, 480 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 217.001 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 156.440 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 184 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 222.448 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 161.382 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 184 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 216.447 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 161.569 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 180 registers, used 1 barriers, 480 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 222.938 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 164.343 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 180 registers, used 1 barriers, 480 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 220.112 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 163.968 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 177 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 225.064 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 169.839 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 177 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 225.100 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 170.616 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 177 registers, used 1 barriers, 480 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 221.324 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 165.412 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 177 registers, used 1 barriers, 480 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 220.230 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 167 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 166.605 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 178 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 224.575 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 166.927 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj128EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 178 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 224.262 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 168 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 166.854 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 265.648 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem
+sage_attention-torch-ext> ptxas info : Compile time = 102.785 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 265.222 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 272.418 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 271.543 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 103.132 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 106.373 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 106.519 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 244.082 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 103.314 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] 
-sage_attention-torch-ext> ptxas info : Compile time = 245.818 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 270.600 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 271.513 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' 
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 103.217 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 106.862 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode0ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 106.648 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 285.432 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 108.480 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] 
-sage_attention-torch-ext> ptxas info : Compile time = 284.214 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 290.738 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 290.681 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' 
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 108.160 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 113.455 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity3ELS0_3E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 113.071 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 281.864 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 109.215 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb0ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] 
-sage_attention-torch-ext> ptxas info : Compile time = 289.314 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 298.218 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 297.861 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 
'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 109.279 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E13__nv_bfloat16L8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 113.087 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf' for 'sm_90a' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int8_sv_f8_attn_kernelILj64ELj128ELj128ELj64EL16QuantGranularity2ELS0_2E6__halfL8MaskMode1ELb1ELb0EEv14CUtensorMap_stS3_S3_PfS4_S4_PT5_S4_jjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 128 registers, used 1 barriers, 16 bytes cumulative stack size, 128 bytes smem +sage_attention-torch-ext> ptxas info : Compile time = 113.033 ms +sage_attention-torch-ext> [12/12] Linking CXX shared module 
_sage_attention_af2d0c0_dirty.abi3.so +sage_attention-torch-ext> buildPhase completed in 3 minutes 30 seconds +sage_attention-torch-ext> Running phase: installPhase +sage_attention-torch-ext> install flags: -j21 install +sage_attention-torch-ext> [0/1] Install the project... +sage_attention-torch-ext> -- Install configuration: "Release" +sage_attention-torch-ext> -- Installing: /nix/store/2bvjs99wvlawr8lk16ihaa9vsjigppcw-sage_attention-torch-ext/_sage_attention_af2d0c0_dirty/_sage_attention_af2d0c0_dirty.abi3.so +sage_attention-torch-ext> Running phase: fixupPhase +sage_attention-torch-ext> shrinking RPATHs of ELF executables and libraries in /nix/store/2bvjs99wvlawr8lk16ihaa9vsjigppcw-sage_attention-torch-ext +sage_attention-torch-ext> shrinking /nix/store/2bvjs99wvlawr8lk16ihaa9vsjigppcw-sage_attention-torch-ext/sage_attention/_sage_attention_af2d0c0_dirty.abi3.so +sage_attention-torch-ext> checking for references to /build/ in /nix/store/2bvjs99wvlawr8lk16ihaa9vsjigppcw-sage_attention-torch-ext... 
+sage_attention-torch-ext> patching script interpreter paths in /nix/store/2bvjs99wvlawr8lk16ihaa9vsjigppcw-sage_attention-torch-ext +sage_attention-torch-ext> Running phase: installCheckPhase +sage_attention-torch-ext> no Makefile or custom installCheckPhase, doing nothing +sage_attention-torch-ext> Checking of ABI compatibility +sage_attention-torch-ext> 🐍 Checking for compatibility with manylinux_2_28 and Python ABI version 3.9 +sage_attention-torch-ext> ✅ No compatibility issues found +sage_attention-torch-ext> Checking loading kernel with get_kernel +sage_attention-torch-ext> Check whether the kernel can be loaded with get-kernel: sage_attention +sage_attention-torch-ext> [5/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f16_attn_inst_buf.cu.o +sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used +sage_attention-torch-ext> ptxas info : 28 bytes gmem, 224 bytes cmem[4] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 88 bytes stack frame, 156 bytes spill stores, 148 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 88 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 558.324 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 88 bytes stack frame, 156 bytes spill stores, 148 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 88 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 556.812 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 96 bytes spill stores, 92 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 469.632 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 80 bytes stack frame, 96 bytes spill stores, 92 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 471.587 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 32 bytes spill stores, 24 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 536.044 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 32 bytes spill stores, 24 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 535.069 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 20 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 536.844 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 20 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 540.558 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 84 bytes spill stores, 76 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 475.283 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 84 bytes spill stores, 76 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 479.128 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 593.466 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 595.802 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 164 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 495.385 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 164 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 495.046 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 112 bytes stack frame, 136 bytes spill stores, 128 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 594.391 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 112 bytes stack frame, 136 bytes spill stores, 128 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 592.701 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 547.650 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 230.920 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 515.155 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 228.210 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 548.305 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 233.322 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 522.745 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 235.323 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 426.900 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 227.194 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 405.909 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 227.237 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 539.708 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 231.856 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 512.176 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 235.053 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 555.106 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 254.284 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 529.515 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 250.838 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 566.544 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 256.750 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 539.557 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 256.867 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 554.977 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 248.230 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 526.419 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 250.416 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 565.716 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 255.372 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 537.362 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 254.559 ms
+sage_attention-torch-ext> [6/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_attn_inst_buf.cu.o
+sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
+sage_attention-torch-ext> ptxas info : 28 bytes gmem, 224 bytes cmem[4]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 554.827 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 555.616 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 76 bytes spill stores, 44 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 558.617 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 76 bytes spill stores, 44 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 557.441 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 550.462 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 28 bytes spill stores, 12 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 550.848 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 555.254 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 16 bytes spill stores, 12 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 552.498 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 469.257 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 470.105 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads 
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 580.429 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 36 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 584.866 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 473.537 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 52 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 471.687 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 579.194 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 44 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 583.672 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 201 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 242.289 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time 
= 296.443 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 205 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 234.398 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 295.757 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 205 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 248.279 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 300.175 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 207 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] 
-sage_attention-torch-ext> ptxas info : Compile time = 239.625 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 304.243 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 204 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 241.500 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 292.201 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 204 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 234.725 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 289.473 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 247.759 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 300.834 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 239.854 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 300.674 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 199 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] 
-sage_attention-torch-ext> ptxas info : Compile time = 261.146 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 319.536 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 199 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 251.628 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 316.038 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 199 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 267.309 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 321.601 ms +sage_attention-torch-ext> ptxas info : 
Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 199 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 258.554 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 198 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 259.758 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 321.080 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 198 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 252.118 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 314.583 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 315.980 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 267.376 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 321.983 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 259.264 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 
bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 477.204 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 471.498 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 485.235 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' 
for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 40 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 491.403 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 32 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 469.169 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill 
stores, 32 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 472.946 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 32 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 472.230 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 32 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 471.596 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 60 bytes spill stores, 68 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 487.877 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 64 bytes stack frame, 60 bytes spill stores, 68 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 488.877 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 112 bytes stack frame, 68 bytes spill stores, 76 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 499.892 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 112 bytes stack frame, 68 bytes spill stores, 76 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 500.543 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 56 bytes stack frame, 56 bytes spill stores, 60 bytes spill loads -sage_attention-torch-ext> ptxas info 
: Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 490.862 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 56 bytes stack frame, 56 bytes spill stores, 60 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 488.826 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 96 bytes stack frame, 60 bytes spill stores, 64 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 493.915 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 96 bytes stack frame, 60 bytes spill stores, 64 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 493.916 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 321.031 ms +sage_attention-torch-ext> [7/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f16_fuse_v_scale_attn_inst_buf.cu.o +sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used +sage_attention-torch-ext> ptxas info : 28 bytes gmem, 224 bytes cmem[4] +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 56 bytes stack frame, 72 bytes spill stores, 56 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 571.810 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 56 bytes stack frame, 72 bytes spill stores, 56 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 571.291 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 578.094 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 581.878 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads 
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 558.359 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 16 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 553.480 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 28 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 563.480 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 32 bytes spill stores, 28 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 566.917 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 168 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 515.458 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 96 bytes stack frame, 176 bytes spill stores, 168 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 508.550 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 611.868 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 96 bytes stack frame, 92 bytes spill stores, 88 bytes spill loads 
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 619.209 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 499.153 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 80 bytes spill stores, 68 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 495.273 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 600.253 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 60 bytes spill stores, 48 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 597.701 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 245.825 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 251 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 235.006 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 247.184 ms -sage_attention-torch-ext> ptxas 
info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 251 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 239.640 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 253.799 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 245.234 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 253.603 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 245.114 ms 
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 245.420 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 236.012 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 245.985 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 238.932 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info 
: Compile time = 253.939 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 243.348 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 253.712 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 242.750 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 264.742 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 259.797 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 265.494 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 259.453 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 272.510 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 265.116 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] 
-sage_attention-torch-ext> ptxas info : Compile time = 273.021 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 263.742 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 264.256 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 259.722 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 265.618 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 
271.779 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 259.610 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 273.684 ms -sage_attention-torch-ext> [12/12] Linking CXX shared module _sage_attention_44b112f_dirty.abi3.so -sage_attention-torch-ext> buildPhase completed in 3 minutes 37 seconds -sage_attention-torch-ext> Running phase: installPhase -sage_attention-torch-ext> install flags: -j21 install -sage_attention-torch-ext> [0/1] Install the project... 
-sage_attention-torch-ext> -- Install configuration: "Release" -sage_attention-torch-ext> -- Installing: /nix/store/6jmb39fj6a2hjg90bjmrxna5vkivwy8n-sage_attention-torch-ext/_sage_attention_44b112f_dirty/_sage_attention_44b112f_dirty.abi3.so -sage_attention-torch-ext> Running phase: fixupPhase -sage_attention-torch-ext> shrinking RPATHs of ELF executables and libraries in /nix/store/6jmb39fj6a2hjg90bjmrxna5vkivwy8n-sage_attention-torch-ext -sage_attention-torch-ext> shrinking /nix/store/6jmb39fj6a2hjg90bjmrxna5vkivwy8n-sage_attention-torch-ext/sage_attention/_sage_attention_44b112f_dirty.abi3.so -sage_attention-torch-ext> checking for references to /build/ in /nix/store/6jmb39fj6a2hjg90bjmrxna5vkivwy8n-sage_attention-torch-ext... -sage_attention-torch-ext> patching script interpreter paths in /nix/store/6jmb39fj6a2hjg90bjmrxna5vkivwy8n-sage_attention-torch-ext -sage_attention-torch-ext> Running phase: installCheckPhase -sage_attention-torch-ext> no Makefile or custom installCheckPhase, doing nothing -sage_attention-torch-ext> Checking of ABI compatibility -sage_attention-torch-ext> 🐍 Checking for compatibility with manylinux_2_28 and Python ABI version 3.9 -sage_attention-torch-ext> ✅ No compatibility issues found -sage_attention-torch-ext> Checking loading kernel with get_kernel -sage_attention-torch-ext> Check whether the kernel can be loaded with get-kernel: sage_attention -sage_attention-torch-ext> [11/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/qk_int_sv_f16_cuda_sm80.cu.o +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 266.354 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb0ELb1EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 265.664 ms +sage_attention-torch-ext> [11/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/qk_int_sv_f16_cuda_sm80.cu.o sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4] sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' @@ -10810,18 +10519,18 @@ sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_s sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] 
-sage_attention-torch-ext> [12/12] Linking CXX shared module _sage_attention_44b112f_dirty.abi3.so -sage_attention-torch-ext> buildPhase completed in 3 minutes 42 seconds +sage_attention-torch-ext> [12/12] Linking CXX shared module _sage_attention_af2d0c0_dirty.abi3.so +sage_attention-torch-ext> buildPhase completed in 3 minutes 45 seconds sage_attention-torch-ext> Running phase: installPhase sage_attention-torch-ext> install flags: -j21 install sage_attention-torch-ext> [0/1] Install the project... sage_attention-torch-ext> -- Install configuration: "Release" -sage_attention-torch-ext> -- Installing: /nix/store/794gpxvpsn0jkzi2zndpd6i6nhspwid2-sage_attention-torch-ext/_sage_attention_44b112f_dirty/_sage_attention_44b112f_dirty.abi3.so +sage_attention-torch-ext> -- Installing: /nix/store/0pmiqd0nyndanj9rwlfnry7dzb3ad6cs-sage_attention-torch-ext/_sage_attention_af2d0c0_dirty/_sage_attention_af2d0c0_dirty.abi3.so sage_attention-torch-ext> Running phase: fixupPhase -sage_attention-torch-ext> shrinking RPATHs of ELF executables and libraries in /nix/store/794gpxvpsn0jkzi2zndpd6i6nhspwid2-sage_attention-torch-ext -sage_attention-torch-ext> shrinking /nix/store/794gpxvpsn0jkzi2zndpd6i6nhspwid2-sage_attention-torch-ext/sage_attention/_sage_attention_44b112f_dirty.abi3.so -sage_attention-torch-ext> checking for references to /build/ in /nix/store/794gpxvpsn0jkzi2zndpd6i6nhspwid2-sage_attention-torch-ext... -sage_attention-torch-ext> patching script interpreter paths in /nix/store/794gpxvpsn0jkzi2zndpd6i6nhspwid2-sage_attention-torch-ext +sage_attention-torch-ext> shrinking RPATHs of ELF executables and libraries in /nix/store/0pmiqd0nyndanj9rwlfnry7dzb3ad6cs-sage_attention-torch-ext +sage_attention-torch-ext> shrinking /nix/store/0pmiqd0nyndanj9rwlfnry7dzb3ad6cs-sage_attention-torch-ext/sage_attention/_sage_attention_af2d0c0_dirty.abi3.so +sage_attention-torch-ext> checking for references to /build/ in /nix/store/0pmiqd0nyndanj9rwlfnry7dzb3ad6cs-sage_attention-torch-ext... 
+sage_attention-torch-ext> patching script interpreter paths in /nix/store/0pmiqd0nyndanj9rwlfnry7dzb3ad6cs-sage_attention-torch-ext sage_attention-torch-ext> Running phase: installCheckPhase sage_attention-torch-ext> no Makefile or custom installCheckPhase, doing nothing sage_attention-torch-ext> Checking of ABI compatibility @@ -10829,661 +10538,987 @@ sage_attention-torch-ext> 🐍 Checking for compatibility with manylinux_2_28 an sage_attention-torch-ext> ✅ No compatibility issues found sage_attention-torch-ext> Checking loading kernel with get_kernel sage_attention-torch-ext> Check whether the kernel can be loaded with get-kernel: sage_attention -sage_attention-torch-ext> [11/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/qk_int_sv_f16_cuda_sm80.cu.o +sage_attention-torch-ext> [11/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/qk_int_sv_f16_cuda_sm80.cu.o sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4] sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 584.776 ms +sage_attention-torch-ext> ptxas 
info : Compile time = 577.793 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 546.615 ms +sage_attention-torch-ext> ptxas info : Compile time = 540.100 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 588.242 ms +sage_attention-torch-ext> ptxas info : Compile time = 581.375 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 557.461 ms +sage_attention-torch-ext> ptxas info : Compile time = 549.599 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 457.510 ms +sage_attention-torch-ext> ptxas info : Compile time = 449.505 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas 
info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 442.285 ms +sage_attention-torch-ext> ptxas info : Compile time = 436.913 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 574.289 ms +sage_attention-torch-ext> ptxas info : Compile time = 568.259 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf 
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 545.839 ms +sage_attention-torch-ext> ptxas info : Compile time = 539.045 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 590.187 ms +sage_attention-torch-ext> ptxas info : Compile time = 582.831 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info 
: Compile time = 564.103 ms +sage_attention-torch-ext> ptxas info : Compile time = 556.473 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 601.630 ms +sage_attention-torch-ext> ptxas info : Compile time = 593.871 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 571.288 ms +sage_attention-torch-ext> ptxas info : Compile time = 567.721 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 590.861 ms +sage_attention-torch-ext> ptxas info : Compile time = 582.074 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 558.525 ms +sage_attention-torch-ext> ptxas info : Compile time = 553.902 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function 
properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 597.586 ms +sage_attention-torch-ext> ptxas info : Compile time = 591.265 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 568.195 ms +sage_attention-torch-ext> ptxas info : Compile time = 561.780 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 227 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 245.563 ms +sage_attention-torch-ext> ptxas info : Compile time = 241.858 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 217 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 237.397 ms +sage_attention-torch-ext> ptxas info : Compile time = 234.288 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes 
spill loads sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 267.101 ms +sage_attention-torch-ext> ptxas info : Compile time = 264.094 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 257.641 ms +sage_attention-torch-ext> ptxas info : Compile time = 255.340 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 205 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 260.210 ms 
+sage_attention-torch-ext> ptxas info : Compile time = 256.270 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 205 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 249.585 ms +sage_attention-torch-ext> ptxas info : Compile time = 247.355 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 202 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 270.182 ms +sage_attention-torch-ext> ptxas info : Compile time = 266.133 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 202 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 259.223 ms +sage_attention-torch-ext> ptxas info : Compile time = 255.430 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 268.912 ms +sage_attention-torch-ext> ptxas info : Compile time = 267.176 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : 
Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 257.454 ms +sage_attention-torch-ext> ptxas info : Compile time = 254.734 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 285.952 ms +sage_attention-torch-ext> ptxas info : Compile time = 283.957 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 
bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 276.613 ms +sage_attention-torch-ext> ptxas info : Compile time = 275.099 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 198 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 277.856 ms +sage_attention-torch-ext> ptxas info : Compile time = 276.139 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 198 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 268.776 ms 
+sage_attention-torch-ext> ptxas info : Compile time = 267.366 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 194 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 286.969 ms +sage_attention-torch-ext> ptxas info : Compile time = 283.937 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 194 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 277.383 ms +sage_attention-torch-ext> ptxas info : Compile time = 274.177 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 185 registers, used 1 barriers, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 198.436 ms +sage_attention-torch-ext> ptxas info : Compile time = 197.088 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 185 registers, used 1 barriers, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 197.677 ms +sage_attention-torch-ext> ptxas info : Compile time = 196.570 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 246 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 208.144 ms +sage_attention-torch-ext> ptxas info : Compile time = 206.151 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 246 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 215.222 ms +sage_attention-torch-ext> ptxas info : Compile time = 213.025 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 
bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 176 registers, used 1 barriers, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 217.948 ms +sage_attention-torch-ext> ptxas info : Compile time = 215.405 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 176 registers, used 1 barriers, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 217.815 ms +sage_attention-torch-ext> ptxas info : Compile time = 217.156 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 184 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 224.268 ms +sage_attention-torch-ext> ptxas info : Compile time = 222.719 ms 
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 184 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 217.566 ms +sage_attention-torch-ext> ptxas info : Compile time = 215.432 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 180 registers, used 1 barriers, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 222.772 ms +sage_attention-torch-ext> ptxas info : Compile time = 221.762 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' 
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 180 registers, used 1 barriers, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 222.426 ms +sage_attention-torch-ext> ptxas info : Compile time = 220.206 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 177 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 226.758 ms +sage_attention-torch-ext> ptxas info : Compile time = 225.431 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 16 
bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 177 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 227.240 ms +sage_attention-torch-ext> ptxas info : Compile time = 225.805 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 177 registers, used 1 barriers, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 221.637 ms +sage_attention-torch-ext> ptxas info : Compile time = 219.814 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 177 registers, used 1 barriers, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 221.849 ms +sage_attention-torch-ext> ptxas info : Compile time = 220.003 
ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 178 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 226.190 ms +sage_attention-torch-ext> ptxas info : Compile time = 224.826 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 178 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 226.639 ms +sage_attention-torch-ext> ptxas info : Compile time = 224.445 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 266.983 ms +sage_attention-torch-ext> ptxas info : Compile time = 266.497 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 267.092 ms +sage_attention-torch-ext> ptxas info : Compile time = 264.516 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 273.940 ms +sage_attention-torch-ext> ptxas info : Compile time = 273.296 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 273.960 ms +sage_attention-torch-ext> ptxas info : Compile time = 271.292 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf 
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 246.620 ms +sage_attention-torch-ext> ptxas info : Compile time = 245.256 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 247.488 ms +sage_attention-torch-ext> ptxas info : Compile time = 244.541 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time 
= 274.226 ms +sage_attention-torch-ext> ptxas info : Compile time = 273.650 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 274.002 ms +sage_attention-torch-ext> ptxas info : Compile time = 270.650 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 286.593 ms +sage_attention-torch-ext> ptxas info : Compile time = 285.458 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 286.454 ms +sage_attention-torch-ext> ptxas info : Compile time = 285.065 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 293.260 ms +sage_attention-torch-ext> ptxas info : Compile time = 292.409 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function 
properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 293.760 ms +sage_attention-torch-ext> ptxas info : Compile time = 292.646 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 284.209 ms +sage_attention-torch-ext> ptxas info : Compile time = 284.793 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 
bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 293.118 ms +sage_attention-torch-ext> ptxas info : Compile time = 292.497 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 300.652 ms +sage_attention-torch-ext> ptxas info : Compile time = 301.876 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 300.845 ms 
+sage_attention-torch-ext> ptxas info : Compile time = 300.200 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 549.938 ms +sage_attention-torch-ext> ptxas info : Compile time = 552.634 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 517.048 ms +sage_attention-torch-ext> ptxas info : Compile time = 518.553 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 550.878 ms +sage_attention-torch-ext> ptxas info : Compile time = 554.599 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 524.674 ms +sage_attention-torch-ext> ptxas info : Compile time = 526.619 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' 
for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 430.526 ms +sage_attention-torch-ext> ptxas info : Compile time = 433.431 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 409.659 ms +sage_attention-torch-ext> ptxas info : Compile time = 410.252 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 542.771 ms +sage_attention-torch-ext> ptxas info : Compile time = 545.211 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 515.247 ms +sage_attention-torch-ext> ptxas info : Compile time = 517.720 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf 
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 559.112 ms +sage_attention-torch-ext> ptxas info : Compile time = 561.874 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 533.754 ms +sage_attention-torch-ext> ptxas info : Compile time = 533.685 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info 
: Compile time = 571.632 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 571.367 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 545.216 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 544.202 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 559.200 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 559.312 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 533.626 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 532.679 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 571.547 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 570.656 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 544.504 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 542.606 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 201 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 248.386 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 246.739 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 205 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 238.228 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 237.643 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 205 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 254.757 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 252.617 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 207 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 243.618 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 243.258 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 204 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 246.223 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 244.601 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 204 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 237.605 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 236.755 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 252.583 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 252.411 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 245.237 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 242.202 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 199 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 263.119 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 262.965 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 199 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 254.398 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 253.665 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 199 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 270.412 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 269.733 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 199 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 261.949 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 261.897 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 198 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 263.559 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 263.350 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 198 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 254.463 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 254.564 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 271.281 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 268.621 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 263.448 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 260.465 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     32 bytes stack frame, 32 bytes spill stores, 40 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 483.972 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 479.113 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     32 bytes stack frame, 32 bytes spill stores, 40 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 478.741 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 473.189 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     64 bytes stack frame, 32 bytes spill stores, 40 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 492.247 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 484.132 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     64 bytes stack frame, 32 bytes spill stores, 40 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 499.507 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 488.505 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     32 bytes stack frame, 28 bytes spill stores, 32 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 478.163 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 471.002 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     32 bytes stack frame, 28 bytes spill stores, 32 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 480.421 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 474.086 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     64 bytes stack frame, 28 bytes spill stores, 32 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 478.013 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 475.979 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     64 bytes stack frame, 28 bytes spill stores, 32 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 479.704 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 475.733 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     64 bytes stack frame, 60 bytes spill stores, 68 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 492.647 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 490.581 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     64 bytes stack frame, 60 bytes spill stores, 68 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 496.343 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 488.138 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     112 bytes stack frame, 68 bytes spill stores, 76 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 506.130 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 500.404 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     112 bytes stack frame, 68 bytes spill stores, 76 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 505.690 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 499.312 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     56 bytes stack frame, 56 bytes spill stores, 60 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 499.671 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 489.706 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     56 bytes stack frame, 56 bytes spill stores, 60 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 496.812 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 490.358 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     96 bytes stack frame, 60 bytes spill stores, 64 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 507.126 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 495.879 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     96 bytes stack frame, 60 bytes spill stores, 64 bytes spill loads
sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 505.885 ms
+sage_attention-torch-ext> ptxas info    : Compile time = 496.330 ms
sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info    : Compile time = 250.743 ms
-sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 245.881 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 247.385 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 252.723 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 252.201 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 245.891 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 246.237 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 253.792 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext>     32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 253.898 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info    : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext>     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info    : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info    : Compile time = 265.262 ms
+sage_attention-torch-ext> ptxas info    : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
+sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 264.953 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 272.009 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas 
info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 272.403 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 264.150 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 264.396 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 272.029 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' +sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 272.217 ms +sage_attention-torch-ext> [12/12] Linking CXX shared module _sage_attention_af2d0c0_dirty.abi3.so +sage_attention-torch-ext> buildPhase completed in 3 minutes 45 seconds +sage_attention-torch-ext> Running phase: installPhase +sage_attention-torch-ext> install flags: -j21 install +sage_attention-torch-ext> [0/1] Install the project... 
+sage_attention-torch-ext> -- Install configuration: "Release" +sage_attention-torch-ext> -- Installing: /nix/store/mkd1kn188s2i4xnh80z6397w35dcn0b9-sage_attention-torch-ext/_sage_attention_af2d0c0_dirty/_sage_attention_af2d0c0_dirty.abi3.so +sage_attention-torch-ext> [8/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_fuse_v_mean_attn.cu.o +sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used +sage_attention-torch-ext> ptxas info : 28 bytes gmem, 224 bytes cmem[4] +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 36 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 561.017 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 36 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 557.379 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 570.749 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads 
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 567.461 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 555.859 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 556.222 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 551.024 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 551.660 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 488.288 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 490.506 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 48 bytes spill stores, 40 bytes spill loads 
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 593.538 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 80 bytes stack frame, 48 bytes spill stores, 40 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 592.085 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 486.291 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 490.670 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 592.515 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 587.756 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 245 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 240.688 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 245 registers, 
used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 242.615 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 248.619 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 249.134 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 236 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 241.375 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 236 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 240.270 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 246.635 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads +sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 248.753 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
+sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 267.005 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 263.744 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 270.121 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 269.735 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 238 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 263.789 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 238 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 261.852 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 266.241 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 268.157 ms
+sage_attention-torch-ext> Running phase: fixupPhase
+sage_attention-torch-ext> shrinking RPATHs of ELF executables and libraries in /nix/store/mkd1kn188s2i4xnh80z6397w35dcn0b9-sage_attention-torch-ext
+sage_attention-torch-ext> shrinking /nix/store/mkd1kn188s2i4xnh80z6397w35dcn0b9-sage_attention-torch-ext/sage_attention/_sage_attention_af2d0c0_dirty.abi3.so
+sage_attention-torch-ext> [9/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_attn.cu.o
+sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
+sage_attention-torch-ext> ptxas info : 28 bytes gmem, 224 bytes cmem[4]
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 438.883 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 433.179 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 533.744 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 530.959 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 522.545 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 525.052 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 520.863 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 518.957 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 456.437 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 459.820 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 556.469 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 554.382 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 456.700 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 456.712 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 552.185 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 550.740 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
+sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 225.932 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 250.403 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 239 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 227.209 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 258.148 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 238 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 233.616 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 257.065 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 238 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 232.576 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 250.576 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 236 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 227.053 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 249.934 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 236 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 226.230 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 257.861 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 229 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 229.246 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 257.579 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 229 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 226.943 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 269.519 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 241 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 245.074 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 269.729 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 241 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 243.551 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 277.652 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
-sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
+sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 250.879 ms
+sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
+sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 277.765 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 250.206 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 268.882 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 243 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 244.705 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 269.865 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 243 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 245.472 ms +sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 276.602 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' -sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf +sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 251.829 ms +sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' +sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb0ELb0ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 276.939 ms -sage_attention-torch-ext> [12/12] Linking CXX shared module _sage_attention_44b112f_dirty.abi3.so -sage_attention-torch-ext> buildPhase completed in 3 minutes 45 seconds -sage_attention-torch-ext> Running phase: installPhase -sage_attention-torch-ext> install flags: -j21 install -sage_attention-torch-ext> [0/1] Install the project... -sage_attention-torch-ext> -- Install configuration: "Release" -sage_attention-torch-ext> -- Installing: /nix/store/x0vcv18dr2mcj5ih9i3aq3nshydimpca-sage_attention-torch-ext/_sage_attention_44b112f_dirty/_sage_attention_44b112f_dirty.abi3.so -sage_attention-torch-ext> Running phase: fixupPhase -sage_attention-torch-ext> shrinking RPATHs of ELF executables and libraries in /nix/store/x0vcv18dr2mcj5ih9i3aq3nshydimpca-sage_attention-torch-ext -sage_attention-torch-ext> shrinking /nix/store/x0vcv18dr2mcj5ih9i3aq3nshydimpca-sage_attention-torch-ext/sage_attention/_sage_attention_44b112f_dirty.abi3.so -sage_attention-torch-ext> checking for references to /build/ in /nix/store/x0vcv18dr2mcj5ih9i3aq3nshydimpca-sage_attention-torch-ext... 
-sage_attention-torch-ext> patching script interpreter paths in /nix/store/x0vcv18dr2mcj5ih9i3aq3nshydimpca-sage_attention-torch-ext +sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 249.809 ms +sage_attention-torch-ext> checking for references to /build/ in /nix/store/mkd1kn188s2i4xnh80z6397w35dcn0b9-sage_attention-torch-ext... +sage_attention-torch-ext> patching script interpreter paths in /nix/store/mkd1kn188s2i4xnh80z6397w35dcn0b9-sage_attention-torch-ext sage_attention-torch-ext> Running phase: installCheckPhase sage_attention-torch-ext> no Makefile or custom installCheckPhase, doing nothing sage_attention-torch-ext> Checking of ABI compatibility @@ -11491,696 +11526,661 @@ sage_attention-torch-ext> 🐍 Checking for compatibility with manylinux_2_28 an sage_attention-torch-ext> ✅ No compatibility issues found sage_attention-torch-ext> Checking loading kernel with get_kernel sage_attention-torch-ext> Check whether the kernel can be loaded with get-kernel: sage_attention -sage_attention-torch-ext> [11/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/qk_int_sv_f16_cuda_sm80.cu.o +sage_attention-torch-ext> [11/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/qk_int_sv_f16_cuda_sm80.cu.o sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used -sage_attention-torch-ext> ptxas info : 10 bytes gmem, 80 bytes cmem[4] +sage_attention-torch-ext> ptxas info : 11 bytes gmem, 88 bytes cmem[4] sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' 
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 576.356 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 540.449 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 250 
registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 579.980 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 550.877 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 450.748 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 436.310 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 569.668 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 539.170 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 584.251 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes 
cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 556.690 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 596.932 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 567.192 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 585.485 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 555.673 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 595.002 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 568.517 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 227 
registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 245.328 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 217 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 237.060 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 267.650 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 257.376 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 205 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 261.832 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 205 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 248.087 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 202 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 267.565 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 202 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 256.431 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 269.458 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 257.270 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 286.448 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 277.447 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 198 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 278.484 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 198 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 270.991 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 194 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 285.680 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 194 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 279.070 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 185 registers, used 1 barriers, 480 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 197.591 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 185 registers, used 1 barriers, 480 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 199.109 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 246 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 207.939 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 246 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 215.415 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 176 registers, used 1 barriers, 480 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 220.332 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 176 registers, used 1 barriers, 480 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 218.548 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 184 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 226.299 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 184 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 217.250 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 180 registers, used 1 barriers, 480 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 222.585 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 180 registers, used 1 barriers, 480 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 223.601 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 177 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 227.492 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 177 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 227.957 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 177 registers, used 1 barriers, 480 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 222.685 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 177 registers, used 1 barriers, 480 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 222.169 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 178 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 227.019 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 178 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0]
+sage_attention-torch-ext> ptxas info : Compile time = 226.143 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 269.444 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 267.393 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 275.941 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 274.105 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 247.487 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 246.803 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 275.106 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 274.792 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 287.780 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 287.554 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 294.568 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 294.883 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 285.699 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 292.333 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 302.119 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 300.418 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 555.557 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 519.550 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 551.355 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 524.203 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 429.088 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 406.496 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 541.189 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 515.693 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 
bytes spill loads sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 561.090 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 535.625 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 571.384 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 541.083 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 557.912 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 531.773 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 569.410 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes 
cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 541.200 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 201 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 244.572 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 205 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 237.413 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' 
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 205 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 250.707 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 207 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 243.286 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes 
spill loads sage_attention-torch-ext> ptxas info : Used 204 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 243.345 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 204 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 234.974 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 250.862 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 241.827 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 199 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 263.864 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 199 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 253.610 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 199 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 270.959 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 199 registers, used 1 barriers, 32 bytes cumulative 
stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 260.788 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 198 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 262.670 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 198 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 254.072 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' 
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 269.637 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 260.484 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 40 bytes 
spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 479.896 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 40 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 476.006 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 40 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 489.100 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 40 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 495.375 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 32 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 472.317 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 32 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 474.508 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 32 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] +sage_attention-torch-ext> ptxas info : Compile time = 475.359 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 32 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, 
used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 473.822 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 64 bytes stack frame, 60 bytes spill stores, 68 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 488.158 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 64 bytes stack frame, 60 bytes spill stores, 68 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 486.883 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 112 bytes stack frame, 68 bytes spill stores, 76 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 500.040 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 112 bytes stack frame, 68 bytes spill stores, 76 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 501.215 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 56 bytes stack frame, 56 bytes spill stores, 60 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 487.436 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 56 bytes stack frame, 56 bytes spill stores, 60 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 484.705 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 96 bytes stack frame, 60 bytes spill stores, 64 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 493.841 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 96 bytes stack frame, 60 bytes spill stores, 64 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 498.617 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 246.558 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 246.318 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 253.831 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 252.628 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 245.847 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 246.382 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 256.103 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 258.114 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 268.177 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 269.103 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 274.647 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 274.547 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 266.275 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 265.226 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
+sage_attention-torch-ext> ptxas info : Compile time = 274.931 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> [9/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/sm89_qk_int8_sv_f8_accum_f32_fuse_v_scale_fuse_v_mean_attn.cu.o
-sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used
-sage_attention-torch-ext> ptxas info : 28 bytes gmem, 224 bytes cmem[4]
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 36 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 540.842 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 36 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 537.877 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 550.018 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 548.929 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 536.450 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 537.209 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 534.340 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 48 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 48 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 534.521 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 471.004 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 40 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 472.920 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 48 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 573.035 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 80 bytes stack frame, 48 bytes spill stores, 40 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 80 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 571.750 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 469.703 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 473.748 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 571.073 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 567.513 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 245 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 235.378 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 245 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 236.776 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 242.118 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 241.920 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 236 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 235.965 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 236 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 234.354 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 239.999 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode0ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 231 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 242.139 ms
-sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89'
-sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf
-sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
-sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 257.654 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 234 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 255.864 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 260.515 ms -sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 244 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 260.248 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 238 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 255.341 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for 
_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb0ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 238 registers, used 1 barriers, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 254.405 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 233 registers, used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 260.494 ms -sage_attention-torch-ext> ptxas info : Compiling entry function '_Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf' for 'sm_89' -sage_attention-torch-ext> ptxas info : Function properties for _Z24qk_int_sv_f8_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit1EL8MaskMode1ELb1ELb1ELb1ELb0EEvPaS5_S5_PT9_PfS8_S8_S8_S8_jjjjjjjjjjjjjjjf -sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads -sage_attention-torch-ext> ptxas info : Used 233 registers, 
used 1 barriers, 32 bytes cumulative stack size, 488 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 260.957 ms -sage_attention-torch-ext> [12/12] Linking CXX shared module _sage_attention_44b112f_dirty.abi3.so -sage_attention-torch-ext> buildPhase completed in 3 minutes 49 seconds +sage_attention-torch-ext> ptxas info : Compile time = 272.085 ms +sage_attention-torch-ext> [12/12] Linking CXX shared module _sage_attention_af2d0c0_dirty.abi3.so +sage_attention-torch-ext> buildPhase completed in 3 minutes 56 seconds sage_attention-torch-ext> Running phase: installPhase sage_attention-torch-ext> install flags: -j21 install sage_attention-torch-ext> [0/1] Install the project... sage_attention-torch-ext> -- Install configuration: "Release" -sage_attention-torch-ext> -- Installing: /nix/store/wqplrwp0m14wqjzmh3x18a5pgc2kcs8k-sage_attention-torch-ext/_sage_attention_44b112f_dirty/_sage_attention_44b112f_dirty.abi3.so +sage_attention-torch-ext> -- Installing: /nix/store/ki5ldbx0351svgxhqw7y30n8kbi51l55-sage_attention-torch-ext/_sage_attention_af2d0c0_dirty/_sage_attention_af2d0c0_dirty.abi3.so sage_attention-torch-ext> Running phase: fixupPhase -sage_attention-torch-ext> shrinking RPATHs of ELF executables and libraries in /nix/store/wqplrwp0m14wqjzmh3x18a5pgc2kcs8k-sage_attention-torch-ext -sage_attention-torch-ext> shrinking /nix/store/wqplrwp0m14wqjzmh3x18a5pgc2kcs8k-sage_attention-torch-ext/sage_attention/_sage_attention_44b112f_dirty.abi3.so -sage_attention-torch-ext> checking for references to /build/ in /nix/store/wqplrwp0m14wqjzmh3x18a5pgc2kcs8k-sage_attention-torch-ext... 
-sage_attention-torch-ext> patching script interpreter paths in /nix/store/wqplrwp0m14wqjzmh3x18a5pgc2kcs8k-sage_attention-torch-ext +sage_attention-torch-ext> shrinking RPATHs of ELF executables and libraries in /nix/store/ki5ldbx0351svgxhqw7y30n8kbi51l55-sage_attention-torch-ext +sage_attention-torch-ext> shrinking /nix/store/ki5ldbx0351svgxhqw7y30n8kbi51l55-sage_attention-torch-ext/sage_attention/_sage_attention_af2d0c0_dirty.abi3.so +sage_attention-torch-ext> checking for references to /build/ in /nix/store/ki5ldbx0351svgxhqw7y30n8kbi51l55-sage_attention-torch-ext... +sage_attention-torch-ext> patching script interpreter paths in /nix/store/ki5ldbx0351svgxhqw7y30n8kbi51l55-sage_attention-torch-ext sage_attention-torch-ext> Running phase: installCheckPhase sage_attention-torch-ext> no Makefile or custom installCheckPhase, doing nothing sage_attention-torch-ext> Checking of ABI compatibility @@ -12188,1326 +12188,1326 @@ sage_attention-torch-ext> 🐍 Checking for compatibility with manylinux_2_28 an sage_attention-torch-ext> ✅ No compatibility issues found sage_attention-torch-ext> Checking loading kernel with get_kernel sage_attention-torch-ext> Check whether the kernel can be loaded with get-kernel: sage_attention -sage_attention-torch-ext> [10/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/fused/fused.cu.o +sage_attention-torch-ext> [10/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/fused/fused.cu.o sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used sage_attention-torch-ext> ptxas info : 28 bytes gmem sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes 
stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 392 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 33.289 ms +sage_attention-torch-ext> ptxas info : Compile time = 33.288 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 392 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 29.922 ms +sage_attention-torch-ext> ptxas info : Compile time = 29.867 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 25.506 ms +sage_attention-torch-ext> ptxas info : Compile time = 25.591 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 24.229 ms +sage_attention-torch-ext> ptxas info : Compile 
time = 24.254 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 4.507 ms +sage_attention-torch-ext> ptxas info : Compile time = 4.526 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 4.402 ms +sage_attention-torch-ext> ptxas info : Compile time = 4.297 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 4.284 ms +sage_attention-torch-ext> ptxas info : Compile time = 4.232 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : 
Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 4.249 ms +sage_attention-torch-ext> ptxas info : Compile time = 4.226 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers -sage_attention-torch-ext> ptxas info : Compile time = 4.153 ms +sage_attention-torch-ext> ptxas info : Compile time = 4.133 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers -sage_attention-torch-ext> ptxas info : Compile time = 4.124 ms +sage_attention-torch-ext> ptxas info : Compile time = 4.130 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers 
-sage_attention-torch-ext> ptxas info : Compile time = 3.441 ms +sage_attention-torch-ext> ptxas info : Compile time = 3.458 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers -sage_attention-torch-ext> ptxas info : Compile time = 3.396 ms +sage_attention-torch-ext> ptxas info : Compile time = 3.439 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 7.558 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.601 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 7.568 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.550 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 7.634 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.582 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 7.541 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.496 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 7.451 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.461 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for 
_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 7.412 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.480 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 7.383 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.457 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 7.331 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.416 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 12.259 ms +sage_attention-torch-ext> ptxas info : Compile time = 12.196 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 29 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 8.726 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.650 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 8.702 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.684 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 29 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 
8.676 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.661 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 11.785 ms +sage_attention-torch-ext> ptxas info : Compile time = 11.788 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 29 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 8.582 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.503 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem -sage_attention-torch-ext> ptxas info : Compile time = 8.636 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.511 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 29 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 8.604 ms
+sage_attention-torch-ext> ptxas info : Compile time = 8.497 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 11.001 ms
+sage_attention-torch-ext> ptxas info : Compile time = 10.839 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 7.682 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.546 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 8.202 ms
+sage_attention-torch-ext> ptxas info : Compile time = 8.139 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 7.571 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.510 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 10.582 ms
+sage_attention-torch-ext> ptxas info : Compile time = 10.488 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 7.640 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.475 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 7.583 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.415 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 7.689 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.485 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 11.382 ms
+sage_attention-torch-ext> ptxas info : Compile time = 11.177 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 8.123 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.894 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 8.179 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.915 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 8.029 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.774 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 11.805 ms
+sage_attention-torch-ext> ptxas info : Compile time = 11.512 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 8.019 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.750 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 8.021 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.763 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_90a'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 28 registers, used 1 barriers, 132 bytes smem
-sage_attention-torch-ext> ptxas info : Compile time = 7.987 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.726 ms
sage_attention-torch-ext> ptxas info : 28 bytes gmem, 224 bytes cmem[4]
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 22.199 ms
+sage_attention-torch-ext> ptxas info : Compile time = 22.671 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 20.236 ms
+sage_attention-torch-ext> ptxas info : Compile time = 21.188 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 15.886 ms
+sage_attention-torch-ext> ptxas info : Compile time = 16.891 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 15.927 ms
+sage_attention-torch-ext> ptxas info : Compile time = 16.799 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 3.053 ms
+sage_attention-torch-ext> ptxas info : Compile time = 3.351 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 20 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 3.023 ms
+sage_attention-torch-ext> ptxas info : Compile time = 3.234 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 3.037 ms
+sage_attention-torch-ext> ptxas info : Compile time = 3.166 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 20 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 3.014 ms
+sage_attention-torch-ext> ptxas info : Compile time = 3.182 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 2.934 ms
+sage_attention-torch-ext> ptxas info : Compile time = 3.075 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 2.911 ms
+sage_attention-torch-ext> ptxas info : Compile time = 3.058 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers, 412 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 2.539 ms
+sage_attention-torch-ext> ptxas info : Compile time = 2.681 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 20 registers, used 0 barriers, 412 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 2.515 ms
+sage_attention-torch-ext> ptxas info : Compile time = 2.616 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 6.825 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.217 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 6.758 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.133 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 6.799 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.144 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 6.780 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.120 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 6.809 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.172 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 6.821 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.226 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 6.860 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.236 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 6.802 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.177 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 10.542 ms
+sage_attention-torch-ext> ptxas info : Compile time = 11.098 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.751 ms
+sage_attention-torch-ext> ptxas info : Compile time = 8.254 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.764 ms
+sage_attention-torch-ext> ptxas info : Compile time = 8.282 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.734 ms
+sage_attention-torch-ext> ptxas info : Compile time = 8.305 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 10.465 ms
+sage_attention-torch-ext> ptxas info : Compile time = 11.128 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.805 ms
+sage_attention-torch-ext> ptxas info : Compile time = 8.358 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.770 ms
+sage_attention-torch-ext> ptxas info : Compile time = 8.256 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.803 ms
+sage_attention-torch-ext> ptxas info : Compile time = 8.287 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 9.826 ms
+sage_attention-torch-ext> ptxas info : Compile time = 10.295 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 6.861 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.231 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 6.836 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.179 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 6.775 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.187 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 9.283 ms
+sage_attention-torch-ext> ptxas info : Compile time = 9.794 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 6.797 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.276 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 6.852 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.250 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 6.792 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.157 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 9.622 ms
+sage_attention-torch-ext> ptxas info : Compile time = 10.079 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 6.937 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.330 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 6.951 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.326 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 7.403 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.805 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 32 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 9.823 ms
+sage_attention-torch-ext> ptxas info : Compile time = 10.283 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 6.956 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.335 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 6.932 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.300 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 6.920 ms
+sage_attention-torch-ext> ptxas info : Compile time = 7.309 ms
sage_attention-torch-ext> ptxas info : 28 bytes gmem, 224 bytes cmem[4]
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 39 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 29.989 ms
+sage_attention-torch-ext> ptxas info : Compile time = 30.539 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb1E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 39 registers, used 1 barriers, 392 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 27.743 ms
+sage_attention-torch-ext> ptxas info : Compile time = 28.640 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E13__nv_bfloat16EvPT1_PaPfS4_fjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 21.940 ms
+sage_attention-torch-ext> ptxas info : Compile time = 22.592 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z15MeanScaleKernelILj64ELb0E6__halfEvPT1_PaPfS4_fjjjjjjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 260 bytes smem, 432 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 22.046 ms
+sage_attention-torch-ext> ptxas info : Compile time = 22.658 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj128ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 3.158 ms
+sage_attention-torch-ext> ptxas info : Compile time = 3.208 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E13__nv_bfloat16EvPT2_S2_jjjjjjj
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 3.085 ms
+sage_attention-torch-ext> ptxas info : Compile time = 3.167 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_89'
sage_attention-torch-ext> ptxas info : Function properties for
_Z25TransposePadPermuteKernelILj128ELj64ELb1E6__halfEvPT2_S2_jjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 21 registers, used 1 barriers, 32768 bytes smem, 396 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 3.102 ms +sage_attention-torch-ext> ptxas info : Compile time = 3.192 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z25TransposePadPermuteKernelILj64ELj64ELb1E6__halfEvPT2_S2_jjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 19 registers, used 1 barriers, 16384 bytes smem, 396 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 3.056 ms +sage_attention-torch-ext> ptxas info : Compile time = 3.160 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 2.995 ms +sage_attention-torch-ext> ptxas info : Compile time = 3.087 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E13__nv_bfloat16EvPT2_S2_P6__halfjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
sage_attention-torch-ext> ptxas info : Used 16 registers, used 0 barriers, 412 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 2.998 ms +sage_attention-torch-ext> ptxas info : Compile time = 3.044 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj128ELj64ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 18 registers, used 0 barriers, 412 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 2.554 ms +sage_attention-torch-ext> ptxas info : Compile time = 2.654 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z13SubMeanKernelILj64ELj128ELj1E6__halfEvPT2_S2_PS0_jjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 18 registers, used 0 barriers, 412 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 2.556 ms +sage_attention-torch-ext> ptxas info : Compile time = 2.636 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 6.914 ms +sage_attention-torch-ext> ptxas info : Compile time 
= 7.138 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 6.864 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.063 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 6.853 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.123 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 6.860 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.089 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 6.920 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.168 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj32ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 6.835 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.066 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 6.938 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.157 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function 
properties for _Z15QuantInt8KernelILj64ELj16ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 6.867 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.096 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 38 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 10.520 ms +sage_attention-torch-ext> ptxas info : Compile time = 10.879 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.801 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.060 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj 
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 25 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.806 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.066 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.807 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.052 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 40 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 10.550 ms +sage_attention-torch-ext> ptxas info : Compile time = 10.882 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas 
info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.905 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.184 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.925 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.184 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb1E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 8.436 ms +sage_attention-torch-ext> ptxas info : Compile time = 8.697 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile 
time = 9.396 ms +sage_attention-torch-ext> ptxas info : Compile time = 9.729 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 6.844 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.081 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 6.907 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.101 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 6.863 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.084 ms 
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 9.293 ms +sage_attention-torch-ext> ptxas info : Compile time = 9.615 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 6.874 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.109 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 6.941 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.158 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' 
for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb0ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 6.908 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.146 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 9.847 ms +sage_attention-torch-ext> ptxas info : Compile time = 10.199 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 6.972 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.248 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for 
_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.040 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.267 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E13__nv_bfloat16EvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.524 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.713 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj128ELj2ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 33 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 9.966 ms +sage_attention-torch-ext> ptxas info : Compile time = 11.405 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj128ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack 
frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 6.990 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.241 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj128ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 27 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.034 ms +sage_attention-torch-ext> ptxas info : Compile time = 7.283 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj' for 'sm_89' sage_attention-torch-ext> ptxas info : Function properties for _Z15QuantInt8KernelILj64ELj64ELj1ELb1ELb0E6__halfEvPT4_S2_PaPffjjjjjjjjjjj sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 26 registers, used 1 barriers, 132 bytes smem, 432 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 7.001 ms -sage_attention-torch-ext> [11/12] Building CUDA object CMakeFiles/_sage_attention_44b112f_dirty.dir/sage_attention/qattn/qk_int_sv_f16_cuda_sm80.cu.o +sage_attention-torch-ext> ptxas info : Compile time = 7.236 ms +sage_attention-torch-ext> [11/12] Building CUDA object CMakeFiles/_sage_attention_af2d0c0_dirty.dir/sage_attention/qattn/qk_int_sv_f16_cuda_sm80.cu.o sage_attention-torch-ext> nvcc warning : incompatible redefinition for option 'threads', the last value of this option was used sage_attention-torch-ext> ptxas info : 28 bytes 
gmem, 224 bytes cmem[4] sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 545.701 ms +sage_attention-torch-ext> ptxas info : Compile time = 571.316 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 248 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 516.375 ms +sage_attention-torch-ext> ptxas info : Compile time = 536.617 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 549.037 ms +sage_attention-torch-ext> ptxas info : Compile time = 571.602 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 527.582 ms +sage_attention-torch-ext> ptxas info : Compile time = 552.467 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' 
for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 465.316 ms +sage_attention-torch-ext> ptxas info : Compile time = 483.713 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 449.153 ms +sage_attention-torch-ext> ptxas info : Compile time = 463.155 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 472.189 ms +sage_attention-torch-ext> ptxas info : Compile time = 483.172 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 447.628 ms +sage_attention-torch-ext> ptxas info : Compile time = 458.046 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf 
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 487.618 ms +sage_attention-torch-ext> ptxas info : Compile time = 500.637 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 465.347 ms +sage_attention-torch-ext> ptxas info : Compile time = 476.593 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info 
: Compile time = 567.223 ms +sage_attention-torch-ext> ptxas info : Compile time = 583.442 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 547.253 ms +sage_attention-torch-ext> ptxas info : Compile time = 557.485 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 492.125 ms +sage_attention-torch-ext> ptxas info : Compile time = 501.026 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 465.637 ms +sage_attention-torch-ext> ptxas info : Compile time = 475.652 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 497.336 ms +sage_attention-torch-ext> ptxas info : Compile time = 509.803 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas 
info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 474.373 ms +sage_attention-torch-ext> ptxas info : Compile time = 481.269 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 227 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 236.373 ms +sage_attention-torch-ext> ptxas info : Compile time = 240.480 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf 
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 217 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 223.327 ms +sage_attention-torch-ext> ptxas info : Compile time = 228.315 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 249.079 ms +sage_attention-torch-ext> ptxas info : Compile time = 255.751 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] 
-sage_attention-torch-ext> ptxas info : Compile time = 241.504 ms +sage_attention-torch-ext> ptxas info : Compile time = 247.640 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 205 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 242.967 ms +sage_attention-torch-ext> ptxas info : Compile time = 248.706 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 205 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 242.799 ms +sage_attention-torch-ext> ptxas info : Compile time = 247.183 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 202 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 258.238 ms +sage_attention-torch-ext> ptxas info : Compile time = 263.317 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 202 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 248.402 ms +sage_attention-torch-ext> ptxas info : Compile time = 253.588 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 
'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 259.456 ms +sage_attention-torch-ext> ptxas info : Compile time = 265.679 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 250.567 ms +sage_attention-torch-ext> ptxas info : Compile time = 256.168 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf 
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 278.934 ms +sage_attention-torch-ext> ptxas info : Compile time = 284.953 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 269.049 ms +sage_attention-torch-ext> ptxas info : Compile time = 275.347 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 198 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] 
-sage_attention-torch-ext> ptxas info : Compile time = 270.213 ms +sage_attention-torch-ext> ptxas info : Compile time = 274.944 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 198 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 254.206 ms +sage_attention-torch-ext> ptxas info : Compile time = 258.335 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 194 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 277.538 ms +sage_attention-torch-ext> ptxas info : Compile time = 282.089 ms sage_attention-torch-ext> ptxas info : Compiling entry function 
'_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb1EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 194 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 269.246 ms +sage_attention-torch-ext> ptxas info : Compile time = 273.193 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 185 registers, used 1 barriers, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 199.821 ms +sage_attention-torch-ext> ptxas info : Compile time = 203.289 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for 
_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 185 registers, used 1 barriers, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 199.606 ms +sage_attention-torch-ext> ptxas info : Compile time = 203.666 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 246 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 209.401 ms +sage_attention-torch-ext> ptxas info : Compile time = 212.717 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
sage_attention-torch-ext> ptxas info : Used 246 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 210.623 ms +sage_attention-torch-ext> ptxas info : Compile time = 214.296 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 176 registers, used 1 barriers, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 211.729 ms +sage_attention-torch-ext> ptxas info : Compile time = 216.050 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 176 registers, used 1 barriers, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 211.557 ms +sage_attention-torch-ext> ptxas info : Compile time = 215.456 ms sage_attention-torch-ext> ptxas info : Compiling entry 
function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 184 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 210.827 ms +sage_attention-torch-ext> ptxas info : Compile time = 214.066 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 184 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 209.703 ms +sage_attention-torch-ext> ptxas info : Compile time = 212.810 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> 
ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 180 registers, used 1 barriers, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 216.049 ms +sage_attention-torch-ext> ptxas info : Compile time = 220.633 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 180 registers, used 1 barriers, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 216.295 ms +sage_attention-torch-ext> ptxas info : Compile time = 220.663 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes 
spill loads sage_attention-torch-ext> ptxas info : Used 177 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 221.361 ms +sage_attention-torch-ext> ptxas info : Compile time = 225.886 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 177 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 222.072 ms +sage_attention-torch-ext> ptxas info : Compile time = 226.663 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 177 registers, used 1 barriers, 480 bytes cmem[0] -sage_attention-torch-ext> ptxas info : Compile time = 216.433 ms +sage_attention-torch-ext> ptxas info : Compile time = 220.512 ms 
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 177 registers, used 1 barriers, 480 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 216.431 ms
+sage_attention-torch-ext> ptxas info : Compile time = 220.524 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 178 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 219.990 ms
+sage_attention-torch-ext> ptxas info : Compile time = 224.866 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj16ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 178 registers, used 1 barriers, 16 bytes cumulative stack size, 480 bytes cmem[0]
-sage_attention-torch-ext> ptxas info : Compile time = 220.584 ms
+sage_attention-torch-ext> ptxas info : Compile time = 225.433 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 258.976 ms
+sage_attention-torch-ext> ptxas info : Compile time = 264.182 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 257.879 ms
+sage_attention-torch-ext> ptxas info : Compile time = 263.453 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 265.195 ms
+sage_attention-torch-ext> ptxas info : Compile time = 269.958 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 249 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 265.276 ms
+sage_attention-torch-ext> ptxas info : Compile time = 269.032 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 241.031 ms
+sage_attention-torch-ext> ptxas info : Compile time = 244.684 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 241.054 ms
+sage_attention-torch-ext> ptxas info : Compile time = 244.518 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 263.473 ms
+sage_attention-torch-ext> ptxas info : Compile time = 269.924 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 262.351 ms
+sage_attention-torch-ext> ptxas info : Compile time = 269.846 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 275.256 ms
+sage_attention-torch-ext> ptxas info : Compile time = 282.806 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 273.759 ms
+sage_attention-torch-ext> ptxas info : Compile time = 282.493 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 281.644 ms
+sage_attention-torch-ext> ptxas info : Compile time = 289.549 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 282.248 ms
+sage_attention-torch-ext> ptxas info : Compile time = 289.213 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 274.738 ms
+sage_attention-torch-ext> ptxas info : Compile time = 281.922 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 274.730 ms
+sage_attention-torch-ext> ptxas info : Compile time = 281.915 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 282.169 ms
+sage_attention-torch-ext> ptxas info : Compile time = 289.308 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb1E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 282.613 ms
+sage_attention-torch-ext> ptxas info : Compile time = 288.872 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 445.454 ms
+sage_attention-torch-ext> ptxas info : Compile time = 461.987 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 253 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 429.332 ms
+sage_attention-torch-ext> ptxas info : Compile time = 443.316 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 530.361 ms
+sage_attention-torch-ext> ptxas info : Compile time = 545.054 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 435.991 ms
+sage_attention-torch-ext> ptxas info : Compile time = 444.175 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 444.905 ms
+sage_attention-torch-ext> ptxas info : Compile time = 456.380 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 421.719 ms
+sage_attention-torch-ext> ptxas info : Compile time = 432.964 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 512.094 ms
+sage_attention-torch-ext> ptxas info : Compile time = 525.958 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 488.322 ms
+sage_attention-torch-ext> ptxas info : Compile time = 500.678 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 461.929 ms
+sage_attention-torch-ext> ptxas info : Compile time = 475.069 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 440.182 ms
+sage_attention-torch-ext> ptxas info : Compile time = 450.586 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 472.256 ms
+sage_attention-torch-ext> ptxas info : Compile time = 480.912 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 254 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 449.929 ms
+sage_attention-torch-ext> ptxas info : Compile time = 458.569 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 460.100 ms
+sage_attention-torch-ext> ptxas info : Compile time = 470.436 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 250 registers, used 1 barriers, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 437.509 ms
+sage_attention-torch-ext> ptxas info : Compile time = 448.004 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 471.040 ms
+sage_attention-torch-ext> ptxas info : Compile time = 481.605 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 252 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 449.202 ms
+sage_attention-torch-ext> ptxas info : Compile time = 457.319 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 201 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 237.683 ms
+sage_attention-torch-ext> ptxas info : Compile time = 242.204 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 205 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 228.527 ms
+sage_attention-torch-ext> ptxas info : Compile time = 232.154 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 205 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 241.745 ms
+sage_attention-torch-ext> ptxas info : Compile time = 246.571 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 207 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 232.871 ms
+sage_attention-torch-ext> ptxas info : Compile time = 238.084 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 204 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 235.932 ms
+sage_attention-torch-ext> ptxas info : Compile time = 239.558 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 204 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 227.876 ms
+sage_attention-torch-ext> ptxas info : Compile time = 230.949 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 242.745 ms
+sage_attention-torch-ext> ptxas info : Compile time = 245.820 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 234.589 ms
+sage_attention-torch-ext> ptxas info : Compile time = 237.532 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 199 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 253.400 ms
+sage_attention-torch-ext> ptxas info : Compile time = 257.107 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 199 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 245.344 ms
+sage_attention-torch-ext> ptxas info : Compile time = 249.043 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 199 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 260.897 ms
+sage_attention-torch-ext> ptxas info : Compile time = 265.070 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 199 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 252.860 ms
+sage_attention-torch-ext> ptxas info : Compile time = 257.008 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 198 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 254.826 ms
+sage_attention-torch-ext> ptxas info : Compile time = 258.839 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 198 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 245.663 ms
+sage_attention-torch-ext> ptxas info : Compile time = 250.896 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS6_PS2_PT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 261.039 ms
+sage_attention-torch-ext> ptxas info : Compile time = 266.332 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2E6__halfLb0ES2_L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 200 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 251.908 ms
+sage_attention-torch-ext> ptxas info : Compile time = 257.645 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 40 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 460.584 ms
+sage_attention-torch-ext> ptxas info : Compile time = 469.718 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 32 bytes spill stores, 40 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 458.619 ms
+sage_attention-torch-ext> ptxas info : Compile time = 468.044 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 40 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 475.564 ms
+sage_attention-torch-ext> ptxas info : Compile time = 486.454 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 64 bytes stack frame, 32 bytes spill stores, 40 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 476.445 ms
+sage_attention-torch-ext> ptxas info : Compile time = 486.216 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 32 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 461.818 ms
+sage_attention-torch-ext> ptxas info : Compile time = 471.571 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 28 bytes spill stores, 32 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 463.274 ms
+sage_attention-torch-ext> ptxas info : Compile time = 472.500 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 32 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 464.385 ms
+sage_attention-torch-ext> ptxas info : Compile time = 473.836 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 64 bytes stack frame, 28 bytes spill stores, 32 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 466.754 ms
+sage_attention-torch-ext> ptxas info : Compile time = 474.836 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 64 bytes stack frame, 60 bytes spill stores, 68 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 481.417 ms
+sage_attention-torch-ext> ptxas info : Compile time = 490.383 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 64 bytes stack frame, 60 bytes spill stores, 68 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 64 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 478.971 ms
+sage_attention-torch-ext> ptxas info : Compile time = 490.178 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 112 bytes stack frame, 68 bytes spill stores, 76 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 488.346 ms
+sage_attention-torch-ext> ptxas info : Compile time = 499.579 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 112 bytes stack frame, 68 bytes spill stores, 76 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 112 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 489.099 ms
+sage_attention-torch-ext> ptxas info : Compile time = 500.229 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 56 bytes stack frame, 56 bytes spill stores, 60 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 478.063 ms
+sage_attention-torch-ext> ptxas info : Compile time = 489.701 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 56 bytes stack frame, 56 bytes spill stores, 60 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 56 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 477.882 ms
+sage_attention-torch-ext> ptxas info : Compile time = 489.876 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 96 bytes stack frame, 60 bytes spill stores, 64 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 487.670 ms
+sage_attention-torch-ext> ptxas info : Compile time = 498.759 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj128EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 96 bytes stack frame, 60 bytes spill stores, 64 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 96 bytes cumulative stack size, 480 bytes cmem[0], 16 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 485.269 ms
+sage_attention-torch-ext> ptxas info : Compile time = 497.826 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 239.050 ms
+sage_attention-torch-ext> ptxas info : Compile time = 245.600 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 236.372 ms
+sage_attention-torch-ext> ptxas info : Compile time = 245.136 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 241.725 ms
+sage_attention-torch-ext> ptxas info : Compile time = 251.077 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 242.865 ms
+sage_attention-torch-ext> ptxas info : Compile time = 250.828 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 235.112 ms
+sage_attention-torch-ext> ptxas info : Compile time = 242.714 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 235.454 ms
+sage_attention-torch-ext> ptxas info : Compile time = 242.881 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 242.402 ms
+sage_attention-torch-ext> ptxas info : Compile time = 249.941 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode0ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 242.226 ms
+sage_attention-torch-ext> ptxas info : Compile time = 249.919 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 254.271 ms
+sage_attention-torch-ext> ptxas info : Compile time = 262.039 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 253.270 ms
+sage_attention-torch-ext> ptxas info : Compile time = 261.386 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 260.606 ms
+sage_attention-torch-ext> ptxas info : Compile time = 270.534 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity3ELS1_3EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 259.949 ms
+sage_attention-torch-ext> ptxas info : Compile time = 269.921 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2]
-sage_attention-torch-ext> ptxas info : Compile time = 253.865 ms
+sage_attention-torch-ext> ptxas info : Compile time = 262.095 ms
sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80'
sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb0ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf
sage_attention-torch-ext> 0 bytes stack frame, 0
bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 253.525 ms +sage_attention-torch-ext> ptxas info : Compile time = 261.711 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E13__nv_bfloat16L11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_P6__halfPT9_PfSA_SA_S9_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 260.816 ms +sage_attention-torch-ext> ptxas info : Compile time = 269.020 ms sage_attention-torch-ext> ptxas info : Compiling entry function '_Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf' for 'sm_80' sage_attention-torch-ext> ptxas info : Function properties for _Z25qk_int_sv_f16_attn_kernelILj128ELj64ELj32ELj64ELj64EL8DataType1EL16QuantGranularity2ELS1_2EfLb0E6__halfL11ComputeUnit0EL8MaskMode1ELb1ELb0EEvPaS5_PS2_PT9_PfS9_S9_S8_jjjjjjjjjjjjjjjf sage_attention-torch-ext> 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads sage_attention-torch-ext> ptxas info : Used 255 registers, used 1 barriers, 32 bytes cumulative stack size, 480 bytes cmem[0], 8 bytes cmem[2] -sage_attention-torch-ext> ptxas info : Compile time = 259.752 ms 
-sage_attention-torch-ext> [12/12] Linking CXX shared module _sage_attention_44b112f_dirty.abi3.so
-sage_attention-torch-ext> buildPhase completed in 5 minutes 7 seconds
+sage_attention-torch-ext> ptxas info : Compile time = 268.515 ms
+sage_attention-torch-ext> [12/12] Linking CXX shared module _sage_attention_af2d0c0_dirty.abi3.so
+sage_attention-torch-ext> buildPhase completed in 5 minutes 6 seconds
sage_attention-torch-ext> Running phase: installPhase
sage_attention-torch-ext> install flags: -j21 install
sage_attention-torch-ext> [0/1] Install the project...
sage_attention-torch-ext> -- Install configuration: "Release"
-sage_attention-torch-ext> -- Installing: /nix/store/vas14gsv17kdkap3r9szvjsclanmwq25-sage_attention-torch-ext/_sage_attention_44b112f_dirty/_sage_attention_44b112f_dirty.abi3.so
+sage_attention-torch-ext> -- Installing: /nix/store/zrx3aflqjvr10nv91lgyfynpa623nsha-sage_attention-torch-ext/_sage_attention_af2d0c0_dirty/_sage_attention_af2d0c0_dirty.abi3.so
sage_attention-torch-ext> Running phase: fixupPhase
-sage_attention-torch-ext> shrinking RPATHs of ELF executables and libraries in /nix/store/vas14gsv17kdkap3r9szvjsclanmwq25-sage_attention-torch-ext
-sage_attention-torch-ext> shrinking /nix/store/vas14gsv17kdkap3r9szvjsclanmwq25-sage_attention-torch-ext/sage_attention/_sage_attention_44b112f_dirty.abi3.so
-sage_attention-torch-ext> checking for references to /build/ in /nix/store/vas14gsv17kdkap3r9szvjsclanmwq25-sage_attention-torch-ext...
-sage_attention-torch-ext> patching script interpreter paths in /nix/store/vas14gsv17kdkap3r9szvjsclanmwq25-sage_attention-torch-ext
+sage_attention-torch-ext> shrinking RPATHs of ELF executables and libraries in /nix/store/zrx3aflqjvr10nv91lgyfynpa623nsha-sage_attention-torch-ext
+sage_attention-torch-ext> shrinking /nix/store/zrx3aflqjvr10nv91lgyfynpa623nsha-sage_attention-torch-ext/sage_attention/_sage_attention_af2d0c0_dirty.abi3.so
+sage_attention-torch-ext> checking for references to /build/ in /nix/store/zrx3aflqjvr10nv91lgyfynpa623nsha-sage_attention-torch-ext...
+sage_attention-torch-ext> patching script interpreter paths in /nix/store/zrx3aflqjvr10nv91lgyfynpa623nsha-sage_attention-torch-ext
sage_attention-torch-ext> Running phase: installCheckPhase
sage_attention-torch-ext> no Makefile or custom installCheckPhase, doing nothing
sage_attention-torch-ext> Checking of ABI compatibility
@@ -13515,5 +13515,5 @@ sage_attention-torch-ext> 🐍 Checking for compatibility with manylinux_2_28 an
sage_attention-torch-ext> ✅ No compatibility issues found
sage_attention-torch-ext> Checking loading kernel with get_kernel
sage_attention-torch-ext> Check whether the kernel can be loaded with get-kernel: sage_attention
-building '/nix/store/xq28asxbqp6g7x8bcz92xl849prg2899-torch-ext-bundle.drv'...
-building '/nix/store/rkzh9xwk6kdgl1by4xfwmyvb5arpfqby-build-and-copy.drv'...
+building '/nix/store/zcfc2w942q3a6lpp77cmz64zdis9i1dz-torch-ext-bundle.drv'...
+building '/nix/store/q2d20wl8cfvw82mp757i59cvq8z9wmpv-build-and-copy.drv'...